Work with personal data
Personal data is any information relating to an identified or identifiable person. Handling such data requires special precautions to comply with the law (see this Research Office page).
If a project involves personal data, it is important to:
- Document the processes and the uses of the data (data management plan, metadata, code used to process data, etc.)
- Get the approval for the project:
- Get the valid consent of the individuals concerned. Consent must be given explicitly when sensitive personal data is processed.
- Inform participants about the processing of their personal data
- Anonymize data as soon as the purpose of the processing permits
- Secure data against any breach
- Avoid transferring data abroad (if you need to, be sure to comply with the law)
Why data anonymization matters
Data anonymization offers several advantages. In particular, it makes it possible to:
- Prevent violations and misuse of the data
- Comply with legal obligations
- Publish the data
- Make data reusable
Pseudonymization vs anonymization
- Pseudonymization: data that directly identifies people (names, IP addresses, phone numbers, etc.) is replaced by identifiers or encrypted. The key to the masked data is kept separately and securely. Pseudonymization is good practice when working with personal data: it limits the risks related to a data leak while still allowing the original data to be retrieved.
- Anonymization: when the time comes to publish the data, pseudonymization is seldom enough. By cross-referencing data, it is often possible to reidentify people in pseudonymized datasets. Several methods reduce this risk (see below). Anonymization usually involves a loss of information and is not reversible. Completely anonymized data is no longer considered personal data and can be published.
Data anonymization is all about the balance between mitigating the risk of reidentification and preserving the utility of the data. The principle of proportionality applies here.
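The pseudonymization step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `pseudonymize` function, the `P-` prefix, and the sample records are hypothetical and do not come from any particular tool.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymized records and a key table mapping each
    pseudonym back to the original value. The key table is what must be
    stored separately and securely.
    """
    assigned = {}   # original value -> pseudonym
    key_table = {}  # pseudonym -> original value
    output = []
    for record in records:
        original = record[id_field]
        if original not in assigned:
            # Random token, not derived from the original value.
            pseudonym = "P-" + secrets.token_hex(8)
            assigned[original] = pseudonym
            key_table[pseudonym] = original
        masked = dict(record)  # copy, so the input is left untouched
        masked[id_field] = assigned[original]
        output.append(masked)
    return output, key_table


# Hypothetical sample data
records = [
    {"name": "Alice Martin", "age": 34, "diagnosis": "asthma"},
    {"name": "Bob Keller", "age": 51, "diagnosis": "diabetes"},
    {"name": "Alice Martin", "age": 34, "diagnosis": "flu"},
]

pseudonymized, key_table = pseudonymize(records, "name")
```

The same person always receives the same pseudonym, so records can still be linked during analysis. Note that the remaining columns (age, diagnosis) are unchanged, which is why pseudonymized data may still allow reidentification.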
Dilemma of data anonymization
Several methods can be used to anonymize data:
- Removing: simply suppressing the data. It is often the appropriate way to handle direct identifiers such as names, phone numbers, email addresses, IP addresses, etc. It is often also necessary to suppress some outlier records.
- Encrypting: preserving the whole dataset by encrypting the identifying data and keeping the key secure. It is a good option for long-term preservation but not for publishing the data.
- Generalizing: if the data is too specific and contains unique records, the variables may be generalized to reduce their granularity.
- Shuffling: sometimes it is possible to shuffle data within one or several columns without compromising the utility of the data. For example, if you shuffle IP addresses, you can still analyse these addresses as a whole, but you can no longer associate a record with its correct IP address.
- Adding fake data: it is possible to add fake data to a dataset while preserving, for example, correlation factors. The presence of fake data may prevent individual records from being identified, even if it is known that a specific record is part of the dataset.
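Three of the methods listed above (removing, generalizing, and shuffling) can be sketched as follows. This is a simplified illustration, not a complete anonymization procedure: the helper names, the 10-year age bands, and the sample rows are assumptions made for the example.

```python
import random


def remove_columns(rows, columns):
    """Removing: drop direct identifiers entirely."""
    return [{k: v for k, v in row.items() if k not in columns} for row in rows]


def generalize_age(age, width=10):
    """Generalizing: replace an exact age with an age band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shuffle_column(rows, column, seed=None):
    """Shuffling: permute one column so values no longer match their records."""
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


# Hypothetical sample data (IP addresses from documentation ranges)
rows = [
    {"name": "Alice Martin", "age": 34, "ip": "192.0.2.10", "pages": 12},
    {"name": "Bob Keller", "age": 51, "ip": "198.51.100.7", "pages": 3},
    {"name": "Carla Rossi", "age": 38, "ip": "203.0.113.25", "pages": 8},
]

without_names = remove_columns(rows, {"name"})
generalized = [{**row, "age": generalize_age(row["age"])} for row in without_names]
# A fixed seed is used here only to make the example reproducible.
anonymized = shuffle_column(generalized, "ip", seed=42)
```

After these steps the set of IP addresses is unchanged, so aggregate analysis of the addresses remains possible, but an address can no longer be reliably attributed to a given row.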
Several measures can be used to evaluate the anonymization level of a dataset:
- k-anonymity: a release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release (source).
- l-diversity: l-diversity is an extension of the k-anonymity model which reduces the granularity of data representation using techniques including generalization and suppression such that any given record maps onto at least k-1 other records in the data (source). In addition, it requires each group of indistinguishable records to contain at least l distinct values of the sensitive attribute.
- t-closeness: t-closeness is a refinement of l-diversity group-based anonymization that is used to preserve privacy in datasets by reducing the granularity of a data representation (source). It requires the distribution of a sensitive attribute within each group to differ from its distribution in the whole dataset by no more than a threshold t.
- Differential privacy: differential privacy is an approach that introduces controlled randomness into the data or into the results computed from it, so that the presence or absence of any single individual has only a limited effect on what is released.
Several tools can help with anonymization:
- sdcMicro: Statistical Disclosure Control Methods for Anonymization of Microdata and Risk Estimation (R package)
- ARX Data Anonymization Tool: Java application
- ARGUS: Java application
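For small tables, the k-anonymity and l-diversity measures described above can also be checked by hand. The sketch below is a minimal illustration and not a replacement for the tools listed: the function names and sample rows are hypothetical, the choice of quasi-identifiers (`age`, `zip`) is an assumption, and l-diversity is computed in its simplest form (number of distinct sensitive values per group).

```python
from collections import defaultdict


def group_by_quasi_identifiers(rows, quasi_identifiers):
    """Group rows that share the same values for all quasi-identifiers."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row)
    return groups


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of indistinguishable records."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = group_by_quasi_identifiers(rows, quasi_identifiers)
    return min(len({row[sensitive] for row in group}) for group in groups.values())


# Hypothetical, already generalized sample data
rows = [
    {"age": "30-39", "zip": "10**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "10**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "10**", "diagnosis": "asthma"},
    {"age": "50-59", "zip": "12**", "diagnosis": "diabetes"},
    {"age": "50-59", "zip": "12**", "diagnosis": "diabetes"},
]

k = k_anonymity(rows, ["age", "zip"])               # smallest group has 2 rows
l = l_diversity(rows, ["age", "zip"], "diagnosis")  # one group has a single diagnosis
```

In this example the table is 2-anonymous but only 1-diverse: everyone in the second group has the same diagnosis, so knowing that a person belongs to that group reveals their diagnosis even though the person cannot be singled out.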
Need more help?
You can contact us if you need further advice about data anonymization.
For ethical and legal questions, the Research Office is the main point of contact. It provides several useful resources:
- Dedicated webpage on research involving personal data
- Ethical issues checklist (login required)
- Raghunathan, Balaji. (2013). The complete book of data anonymization: From planning to implementation. Boca Raton: CRC Press. [online at epfl]
- Jordi Soria-Comas, Josep Domingo-Ferrer, & David Sánchez. (2016). Database anonymization: Privacy models, data utility, and microaggregation-based inter-model connections. Morgan & Claypool. [online at epfl]
- Khaled El Emam & Luk Arbuckle. (2013). Anonymizing health data: Case studies and methods to get you started. O’Reilly Media. [online at epfl]