Personal data protection and anonymization

Work with personal data

Personal data is any information relating to an identified or identifiable person. Handling such data requires special precautions to comply with the law (see this Research Office page).

If a project involves personal data, it is important to:

  • Document the processes and the uses of the data (data management plan, metadata, code used to process data, etc.)
  • Get approval for the project:
    • from HREC (EPFL Human Research Ethics Committee) or
    • from CER-VD (Commission cantonale d’éthique de la recherche sur l’être humain) for most medical studies.
  • Obtain valid consent from the individuals concerned. Consent must be given explicitly when sensitive personal data is processed.
  • Inform participants about the processing of their personal data
  • Anonymize the data as soon as the purpose of the processing permits
  • Secure the data against any breach
  • Avoid transferring data abroad (if you need to, make sure the transfer complies with the law)

Data anonymization

Why data anonymization matters

Data anonymization offers several advantages. In particular, it helps to:

  • Prevent breaches and misuse of the data
  • Comply with legal obligations
  • Publish the data
  • Make data reusable

Pseudonymization vs anonymization

  • Pseudonymization: data that directly identifies people (names, IP addresses, phone numbers, etc.) is replaced by identifiers or encrypted. The key to the masked data is kept separately and securely. Pseudonymization is good practice when working on personal data: it limits the risks related to a data leak while still allowing the original data to be retrieved.
  • Anonymization: when the time comes to publish the data, pseudonymization is seldom enough. By cross-referencing the data with other sources, it is often possible to reidentify persons in pseudonymized datasets. Several methods reduce these risks (see below). Anonymization often means a loss of information and is not reversible. Completely anonymized data is no longer considered personal data and can be published.
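
As an illustration of the pseudonymization step described above, here is a minimal Python sketch. It assumes records are plain dictionaries; the function name `pseudonymize`, the field name `name` and the sample values are hypothetical and only serve the example. The key table it returns is what would have to be stored separately and securely.

```python
import secrets


def pseudonymize(records, id_field):
    """Replace a direct identifier with a random pseudonym.

    Returns the pseudonymized copies of the records and a key table
    (pseudonym -> original value). The key table is the sensitive part:
    it should be stored apart from the data and access-controlled.
    """
    pseudonym_of = {}  # original value -> pseudonym
    masked_records = []
    for record in records:
        original = record[id_field]
        if original not in pseudonym_of:
            # Random identifier, not derived from the original value
            pseudonym_of[original] = "P-" + secrets.token_hex(8)
        masked = dict(record)  # copy, so the input is left untouched
        masked[id_field] = pseudonym_of[original]
        masked_records.append(masked)
    key_table = {p: o for o, p in pseudonym_of.items()}
    return masked_records, key_table


# Hypothetical sample data
records = [
    {"name": "Alice Example", "age": 34},
    {"name": "Bob Example", "age": 51},
    {"name": "Alice Example", "age": 34},
]
masked_records, key_table = pseudonymize(records, "name")
```

The same person always receives the same pseudonym, so records can still be linked for analysis. Note that the remaining fields (here `age`) are untouched, which is why pseudonymized data may still be reidentifiable.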

Data anonymization is all about the balance between mitigating the risk of reidentification and preserving the utility of the data. The principle of proportionality applies here.

Dilemma of data anonymization

Methods

  • Removing: simply suppressing the data. It is often the appropriate way to handle direct identifiers such as names, phone numbers, email addresses, IP addresses, etc. It is often necessary to suppress some outlier records too.
  • Encrypting: preserve the whole dataset by encrypting the identifying data and keeping the key secure. It is a good option for long-term preservation but not for publishing the data.
  • Generalizing: if the data is too specific and contains unique records, the variables may be generalized to reduce their granularity (for example, replacing an exact age with an age range).
  • Shuffling: sometimes it is possible to shuffle data within one or several columns without compromising the utility of the data. For example, if you shuffle IP addresses, you can still analyse these addresses as a whole, but you can no longer associate a record with its correct IP address.
  • Adding fake data: it is possible to add fake data to a dataset while preserving, for example, its correlation factors. The presence of fake data may prevent individual records from being identified even if it is known that a specific record is part of the dataset.
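
Three of the methods above (removing, generalizing and shuffling) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: records are dictionaries, and the function names, field names, the 10-year band width and the sample values are all hypothetical choices, not recommendations.

```python
import random


def drop_fields(records, fields):
    """Removing: return copies of the records without the given fields."""
    return [{k: v for k, v in r.items() if k not in fields} for r in records]


def generalize_age(age, width=10):
    """Generalizing: map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def shuffle_column(records, field, seed=None):
    """Shuffling: permute the values of one column across the records."""
    rng = random.Random(seed)
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


# Hypothetical sample data (IP addresses from a documentation range)
records = [
    {"name": "Alice Example", "age": 34, "ip": "192.0.2.10"},
    {"name": "Bob Example", "age": 51, "ip": "192.0.2.20"},
    {"name": "Carol Example", "age": 38, "ip": "192.0.2.30"},
]

without_names = drop_fields(records, ["name"])
generalized = [{**r, "age": generalize_age(r["age"])} for r in without_names]
anonymized = shuffle_column(generalized, "ip", seed=42)
```

After these steps the set of IP addresses is unchanged, so aggregate analysis remains possible, but a row's IP address no longer necessarily belongs to that row. Applying such transformations does not by itself guarantee that the result is anonymous; the remaining reidentification risk still has to be assessed.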

There are several measures and privacy models for evaluating the anonymization level of a dataset.

  • k-anonymity: a release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from that of at least k-1 other individuals whose information also appears in the release. source
  • l-diversity: l-diversity is an extension of the k-anonymity model. In addition to making each record indistinguishable from at least k-1 others, it requires every such group of indistinguishable records to contain at least l distinct ("well-represented") values of the sensitive attribute, so that belonging to a group does not reveal that attribute. source
  • t-closeness: t-closeness is a refinement of l-diversity. It requires the distribution of the sensitive attribute within each group of indistinguishable records to be within a distance t of its distribution in the whole dataset. source
  • Differential privacy: differential privacy is a formal guarantee that the result of an analysis changes very little whether or not any single individual's record is included. It is typically obtained by introducing calibrated randomness, for example by adding random noise to query results.
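
To make these notions more concrete, the following Python sketch computes the k of k-anonymity and a simple "distinct" variant of l-diversity for a small table, and adds Laplace noise to a count, which is one common way to obtain differential privacy for counting queries. The choice of quasi-identifiers, the field names, the sample records and the function names are assumptions made for the example.

```python
import random
from collections import defaultdict


def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group: the dataset is k-anonymous for this k."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len(g) for g in groups.values())


def l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any group."""
    groups = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in g}) for g in groups.values())


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a count (sensitivity 1, scale 1/epsilon).

    The difference of two exponential variables with the same rate
    follows a Laplace distribution centred on zero.
    """
    rate = epsilon  # rate = 1 / scale, with scale = 1 / epsilon
    return true_count + rng.expovariate(rate) - rng.expovariate(rate)


# Hypothetical, already generalized sample data
records = [
    {"age": "30-39", "zip": "10**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "10**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "10**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "10**", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "zip"])                 # 2
l = l_diversity(records, ["age", "zip"], "diagnosis")    # 1
flu_count = noisy_count(3, epsilon=1.0, rng=random.Random(0))
```

In this sample every combination of age band and zip prefix appears twice, so the table is 2-anonymous. However, both people in the 40-49 group share the same diagnosis, so knowing that someone belongs to that group reveals their diagnosis: the table is only 1-diverse. A smaller `epsilon` in `noisy_count` means more noise and stronger privacy, at the cost of accuracy.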

Tools

Need more help?

Contact us if you need further advice on data anonymization.

For ethical and legal questions, the Research Office is the main point of contact. It provides several useful resources:

  • Dedicated webpage on research involving work with personal data
  • Ethical issues checklist (login required)

Bibliography:

  • Raghunathan, Balaji. (2013). The complete book of data anonymization: From planning to implementation. Boca Raton: CRC Press. [online at epfl]
  • Soria-Comas, Jordi, Domingo-Ferrer, Josep, & Sánchez, David. (2016). Database anonymization: Privacy models, data utility, and microaggregation-based inter-model connections. Morgan & Claypool. [online at epfl]
  • El Emam, Khaled, & Arbuckle, Luk. (2013). Anonymizing health data: Case studies and methods to get you started. O’Reilly Media. [online at epfl]