Data anonymization

Anonymization vs Pseudonymization

According to the Federal Data Protection and Information Commissioner (FDPIC), personal data are pseudonymized when the identifying data are replaced by a code (a pseudonym), while they are anonymized when all identifying data are removed.

Pseudonymization is reversible while anonymization is definitive.
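As a concrete illustration, here is a minimal Python sketch of the difference, using invented records and field names: pseudonymization keeps a separate lookup table that allows re-identification, while anonymization simply discards the identifying attribute.

```python
import secrets

# Invented example records; names and attributes are purely illustrative.
records = [
    {"name": "Alice Keller", "city": "Lugano", "diagnosis": "asthma"},
    {"name": "Bruno Frei", "city": "Neuchâtel", "diagnosis": "diabetes"},
]

# Pseudonymization: replace the identifier with a random code and keep a
# separate lookup table. Whoever holds the table can reverse the step.
lookup = {}
pseudonymized = []
for rec in records:
    code = secrets.token_hex(4)
    lookup[code] = rec["name"]  # re-identification key, to be stored separately
    pseudonymized.append(
        {"id": code, "city": rec["city"], "diagnosis": rec["diagnosis"]}
    )

# Anonymization: remove the identifying attribute entirely. No lookup table
# exists, so the removal cannot be undone.
anonymized = [{"city": r["city"], "diagnosis": r["diagnosis"]} for r in records]

print(pseudonymized)
print(anonymized)
```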

Irreversibly anonymized data, which no longer allow the re-identification of a person, are not subject to the regulations on the protection of personal data.

In reality, it is very difficult to guarantee 100% anonymization and any technique has its advantages and limitations and will always involve some risk of re-identification of the data subject(s).

The section below provides more details, in particular on anonymization techniques, based on a document published by the European Union Article 29 Data Protection Working Party.

Disclaimer: The text below has been developed by the Università della Svizzera italiana in collaboration with the University of Neuchâtel. Note that all the text here, except third-party contents (e.g. quotes), is published under the Creative Commons Attribution Share Alike 4.0 International License. To view a copy of this license, visit this page.

Anonymization techniques

“A technique is considered robust based on three criteria:

  • is it still possible to single out an individual?
  • is it still possible to link records relating to an individual?
  • can information be inferred concerning an individual?

These are defined by the European Union Article 29 Data Protection Working Party as risks of identification.

They conclude that anonymisation techniques can provide privacy guarantees and may be used to generate efficient anonymisation processes, but only if their application is engineered appropriately – which means that the prerequisites (context) and the objective(s) of the anonymisation process must be clearly set out in order to achieve the targeted anonymisation while producing some useful data. The optimal solution should be decided on a case-by-case basis, possibly by using a combination of different techniques, while taking into account the practical recommendations developed by Article 29.

Broadly speaking, there are two main approaches to anonymization:

  • an approach based on randomization,
  • an approach based on generalization.”

Randomization

“Randomization is a family of techniques that alters the veracity of the data in order to remove the strong link between the data and the individual. If the data are sufficiently uncertain then they can no longer be referred to a specific individual. Randomization by itself will not reduce the singularity of each record as each record will still be derived from a single data subject but may protect against inference attacks/risks.”

Different techniques may be combined to avoid re-identification of individual subjects. Randomization techniques can also be combined with generalization techniques to provide stronger privacy guarantees. 

“The technique of noise addition is especially useful when attributes may have an important adverse effect on individuals and consists of modifying attributes in the dataset such that they are less accurate whilst retaining the overall distribution. When processing a dataset, an observer will assume that values are accurate but this will only be true to a certain degree. As an example, if an individual’s height was originally measured to the nearest centimetre the anonymised dataset may contain a height accurate to only ±10cm. If this technique is applied effectively, a third-party will not be able to identify an individual nor should he be able to repair the data or otherwise detect how the data have been modified.

Noise addition will commonly need to be combined with other anonymisation techniques such as the removal of obvious attributes and quasi-identifiers. The level of noise should depend on the necessity of the level of information required and the impact on individuals’ privacy as a result of disclosure of the protected attributes.”
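As a rough illustration of the height example above, the following Python sketch (with invented measurements) adds uniform noise of up to ±10 cm to each record:

```python
import random

# Invented sample: heights originally measured to the nearest centimetre.
heights_cm = [158, 163, 171, 171, 180, 192]

# Add uniform noise of up to ±10 cm to every record. Each value becomes less
# accurate, while the overall distribution is roughly preserved (the noise
# has zero mean).
noisy_heights = [h + random.randint(-10, 10) for h in heights_cm]

print(noisy_heights)
```

Because the noise is zero-mean, aggregate statistics such as the average height remain approximately correct even though no individual value can be trusted.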

“Permutation consists in shuffling the values of attributes in a table so that some of them are artificially linked to different data subjects. It is useful when it is important to retain the exact distribution of each attribute within the dataset. (…) Permutation techniques alter values within the dataset by just swapping them from one record to another. Such swapping will ensure that the range and distribution of values remain the same, but the correlations between values and individuals will not. If two or more attributes have a logical relationship or statistical correlation and are permuted independently, such a relationship will be destroyed. It may therefore be important to permute a set of related attributes together, so as not to break the logical relationship; otherwise an attacker could identify the permuted attributes and reverse the permutation.

For instance, if we consider a subset of attributes in a medical dataset such as “reasons for hospitalization/symptoms/department in charge”, a strong logical relationship will link the values in most cases and permutation of only one of the values would thus be detected and could even be reversed.”
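A minimal Python sketch of this recommendation, using invented toy records: the logically related attributes are permuted together as a pair, so their internal relationship is preserved while the link to the patient is broken.

```python
import random

# Invented toy records with a logically related pair of attributes.
records = [
    {"patient": "P1", "symptom": "chest pain", "department": "cardiology"},
    {"patient": "P2", "symptom": "fracture", "department": "orthopaedics"},
    {"patient": "P3", "symptom": "migraine", "department": "neurology"},
]

# Permute the related attributes as a block: each (symptom, department) pair
# moves between patients together, so the logical link between the two
# values survives the shuffle.
pairs = [(r["symptom"], r["department"]) for r in records]
random.shuffle(pairs)
for rec, (symptom, department) in zip(records, pairs):
    rec["symptom"], rec["department"] = symptom, department

print(records)
```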

“Differential privacy falls within the family of randomization techniques, but with a different approach: while noise insertion comes into play before the dataset is released, differential privacy can be used when the data controller generates anonymised views of a dataset whilst retaining a copy of the original data. Such anonymised views would typically be generated through a subset of queries for a particular third party. The subset includes some random noise deliberately added ex-post. Differential privacy tells the data controller how much noise he needs to add, and in which form, to get the necessary privacy guarantees.”
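The quoted text does not prescribe a specific mechanism; the sketch below uses the standard Laplace mechanism for a count query (invented data, and an arbitrary choice of ε = 0.5) to illustrate how calibrated noise is added to each query answer rather than to the stored dataset:

```python
import random

ages = [34, 41, 29, 57, 62, 38, 45]  # invented raw data kept by the controller

def noisy_count(data, predicate, epsilon=0.5):
    """Answer a count query with Laplace noise; a count has sensitivity 1."""
    true_count = sum(1 for x in data if predicate(x))
    # The difference of two exponentials with rate epsilon follows a
    # Laplace(0, 1/epsilon) distribution, which is the scale the Laplace
    # mechanism requires for an epsilon-differentially-private count.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# The third party only ever receives noisy answers; the raw data stay private.
print(noisy_count(ages, lambda a: a >= 40))
```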

Generalization

“Generalization is the second family of anonymisation techniques. This approach consists of generalizing, or diluting, the attributes of data subjects by modifying the respective scale or order of magnitude (i.e. a region rather than a city, a month rather than a week). Whilst generalization can be effective to prevent singling out, it does not allow effective anonymisation in all cases; in particular, it requires specific and sophisticated quantitative approaches to prevent linkability and inference.”

“Aggregation and K-anonymity techniques aim at preventing a data subject from being singled out by grouping them with, at least, k other individuals. To achieve this, the attribute values are generalized to an extent that each individual shares the same value.

For example, by lowering the granularity of a location from a city to a country a higher number of data subjects are included. Individual dates of birth can be generalized into a range of dates, or grouped by month or year.

Other numerical attributes (e.g. salaries, weight, height, or the dose of a medicine) can be generalized by interval values (e.g. salary €20,000 – €30,000). These methods may be used when the correlation of punctual values of attributes may create quasi-identifiers.”

European Union Article 29 Data Protection Working Party, 0829/14/EN WP216, Opinion 05/2014 on Anonymisation Techniques, 2014, p. 12.
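To make the aggregation example above concrete, here is a small Python sketch (with invented years of birth and salaries) that generalizes both quasi-identifiers and then checks the resulting k, i.e. the size of the smallest group sharing the same generalized values:

```python
from collections import Counter

# Invented records: (year of birth, salary in EUR).
people = [(1984, 23_500), (1987, 28_900), (1985, 21_700),
          (1991, 47_200), (1990, 44_100), (1993, 41_800)]

def generalize(year, salary):
    """Dilute both quasi-identifiers: decade of birth and a €10,000 salary band."""
    decade = f"{year // 10 * 10}s"
    low = salary // 10_000 * 10_000
    band = f"€{low:,} – €{low + 10_000:,}"
    return (decade, band)

generalized = [generalize(y, s) for y, s in people]

# k-anonymity check: every combination of generalized quasi-identifier
# values must be shared by at least k individuals.
k = min(Counter(generalized).values())
print(generalized)
print("k =", k)  # here k = 3: each person hides in a group of three
```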

Strengths and Weaknesses of the Techniques

| Technique | Is singling out still a risk? | Is linkability still a risk? | Is inference still a risk? |
| --- | --- | --- | --- |
| Pseudonymisation | Yes | Yes | Yes |
| Noise addition | Yes | May not | May not |
| Substitution | Yes | Yes | May not |
| Aggregation or K-anonymity | No | Yes | Yes |
| L-diversity | No | Yes | May not |
| Differential privacy | May not | May not | May not |

More details

For more details on anonymization techniques, please refer to the Article 29 Data Protection Working Party's Opinion 05/2014 on Anonymisation Techniques.

What to be aware of / pay attention to in data protection?

Believing that a pseudonymised dataset is anonymised: data controllers often assume that removing or replacing one or more attributes is enough to make the dataset anonymous. Many examples have shown that this is not the case; simply altering the ID does not prevent someone from identifying a data subject if quasi-identifiers remain in the dataset, or if the values of other attributes are still capable of identifying an individual. In many cases it can be as easy to identify an individual in a pseudonymised dataset as with the original data. Extra steps should be taken in order to consider the dataset as anonymised, including removing and generalising attributes, or deleting the original data, or at least bringing them to a highly aggregated level.
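A minimal sketch of why this matters, with invented records: an attacker who holds auxiliary data (here a hypothetical public register) can re-identify a pseudonymised record by joining on the remaining quasi-identifiers.

```python
# Pseudonymized release: the name was replaced by a code, but the
# quasi-identifiers (postal code, year of birth) remain in the clear.
released = [
    {"id": "a41f", "zip": "6900", "born": 1984, "diagnosis": "asthma"},
    {"id": "9c2e", "zip": "2000", "born": 1991, "diagnosis": "diabetes"},
]

# Auxiliary data an attacker may already hold (e.g. a public register).
register = [{"name": "Alice Keller", "zip": "6900", "born": 1984}]

# Linkage attack: join the two sources on the quasi-identifiers.
for rec in released:
    for person in register:
        if (rec["zip"], rec["born"]) == (person["zip"], person["born"]):
            print(person["name"], "->", rec["diagnosis"])
```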