Anonymizing data allows sensitive information to be used responsibly for public benefit.
Protecting research participant privacy is a critical responsibility when working with sensitive or personal data. One of the most effective strategies for safeguarding this information while still enabling sharing and reuse is data anonymization.
What is data anonymization?
Anonymization is the process of permanently removing or altering personally identifiable information (PII) in a dataset so that individuals cannot be identified, either directly or indirectly. Once anonymized, the data should be non-identifiable and the process irreversible.
This differs from pseudonymization, where identifiers are replaced with coded values but can still be reconnected to the individual using a separate key.
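For example, the short Python sketch below (using made-up participant records and a made-up coding scheme) contrasts the two approaches: pseudonymization keeps a separate key table that can re-link codes to individuals, while anonymization discards the direct identifiers entirely.

```python
import secrets

records = [
    {"name": "Participant A", "year_of_birth": 1970, "salary": 90000},
    {"name": "Participant B", "year_of_birth": 1982, "salary": 65000},
]

# Pseudonymization: replace the direct identifier with a random code,
# but keep a separate key table that can re-link codes to individuals.
key_table = {}
pseudonymized = []
for rec in records:
    code = "P-" + secrets.token_hex(4)   # random participant code
    key_table[code] = rec["name"]        # stored separately and securely
    pseudonymized.append({"id": code, **{k: v for k, v in rec.items() if k != "name"}})

# Anonymization: drop the direct identifier entirely; no key exists,
# so the transformation cannot be reversed.
anonymized = [{k: v for k, v in rec.items() if k != "name"} for rec in records]
```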
Why anonymize data?
- To comply with ethical research standards (e.g., the TCPS 2) and legal requirements (e.g., BC's FIPPA).
- To share data openly while preserving participant privacy.
- To reduce institutional risk and safeguard your research reputation.
- To meet Tri-Agency expectations for data stewardship and access.
Techniques for anonymizing data
Effective anonymization often involves multiple techniques used in combination. Common methods include the following (a short sketch applying them appears after this list):
- Suppression: Removing identifying variables (e.g., name, address, health card number).
- Generalization: Replacing specific values with broader categories (e.g., exact age with an age range).
- Masking: Obscuring data with random characters (e.g., replacing an email address with xxxx@domain.com).
- Data perturbation: Adding noise to data to protect confidentiality while maintaining analytical value (e.g., adding random noise of ±$1,000 to an income variable).
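As a rough illustration, the following Python sketch (assuming a small pandas table with made-up column names and values) applies each of these methods in turn; the appropriate categories, masking rules, and noise levels will depend on your dataset and its disclosure risk.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Sally Xi", "Sam Cooper"],
    "email": ["sally@example.org", "sam@example.org"],
    "age": [54, 42],
    "income": [90000, 65000],
})

# Suppression: remove directly identifying variables.
df = df.drop(columns=["name"])

# Generalization: replace exact ages with broader age ranges.
df["age_range"] = pd.cut(df["age"], bins=[0, 29, 49, 69, 120],
                         labels=["<30", "30-49", "50-69", "70+"])
df = df.drop(columns=["age"])

# Masking: obscure the local part of the email address with fixed characters.
df["email"] = df["email"].str.replace(r"^[^@]+", "xxxx", regex=True)

# Perturbation: add random noise of +/- $1,000 to the income variable.
rng = np.random.default_rng(seed=0)
df["income"] = df["income"] + rng.integers(-1000, 1001, size=len(df))

print(df)
```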
Advanced statistical methods like k-anonymity, l-diversity, and differential privacy offer stronger protections and should be considered when working with complex or high-risk datasets.
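As a purely illustrative sketch of the differential privacy idea, the snippet below releases a noisy aggregate count rather than individual records, with Laplace noise scaled to the sensitivity of the query; for real projects, rely on a vetted differential privacy library rather than a hand-rolled mechanism like this.

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(values, predicate, epsilon=1.0):
    """Release a count with Laplace noise (a counting query has sensitivity 1)."""
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

salaries = [90000, 65000, 72000, 58000]
# Approximate number of participants earning over $70,000, with privacy noise added.
print(noisy_count(salaries, lambda s: s > 70000, epsilon=0.5))
```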
Practical examples for anonymizing data
Anonymizing data removes all links to the individual, as well as links across datasets. However, as with all de-identification methods, it may still be possible to re-identify individuals through indirect identifiers and/or links to related datasets.
For example, the following shows a small section of a dataset containing identifiers:
| Name | Address | Postal code | Year of birth | Gender | Occupation | Salary |
|------|---------|-------------|---------------|--------|------------|--------|
| Sally Xi | 123 City Roadway, Vancouver, BC | V5V 1P2 | 1970 | Female | Manager | 90,000 |
| Sam Cooper | 4576 Town Way, Smalltown, BC | V8A 1A5 | 1982 | Male | Electrician | 65,000 |
An anonymized version of that dataset might look like this:
| Postal code | Year of birth | Gender | Occupation | Salary |
|-------------|---------------|--------|------------|--------|
| V5V 1P2 | 1970 | Female | Manager | 90,000 |
| V8A 1A5 | 1982 | Male | Electrician | 65,000 |
In some cases, this may be enough to prevent re-identification. Often, however, anonymized data can still be re-identified with little effort. For example, if there are few electricians in the V8A 1A5 postal code, the record belonging to Sam Cooper carries a strong risk of re-identification.
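One way to screen for this kind of risk is to count how many records share each combination of indirect (quasi-) identifiers; any group of size 1, such as Sam Cooper's, is unique and therefore easy to single out. The pandas sketch below (recreating the example table above) runs this simple k-anonymity-style check; it is a starting point, not a full disclosure-risk assessment.

```python
import pandas as pd

df = pd.DataFrame({
    "postal_code": ["V5V 1P2", "V8A 1A5"],
    "year_of_birth": [1970, 1982],
    "gender": ["Female", "Male"],
    "occupation": ["Manager", "Electrician"],
    "salary": [90000, 65000],
})

quasi_identifiers = ["postal_code", "year_of_birth", "gender", "occupation"]

# Size of each group of records sharing the same quasi-identifier values.
group_sizes = df.groupby(quasi_identifiers).size()

# k is the smallest group size; k == 1 means at least one record is unique
# on its quasi-identifiers and is at high risk of re-identification.
k = group_sizes.min()
print(group_sizes)
print(f"k-anonymity of this table: {k}")
```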
Open source and free software tools for anonymizing data
Researchers are increasingly using algorithm-based tools to help anonymize their data and manage the risk of re-identification. Examples of open-source and free anonymization tools include:
- ARX Data Anonymization Tool: A powerful open-source tool that supports k-anonymity, l-diversity, t-closeness, and differential privacy. Suitable for complex de-identification workflows and risk analysis.
- Amnesia: Developed by the EU's OpenAIRE, Amnesia is a free web-based tool for anonymizing structured data via generalization and suppression. Supports k-anonymity and l-diversity.
Need help? Contact research.data@ubc.ca