Data Anonymization | Research Data Management

Protecting sensitive data is not only a legal obligation—it’s a commitment to ethical, trustworthy, and responsible research.

Sharing research data is an important scholarly practice. Still, when working with sensitive or personal information, appropriate de-identification techniques are essential to protect participant confidentiality and comply with ethical and legal requirements.

Download our Introduction to Data Anonymization guide as a PDF

See our hands-on Introduction to Data Anonymization workshop, with exercises, on GitHub

Introduction to Data Anonymization

Sensitive data requires careful handling to protect participant privacy and confidentiality, and to ensure compliance with ethical and legal obligations. Improper management or disclosure of sensitive data can expose individuals, communities, and animals to significant harm.

What is sensitive data?

Under UBC’s Information Technology Standard U1, sensitive data is information and data classified as medium risk, high risk, or very high risk. In broader research contexts, sensitive data could be defined as “information that must be safeguarded against unwarranted access or disclosure” and may apply to both human and animal research.

Sensitive data may involve personal health details, geographical information about endangered species, or information governed by institutional policy.

Why do we anonymize data?

Data anonymization reduces the risk of harm by making it harder to re-identify participants. Re-identification may occur when, with reasonable effort, an individual can be isolated in a dataset and linked to information that identifies them. The potential harm depends on population characteristics, research topic, and context.

Types of identifying information

Variables in a dataset fall into four categories, according to how readily they identify a participant:

Direct identifiers
Indirect identifiers
Non-identifiers
Hidden identifiers

Each category carries different risks and requires different approaches to anonymization.

Direct identifiers

Direct identifiers immediately reveal a participant’s identity and must always be removed before sharing or publishing data. Examples include:

Names and initials
Full or partial addresses
Email addresses
Vehicle identifiers
Biometric data
Audio recordings containing identifiable voices
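As a minimal sketch of removing direct identifiers from tabular data, the Python example below drops a set of columns from each record. The column names (name, email, address, licence_plate) and the sample records are hypothetical; replace them with the direct identifiers present in your own dataset.

```python
# Hypothetical column names; adjust to match your own dataset.
DIRECT_IDENTIFIERS = {"name", "email", "address", "licence_plate"}


def drop_direct_identifiers(rows, direct_identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows with direct-identifier columns removed."""
    return [
        {key: value for key, value in row.items() if key not in direct_identifiers}
        for row in rows
    ]


# Made-up sample records for illustration only.
participants = [
    {"name": "A. Example", "email": "a@example.org", "income": 52000, "rating": 4},
    {"name": "B. Example", "email": "b@example.org", "income": 31000, "rating": 2},
]

cleaned = drop_direct_identifiers(participants)
print(cleaned)
```

Dropping columns only handles identifiers stored in their own fields. Free-text responses, file names, and recordings may still contain names or other direct identifiers and need separate review.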

Indirect identifiers

Indirect identifiers do not identify participants on their own but may do so when combined with other variables. Removing or modifying indirect identifiers requires consideration of context, what can be reasonably inferred, and the size of the potentially identifiable group. Examples include:

Social media–related variables
Gender identity
Income
Geographic variables
Occupation or industry
Ethnicity or immigration-related variables
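One common way to modify indirect identifiers is to generalize them, replacing exact values with broader categories. The sketch below is an illustration only: the function names, the band width of 25,000, and the choice to keep the first three characters of a postal code are assumptions, and suitable levels of generalization depend on the context and the size of the identifiable group.

```python
def income_band(income, width=25000):
    """Replace an exact income with a band such as '50000-74999'."""
    lower = (income // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postal_code(postal_code, keep=3):
    """Keep only the first `keep` characters of a postal code."""
    return postal_code.replace(" ", "").upper()[:keep]


# Made-up record for illustration only.
record = {"income": 52000, "postal_code": "V6T 1Z4", "occupation": "nurse"}

generalized = {
    **record,
    "income": income_band(record["income"]),
    "postal_code": coarsen_postal_code(record["postal_code"]),
}
print(generalized)
```

Generalization trades detail for protection, so wider bands lower the re-identification risk but also reduce the analytical value of the variable.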

Non-identifiers

Non-identifiers, such as ordinal ratings or short-term physiological measures, generally do not identify a participant. However, even non-identifiers may need additional protection if they relate to sensitive behaviours or characteristics.

Hidden identifiers

Hidden identifiers are pieces of information that appear non-identifying on their own but become identifying when combined with contextual information. Awareness of these risks is important in ensuring an anonymized dataset remains non-identifiable.

Who may be harmed by data disclosure?

Harm from disclosure may disproportionately affect vulnerable or marginalized groups, including:

Racialized communities
Lower-income groups
Children and teens
Individuals connected to sensitive personal topics such as health conditions, substance use, or private family issues

k-Anonymity and risk assessment

One method to assess anonymization is k-anonymity, which requires that each participant be indistinguishable from at least k − 1 other participants based on the indirect identifiers in the dataset. For example, if k = 5, at least four other participants must share the same combination of indirect identifiers.

While useful, k-anonymity is not foolproof, and some re-identification risk always remains. Software tools such as ARX and Amnesia can support anonymization of tabular data.
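To illustrate the idea, the sketch below counts how many records share each combination of indirect identifiers and reports the smallest group, which is the k that the dataset currently satisfies. It is a simplified check using only the Python standard library, not a substitute for tools such as ARX or Amnesia. The function names, column names, and sample records are hypothetical.

```python
from collections import Counter


def group_sizes(rows, indirect_identifiers):
    """Count the records that share each combination of indirect identifiers."""
    return Counter(tuple(row[col] for col in indirect_identifiers) for row in rows)


def k_anonymity(rows, indirect_identifiers):
    """Return the size of the smallest group (0 for an empty dataset)."""
    return min(group_sizes(rows, indirect_identifiers).values(), default=0)


def groups_below_k(rows, indirect_identifiers, k):
    """Return the combinations shared by fewer than k records."""
    sizes = group_sizes(rows, indirect_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Made-up, already generalized records for illustration only.
rows = [
    {"income": "50000-74999", "region": "V6T", "rating": 4},
    {"income": "50000-74999", "region": "V6T", "rating": 2},
    {"income": "50000-74999", "region": "V6T", "rating": 5},
    {"income": "25000-49999", "region": "V6T", "rating": 3},
]
columns = ["income", "region"]

print(k_anonymity(rows, columns))
print(groups_below_k(rows, columns, k=3))
```

In this sample, one record has a unique combination of income band and region, so the dataset is only 1-anonymous. Reaching a higher k would require generalizing further or suppressing that record.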

Consent language and ethical requirements

Informed consent must specify how data will be handled during the study and in future use. Many journals require datasets to be made available, and consent language must anticipate this requirement. In Canada, the Tri‑Council Policy Statement (TCPS 2) outlines ethics requirements for research involving humans. Institutional Research Ethics Boards review consent materials to ensure that privacy and confidentiality are addressed appropriately.

Researchers may use de-identification, anonymization, or pseudonymization to protect participant privacy.
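As one illustration of pseudonymization, the sketch below replaces participant IDs with codes derived from a keyed hash (HMAC-SHA256 from the Python standard library). This is an assumed approach for demonstration, not a method prescribed by this guide. The function name, the ID format, the 12-character code length, and the placeholder key are all hypothetical.

```python
import hashlib
import hmac


def pseudonymize(participant_id, secret_key, length=12):
    """Derive a stable pseudonym from an ID using a keyed hash."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Placeholder only: use a randomly generated key and store it
# securely, separate from the data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"

# Made-up records for illustration only.
records = [
    {"participant_id": "P-001", "rating": 4},
    {"participant_id": "P-002", "rating": 2},
    {"participant_id": "P-001", "rating": 5},
]

pseudonymized = [
    {**record, "participant_id": pseudonymize(record["participant_id"], SECRET_KEY)}
    for record in records
]
print(pseudonymized)
```

The same ID always maps to the same code, so records from one participant can still be linked. Because anyone holding the key can recompute the codes, pseudonymized data is not anonymous and the remaining variables still need to be assessed for re-identification risk.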

Future use of data

An application to a Research Ethics Board (REB) must address many considerations specific to the area of study. One important question is “what happens to the sensitive data in the future?”, that is, how the data will be handled after the research project is completed.

We can look at UBC’s Behavioural Research Ethics Board (BREB) guidance notes for the Future Use of Data (subsection 8.6) as an example:

“Describe any known future use of the data beyond the conclusion of this research project, and indicate whether participant consent will be obtained in the current consent procedure or if the participant will be contacted later to obtain consent. Either possibility must be described in the consent process. If consent is to be obtained now, future use of data must also be described in full in the consent form. If consent will be obtained later, an amendment will be needed that includes the full details and updated consent form before the additional use of data begins.”

One future use of sensitive data is making it openly available. Some funders and publishers require that de-identified data and research findings be deposited in a repository. Participants must be informed if their data will be made available in this way.

More language to inform participants on this topic is specified in UBC BREB subsection 8.6, “Access to Research Data”.

The how-to of data anonymization

There are several ways of approaching de-identification, each of which has benefits and drawbacks:

Need help? Contact research.data@ubc.ca