Data De-identification | Research Data Management

Protecting sensitive data is not only a legal obligation; it is also a commitment to ethical, trustworthy, and responsible research.

Sharing research data is an important scholarly practice. Still, when working with sensitive or personal information, appropriate de-identification techniques are essential to protect participant confidentiality and comply with ethical and legal requirements.

Download our introduction to data de‑identification guide as a PDF

See our hands-on Introduction to data de‑identification workshop, with exercises, on GitHub

Introduction to data de‑identification

While sharing research data supports transparency and reproducibility, researchers must take careful steps to ensure data is managed ethically, responsibly, and securely. Data de‑identification involves removing or modifying identifying information to reduce the risk that individuals, communities, or animals can be re‑identified, helping protect privacy and confidentiality while complying with ethical and legal obligations. If sensitive information is not properly de‑identified and becomes exposed, it may lead to significant harm, including the unintended disclosure of identities and potential negative impacts on those represented in the research.

What is the difference between data anonymization and data de-identification?

Data de-identification removes or masks direct identifiers such as names and ID numbers, while data anonymization goes further by transforming a dataset so thoroughly that re-identification is considered impossible even when combined with other data sources.

What is sensitive data?

Under UBC’s Information Technology Standard U1, sensitive data is information and data classified as medium risk, high risk, or very high risk. In broader research contexts, sensitive data could be defined as “information that must be safeguarded against unwarranted access or disclosure” and may apply to both human and animal research.

Sensitive data may involve personal health details, geographical information about endangered species, or information governed by institutional policy.

Why do we de‑identify data?

Data de‑identification reduces the risk of harm by limiting the chances that participants can be re‑identified in a dataset. Re‑identification may occur when an individual can be isolated and linked to identifying information through reasonable effort or by combining multiple data points. The level of potential harm varies depending on population characteristics, the research topic, and the broader context in which the data is used.

Types of identifying information

The variables in a dataset fall into four categories, based on how readily they can identify a participant:

Direct identifiers
Indirect identifiers
Non-identifiers
Hidden identifiers

Each category carries different risks and requires different approaches to de‑identification.

Direct identifiers

Direct identifiers immediately reveal a participant’s identity and must always be removed before sharing or publishing data. Examples include:

Names and initials
Full or partial addresses
Email addresses
Vehicle identifiers
Biometric data
Audio recordings containing identifiable voices
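
As a minimal sketch of removing direct identifiers from tabular data, the example below drops a set of columns from each record. The column names (`name`, `address`, `email`) and the sample records are hypothetical; a real dataset needs its own reviewed list of direct identifiers.

```python
# Hypothetical column names; replace with the direct identifiers in your dataset.
DIRECT_IDENTIFIERS = {"name", "address", "email"}


def drop_direct_identifiers(rows, identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without the listed direct-identifier columns."""
    return [
        {col: val for col, val in row.items() if col not in identifiers}
        for row in rows
    ]


# Made-up records for illustration only.
rows = [
    {"name": "A. Example", "email": "a@example.org", "age": 34, "rating": 4},
    {"name": "B. Example", "email": "b@example.org", "age": 51, "rating": 2},
]

cleaned = drop_direct_identifiers(rows)
print(cleaned)
```

The function returns new records rather than editing the originals, so the identifiable source data stays unchanged and can be kept in secure storage, separate from the shared copy.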

Indirect identifiers

Indirect identifiers do not identify participants on their own but may do so when combined with other variables. Removing or modifying indirect identifiers requires consideration of context, what can be reasonably inferred, and the size of the potentially identifiable group. Examples include:

Social media–related variables
Gender identity
Income
Geographic variables
Occupation or industry
Ethnicity or immigration-related variables
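
One common way to modify indirect identifiers, rather than remove them, is to make them coarser so that more participants share each value. The sketch below assumes two hypothetical columns, `income` and `postal_code`, and illustrative choices (income bands of 25,000 and the first three postal code characters); appropriate band widths depend on the dataset and the size of the potentially identifiable group.

```python
def income_band(income, width=25_000):
    """Replace an exact income with the band it falls in, e.g. '50000-74999'."""
    lower = (income // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postal_code(code, keep=3):
    """Keep only the first `keep` characters of a postal code."""
    return code.replace(" ", "")[:keep].upper()


def generalize(row):
    """Return a copy of the row with coarser income and geography values."""
    out = dict(row)
    out["income"] = income_band(row["income"])
    out["postal_code"] = coarsen_postal_code(row["postal_code"])
    return out


# Made-up record for illustration only.
row = {"income": 62_400, "postal_code": "A1B 2C3", "rating": 4}
generalized = generalize(row)
print(generalized)
```

Coarser values reduce analytical detail, so the choice of band width is a trade-off between data utility and re-identification risk.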

Non-identifiers

Non-identifiers, such as ordinal ratings or short-term physiological measures, generally do not identify a participant. However, even non-identifiers may need additional protection if they relate to sensitive behaviours or characteristics.

Hidden identifiers

Hidden identifiers are pieces of information that appear non-identifying on their own but become identifying when combined with contextual information. Awareness of these risks is important in ensuring an anonymized dataset remains non-identifiable.

Who may be harmed by data disclosure?

Harm from disclosure may disproportionately affect vulnerable or marginalized groups, including:

Racialized communities
Lower-income groups
Children and teens
Individuals connected to sensitive personal topics such as health conditions, substance use, or private family issues.

k-Anonymity and risk assessment

One method to assess de‑identification is k-anonymity, which requires that each participant be indistinguishable from at least k - 1 other individuals based on the indirect identifiers in the dataset. For example, if k = 5, at least four other participants must share the same combination of indirect identifiers.

While useful, k-anonymity is not foolproof, and some re-identification risk always remains. Software tools such as ARX and Amnesia can support better de‑identification of tabular data.
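
The k-anonymity check described above can be sketched in a few lines: group the records by their combination of indirect identifiers and take the size of the smallest group. The column names and records below are hypothetical, and this is only an illustration of the idea, not a substitute for a dedicated tool such as ARX or Amnesia.

```python
from collections import Counter


def group_sizes(rows, indirect_identifiers):
    """Count how many rows share each combination of indirect-identifier values."""
    return Counter(tuple(row[col] for col in indirect_identifiers) for row in rows)


def k_anonymity(rows, indirect_identifiers):
    """Return k: the size of the smallest group of indistinguishable rows."""
    sizes = group_sizes(rows, indirect_identifiers)
    return min(sizes.values()) if sizes else 0


def rows_below_k(rows, indirect_identifiers, k):
    """Return the rows whose group is smaller than the target k."""
    sizes = group_sizes(rows, indirect_identifiers)
    return [
        row for row in rows
        if sizes[tuple(row[col] for col in indirect_identifiers)] < k
    ]


# Made-up records for illustration only.
rows = [
    {"age_band": "30-39", "region": "North", "rating": 4},
    {"age_band": "30-39", "region": "North", "rating": 2},
    {"age_band": "30-39", "region": "North", "rating": 5},
    {"age_band": "40-49", "region": "South", "rating": 3},
]
cols = ["age_band", "region"]

print(k_anonymity(rows, cols))      # prints 1: one participant is unique
print(rows_below_k(rows, cols, 3))  # the rows that need further generalization or suppression
```

Rows flagged by `rows_below_k` are candidates for coarser categories or removal before the dataset is shared.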

Consent language and ethical requirements

Informed consent must specify how data will be handled during the study and in future use. Many journals require datasets to be made available, and consent language must anticipate this requirement. In Canada, the Tri‑Council Policy Statement (TCPS 2) outlines ethics requirements for research involving humans. Institutional Research Ethics Boards review consent materials to ensure that privacy and confidentiality are addressed appropriately.

Researchers may use de-identification, anonymization, or pseudonymization to protect participant privacy.
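
Pseudonymization replaces an identifier with a consistent artificial code, so that records for the same participant can still be linked. The sketch below shows one possible approach, a keyed hash (HMAC) from the Python standard library; the key, the participant IDs, and the 12-character code length are assumptions for illustration. The key must be stored securely and separately from the data, because anyone who holds it can regenerate the codes.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key, length=12):
    """Return a stable pseudonym for `value` derived from a secret key."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key; in practice, generate a random key and store it securely.
secret_key = b"replace-with-a-randomly-generated-key"

# Made-up participant IDs for illustration only.
participant_ids = ["P-1001", "P-1002", "P-1001"]
codes = [pseudonymize(pid, secret_key) for pid in participant_ids]

# The same participant always receives the same code.
print(codes[0] == codes[2])
```

Pseudonymized data can still be re-identified by someone with access to the key, so it generally warrants the same care as other sensitive data.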

Future use of data

When applying to the Research Ethics Board (REB) for your area of study, there are many things to consider. One important consideration is what happens to the sensitive data after the research project is completed.

We can look at UBC’s Behavioural Research Ethics Board (BREB) guidance notes for the Future Use of Data (subsection 8.6) as an example:

“Describe any known future use of the data beyond the conclusion of this research project, and indicate whether participant consent will be obtained in the current consent procedure or if the participant will be contacted later to obtain consent. Either possibility must be described in the consent process. If consent is to be obtained now, future use of data must also be described in full in the consent form. If consent will be obtained later, an amendment will be needed that includes the full details and updated consent form before the additional use of data begins.”

One possible future use of sensitive data is making it openly accessible. Some funders and publishers require that de-identified data and research findings be deposited in a repository. Participants must be informed whenever their data will be made available and accessible in this way.

More language to inform participants on this topic is specified in UBC BREB subsection 8.6, “Access to Research Data”.

The how-to of data de‑identification

There are several ways of approaching de-identification, each of which has benefits and drawbacks:

Need help? Contact research.data@ubc.ca