Protecting sensitive data is not only a legal obligation—it’s a commitment to ethical, trustworthy, and responsible research.
Sharing research data is an important scholarly practice. Still, when working with sensitive or personal information, appropriate de-identification techniques are essential to protect participant confidentiality and comply with ethical and legal requirements.
Introduction to Data Anonymization
Sensitive data requires careful handling to protect participant privacy and confidentiality, and to ensure compliance with ethical and legal obligations. Improper management or disclosure of sensitive data can expose individuals, communities, and animals to significant harm.
What is sensitive data?
Under UBC’s Information Technology Standard U1, sensitive data is information and data classified as medium risk, high risk, or very high risk. In broader research contexts, sensitive data could be defined as “information that must be safeguarded against unwarranted access or disclosure” and may apply to both human and animal research.
Sensitive data may involve personal health details, geographical information about endangered species, or information governed by institutional policy.
Why do we anonymize data?
Data anonymization reduces the risk of harm by making it much harder to re-identify participants. Re-identification may occur when, with reasonable effort, an individual can be isolated in a dataset and linked to information that identifies them. The potential harm depends on population characteristics, research topic, and context.
Types of identifying information
Variables in a dataset fall into four categories, according to how readily they identify a participant:
- Direct identifiers
- Indirect identifiers
- Non-identifiers
- Hidden identifiers
Each category carries different risks and requires different approaches to anonymization.
Direct identifiers
Direct identifiers immediately reveal a participant’s identity and must always be removed before sharing or publishing data. Examples include:
- Names and initials
- Full or partial addresses
- Email addresses
- Vehicle identifiers
- Biometric data
- Audio recordings containing identifiable voices
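As a minimal sketch of removing direct identifiers from tabular data, the Python snippet below drops a set of columns from each record before sharing. The column names (name, email, address, licence_plate) and the sample records are hypothetical; the actual list depends on the dataset.

```python
# Hypothetical column names; adjust this set to the dataset at hand.
DIRECT_IDENTIFIERS = {"name", "email", "address", "licence_plate"}


def drop_direct_identifiers(rows, direct_identifiers=DIRECT_IDENTIFIERS):
    """Return copies of the rows without any direct-identifier columns."""
    return [
        {key: value for key, value in row.items() if key not in direct_identifiers}
        for row in rows
    ]


# Made-up records for illustration only.
participants = [
    {"name": "A. Example", "email": "a@example.org", "age": 34, "rating": 4},
    {"name": "B. Example", "email": "b@example.org", "age": 51, "rating": 2},
]

shareable = drop_direct_identifiers(participants)
print(shareable)
```

The function returns new records and leaves the originals untouched, so the identifiable master copy can stay in secure storage while only the stripped copy is shared.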
Indirect identifiers
Indirect identifiers do not identify participants on their own but may do so when combined with other variables. Removing or modifying indirect identifiers requires consideration of context, what can be reasonably inferred, and the size of the potentially identifiable group. Examples include:
- Social media–related variables
- Gender identity
- Income
- Geographic variables
- Occupation or industry
- Ethnicity or immigration-related variables
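Indirect identifiers are often modified rather than removed. The sketch below illustrates three common modifications: grouping exact ages into bands, truncating a Canadian postal code to its first three characters, and capping unusually high values (top-coding). The function names, band width, and income ceiling are illustrative assumptions, not recommended thresholds.

```python
def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def generalize_postal_code(postal_code):
    """Keep only the first three characters of a Canadian postal code."""
    return postal_code.replace(" ", "").upper()[:3]


def top_code(value, ceiling):
    """Cap extreme values so that outliers no longer stand out."""
    return min(value, ceiling)


# Made-up record for illustration only.
record = {"age": 34, "postal_code": "V6T 1Z4", "income": 412000}

generalized = {
    "age_band": generalize_age(record["age"]),
    "region": generalize_postal_code(record["postal_code"]),
    "income": top_code(record["income"], ceiling=150000),
}
print(generalized)
```

How coarse each modification needs to be depends on context and on the size of the group that shares the resulting values.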
Non-identifiers
Non-identifiers, such as ordinal ratings or short-term physiological measures, generally do not identify a participant. However, even non-identifiers may need additional protection if they relate to sensitive behaviours or characteristics.
Hidden identifiers
Hidden identifiers are pieces of information that appear non-identifying on their own but become identifying when combined with contextual information. Awareness of these risks is important in ensuring an anonymized dataset remains non-identifiable.
Who may be harmed by data disclosure?
Harm from disclosure may disproportionately affect vulnerable or marginalized groups, including:
- Racialized communities
- Lower-income groups
- Children and teens
- Individuals connected to sensitive personal topics such as health conditions, substance use, or private family issues
k-Anonymity and risk assessment
One method to assess anonymization is k-anonymity, which requires that each participant be indistinguishable from at least k − 1 other participants based on their indirect identifiers. For example, if k = 5, at least four other participants must share the same combination of indirect identifiers.
While useful, k-anonymity is not foolproof, and some re-identification risk always remains. Software tools such as ARX and Amnesia can support anonymization of tabular data.
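The k-anonymity check can be sketched in a few lines of Python: group the records by their combination of indirect identifiers and take the size of the smallest group. The column names, the sample records, and the helper names k_anonymity and risky_groups are hypothetical. This is an illustration of the idea, not a substitute for a dedicated tool.

```python
from collections import Counter


def group_sizes(rows, indirect_identifiers):
    """Count how many rows share each combination of indirect identifiers."""
    return Counter(tuple(row[col] for col in indirect_identifiers) for row in rows)


def k_anonymity(rows, indirect_identifiers):
    """Return k for the dataset: the size of its smallest group (0 if empty)."""
    sizes = group_sizes(rows, indirect_identifiers)
    return min(sizes.values()) if sizes else 0


def risky_groups(rows, indirect_identifiers, k):
    """Return the combinations shared by fewer than k rows."""
    sizes = group_sizes(rows, indirect_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Made-up records for illustration only.
rows = [
    {"age_band": "30-39", "region": "V6T", "rating": 4},
    {"age_band": "30-39", "region": "V6T", "rating": 2},
    {"age_band": "30-39", "region": "V6T", "rating": 5},
    {"age_band": "50-59", "region": "V5K", "rating": 3},
]

columns = ["age_band", "region"]
print(k_anonymity(rows, columns))         # 1
print(risky_groups(rows, columns, k=3))   # {('50-59', 'V5K'): 1}
```

Here the dataset is only 1-anonymous, because one participant is the sole member of the 50-59 / V5K group. Reaching k = 3 would require coarser categories for that record or suppressing it.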
Consent language and ethical requirements
Informed consent must specify how data will be handled during the study and in future use. Many journals require datasets to be made available, and consent language must anticipate this requirement. In Canada, the Tri‑Council Policy Statement (TCPS 2) outlines ethics requirements for research involving humans. Institutional Research Ethics Boards review consent materials to ensure that privacy and confidentiality are addressed appropriately.
Researchers may use de-identification, anonymization, or pseudonymization to protect participant privacy.
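Pseudonymization, for instance, replaces an identifier with a code that can be reproduced only by someone who holds a secret key. The sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library. The function name, the participant ID format, and the 16-character length are assumptions made for illustration, and the key would need to be stored securely and separately from the shared data.

```python
import hashlib
import hmac
import secrets


def pseudonymize(participant_id, secret_key, length=16):
    """Map a participant ID to a stable pseudonym using a keyed hash."""
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Generate the key once and keep it apart from the dataset (assumed workflow).
secret_key = secrets.token_bytes(32)

# Hypothetical participant ID.
code = pseudonymize("P-0042", secret_key)
print(code)
```

The same ID and key always yield the same pseudonym, so records from different files can still be linked. Because the key holder can regenerate the mapping, pseudonymized data is generally not considered anonymized.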
Future use of data
An application to the REB responsible for your area of study must address many questions. One important question is “what happens to the sensitive data in the future?”, that is, how the sensitive data will be handled after the research project is completed.
We can look at UBC’s Behavioural Research Ethics Board (BREB) guidance notes for the Future Use of Data (subsection 8.6) as an example:
“Describe any known future use of the data beyond the conclusion of this research project, and indicate whether participant consent will be obtained in the current consent procedure or if the participant will be contacted later to obtain consent. Either possibility must be described in the consent process. If consent is to be obtained now, future use of data must also be described in full in the consent form. If consent will be obtained later, an amendment will be needed that includes the full details and updated consent form before the additional use of data begins.”
One possible future use of sensitive data is making it openly accessible. Some funders and publishers require that de-identified data and research findings be deposited in a repository. Participants must be informed if their data will be made available in this way.
More language to inform participants on this topic is specified in UBC BREB subsection 8.6, “Access to Research Data”.
The how-to of data anonymization
There are several ways of approaching de-identification, each of which has benefits and drawbacks:
Need help? Contact research.data@ubc.ca