Conducting Quantitative Risk Assessments for Anonymized Datasets and Documents: What This Means for Sponsors and Patient Privacy

CANTON, Mich. (10/2/2022) – Anonymizing clinical trial datasets and documents is now more important than ever.

In recent years, sharing clinical trial data has become a requirement of the regulatory process at the EMA and Health Canada.

Whether data is shared to meet regulatory requirements or voluntarily, the personal information of trial participants must be anonymized before it is shared, both to protect the participants’ privacy and to stay compliant with privacy protection laws.

Due to evolving technology and the availability of clinical trial data in various forms, there is always a concern about the re-identification of trial participants, even with data anonymization.

Therefore, assessing the inherent risk of re-identifying a trial participant in the shared data is required.


Estimating the risk of re-identification in an anonymized dataset

Risk can be defined as the probability of re-identifying a trial participant. Estimating risk means determining the probability that an intruder would discover the correct identity of a single record.

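The re-identification probability depends on the number of participants who share the same combination of identifying values across the dataset (an equivalence class). The more participants share a combination, the lower the risk for each of them: under the commonly used approach, if five participants share the same values, an intruder has a one-in-five (0.2) chance of matching any one of them correctly.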

The risk level (maximum or average) that needs to be considered is determined by how the data is being shared. You should consider maximum risk when the data is being shared publicly without any security controls and average risk when the data is being shared through a secured portal with security controls.
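As a rough illustration of the difference between maximum and average risk, the sketch below groups records into equivalence classes and treats each record's risk as one divided by the size of its class. This is a minimal sketch of a commonly used approach, not a description of any particular tool; the function name, the variable names (age_band, sex, country), and the toy records are assumptions made for the example.

```python
from collections import Counter

def reidentification_risk(records, quasi_identifiers):
    """Return (maximum_risk, average_risk) for a list of dict records.

    Assumption: a record's risk is 1 / (size of its equivalence class).
    """
    # Key each record by its combination of identifying values.
    keys = [tuple(rec[q] for q in quasi_identifiers) for rec in records]
    # Equivalence class -> number of participants sharing those values.
    class_sizes = Counter(keys)
    per_record = [1.0 / class_sizes[k] for k in keys]
    return max(per_record), sum(per_record) / len(per_record)

# Hypothetical toy data: one class of four participants, one class of two.
records = [
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "30-39", "sex": "F", "country": "US"},
    {"age_band": "40-49", "sex": "M", "country": "CA"},
    {"age_band": "40-49", "sex": "M", "country": "CA"},
]

max_risk, avg_risk = reidentification_risk(
    records, ["age_band", "sex", "country"]
)
print(f"maximum risk = {max_risk:.3f}, average risk = {avg_risk:.3f}")
# prints: maximum risk = 0.500, average risk = 0.333
```

The maximum is driven entirely by the smallest equivalence class, which is why it is the more conservative figure for public release, while the average reflects the dataset as a whole.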

There are quite a few precedents for what can be considered an acceptable amount of risk. These precedents have been used for many decades, are consistent internationally, and have persisted over time.

Managing re-identification risk means:

(1) selecting an appropriate risk metric (e.g., k-anonymity, l-diversity, t-closeness),

(2) selecting an appropriate threshold (the industry standard is to set the threshold at 0.09), and

(3) measuring the risk in the actual clinical trial dataset or documents that will be disclosed.

Once a threshold has been determined, the actual probability of re-identification is measured in the dataset.

If the probability is higher than the threshold, the data needs to be transformed. These transformations may include further generalizing values into broader categories (which enlarges the equivalence classes) and/or, for documents, redacting data.

Otherwise, the dataset can be declared to have an acceptable risk level for re-identification.
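The measure, compare, and transform cycle described above can be sketched as a simple loop. This is only an illustration under stated assumptions: it uses maximum risk, the 0.09 threshold mentioned earlier, and a hypothetical transformation that generalizes exact ages into progressively wider bands. The helper names and the toy data are invented for the example.

```python
from collections import Counter

THRESHOLD = 0.09  # threshold cited in the text

def max_risk(records, quasi_identifiers):
    """Maximum risk: 1 / (size of the smallest equivalence class)."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    return 1.0 / min(Counter(keys).values())

def generalize_age(record, width):
    """Hypothetical transformation: replace an exact age with a band."""
    low = (record["age"] // width) * width
    out = dict(record)  # copy so the original record is left unchanged
    out["age"] = f"{low}-{low + width - 1}"
    return out

# Hypothetical toy data: 40 participants, one at each age from 40 to 79.
records = [{"age": 40 + i, "sex": "F"} for i in range(40)]
qi = ["age", "sex"]

print("untransformed:", max_risk(records, qi))  # every record is unique

# Widen the age bands until the measured risk is at or below the threshold.
for width in (5, 10, 20):
    banded = [generalize_age(r, width) for r in records]
    risk = max_risk(banded, qi)
    print(f"band width {width}: maximum risk = {risk:.3f}")
    if risk <= THRESHOLD:
        break
```

In this toy case, 5-year and 10-year bands still leave the risk above the threshold (0.2 and 0.1), and only the 20-year bands bring it down to 0.05. In practice each step of generalization costs data utility, so the narrowest transformation that meets the threshold is usually the one to prefer.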


What about anonymized documents?

Work is ongoing within the industry to establish standards for quantitative risk assessment of anonymized and/or redacted documents.

At MMS, we have created a template in which the quantitative re-identification risk assessment includes a conservative threshold factor. The factor is based on the uniqueness of the data in the document as compared to the underlying dataset: each variable is weighted by the number of unique values in the dataset equivalence group divided by the number of participants in the document.

This methodology incorporates the number and uniqueness of the data in the document, compared to the overall dataset, in adjusting the overall risk of re-identification of the participants in the document.


The future of risk assessments

We continue to monitor research and industry trends associated with quantitative risk assessment. Our experts enhance and adjust our efforts in this area to provide cutting-edge solutions to quantify the risk of re-identification of clinical trial datasets and documents.

By: Veera Thota, Principal Statistical Programmer, and Harry Haber, Senior Principal Biostatistician

