Real-World Data and GDPR in Health AI

Health artificial intelligence systems increasingly depend on the availability of large, diverse, and high-quality datasets derived from real-world sources. These real-world data (RWD) streams, encompassing electronic health records, insurance claims, patient-generated data from wearables, and environmental or genomic datasets, offer a critical window into the actual performance and safety profile of medical technologies. However, the pathway to leveraging this data within the European Union is governed by one of the world’s most stringent data protection frameworks: the General Data Protection Regulation (GDPR). For developers, healthcare providers, and data controllers operating in the health AI space, understanding the interplay between the need for robust data and the imperatives of privacy is not merely a compliance exercise; it is a fundamental prerequisite for ethical innovation and market access.

This article provides a detailed analysis of the lawful bases for processing RWD, the practical application of pseudonymisation and anonymisation, and the complex realities of cross-border data flows in the health sector. It is written from the perspective of a practitioner who must translate legal principles into operational reality, bridging the gap between legal theory, technical implementation, and regulatory expectations.

The Foundation: Lawful Bases for Processing Health Data

Under the GDPR, health data is classified as a special category of personal data (Article 9), subject to higher protection standards. Processing such data is, in principle, prohibited unless a specific condition under Article 9(2) is met. For health AI development and deployment, several of these conditions are relevant, but their application is nuanced and context-dependent.

Article 9(2)(a): Explicit Consent

Consent is often the first legal basis considered, but it is also the most frequently misunderstood in the health context. For consent to be valid under GDPR, it must be freely given, specific, informed, and unambiguous. Crucially, for special category data, the consent must be explicit. This means the data subject must confirm their agreement through an express statement or a clear affirmative act, and the request for consent must be presented separately from other terms and conditions.

In the context of RWD for AI, the main challenge with consent is the concept of granular purpose limitation. It is generally not acceptable to ask a patient for broad consent to use their data for “future research” or “AI development”. The European Data Protection Board (EDPB) has consistently advised that consent for scientific research must be specific to a research area or project, and data subjects must be able to understand what their data will be used for. Furthermore, the GDPR explicitly states that consent is not considered freely given if there is a “clear imbalance” between the data subject and the controller, which is a common scenario in the doctor-patient relationship. While consent can be a valid basis for certain patient registries or specific studies, relying on it for the continuous training of AI models from routine clinical care data is fraught with legal risk. The right to withdraw consent at any time also creates operational challenges for AI models that have already been trained on that data, raising difficult questions about model retraction or retraining.

Article 9(2)(h): Healthcare, Diagnosis, and Preventive or Occupational Medicine

A more robust legal basis for many health AI applications is Article 9(2)(h), which permits the processing of special category data where it is necessary for the purposes of preventive or occupational medicine, medical diagnosis, the provision of health or social care or treatment, or the management of health or social care systems. This basis is often coupled with Article 6(1)(e) of the GDPR, which allows processing necessary for the performance of a task carried out in the public interest or in the exercise of official authority vested in the controller.

This is a powerful basis, but it is not a blank cheque. The processing must be necessary and proportionate. For example, using RWD from a hospital’s EHR system to train an AI model for early detection of sepsis within that same hospital’s patient population could fall under this provision, as it is directly linked to the provision of medical care. However, the same data being used to train a commercial model for sale to third-party hospitals would be on much shakier ground under this basis alone, as the link to the original purpose of care provision becomes attenuated. National laws may also specify which entities (e.g., accredited research institutions, hospitals) can rely on this basis for secondary uses of data.

Article 9(2)(j): Scientific or Historical Research or Statistical Purposes

For AI development that is explicitly framed as scientific research, Article 9(2)(j) provides a legal basis, provided it is enshrined in Union or Member State law which aims to safeguard the rights of the data subject. This is the basis relied upon by many biobanks and large-scale research initiatives like the European Health Data Space (EHDS). The key here is that the research must be conducted in the public interest and adhere to principles of scientific integrity. The data cannot be used for purely commercial purposes that are not linked to a research objective. Furthermore, the data subject’s right to erasure and the right to object are restricted in this context, but not eliminated. The controller must still provide information about the processing and respect the data subject’s rights, albeit within the constraints of the research purpose.

Legitimate Interests: A High Bar for Health Data

While Article 6(1)(f) allows processing based on legitimate interests, an Article 6 basis alone never authorises the processing of special category data: a condition under Article 9(2) is always required in addition, and the EDPB has made clear that legitimate interests is generally not a suitable route for health data. The balancing test required by the legitimate interest basis is unlikely to favour the controller when the processing concerns highly sensitive health information; the fundamental rights and freedoms of the data subject will almost always outweigh the controller’s interest. Controllers should therefore be extremely cautious about relying on legitimate interests as a standalone basis for health AI.

Pseudonymisation and Anonymisation: Myths and Practical Realities

Many organisations believe that by applying pseudonymisation techniques, they are no longer processing personal data and can therefore bypass GDPR obligations. This is a dangerous misconception. The GDPR makes a clear distinction between personal data, pseudonymised data, and anonymous data.

Pseudonymisation: A Security Measure, Not an Escape

The GDPR defines pseudonymisation in Article 4(5) as the processing of personal data in such a manner that the data can no longer be attributed to a specific data subject without the use of additional information. This additional information must be kept separately and be subject to technical and organisational measures to ensure it is not attributed to an identified or identifiable natural person.

From a legal perspective, pseudonymised data remains personal data. The “additional information” (e.g., the key linking a pseudonym to a name) is often available to the data controller or a trusted third party. Therefore, the data subject is still identifiable, albeit with more effort. Pseudonymisation is a crucial risk-mitigation technique and a key technical safeguard, but it does not change the fundamental legal nature of the data. Controllers using pseudonymised RWD must still have a lawful basis for processing, provide privacy notices, and uphold data subject rights.

However, pseudonymisation is strongly encouraged by the GDPR. It is a key factor in demonstrating compliance with the principle of ‘data protection by design and by default’. It reduces the risk of re-identification and is one of the safeguards expected when relying on the research derogations of Article 89, where consent is often not feasible. In practice, a robust pseudonymisation scheme, where the key is held by a separate, trusted entity (a ‘trusted third party’ or ‘pseudonymisation provider’) and is not accessible to the AI developers, significantly strengthens the privacy case and can be a key enabler for data sharing in health research consortia.
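
To make this concrete, the following Python sketch illustrates one common keyed-pseudonymisation pattern, using an HMAC over the patient identifier. The key, field names, and record layout are illustrative assumptions; in a real deployment the key (and any re-identification table) would be generated and held by the trusted third party, not by the AI developers.

  import hmac
  import hashlib
  import secrets

  # Minimal sketch of keyed pseudonymisation using an HMAC over the patient
  # identifier. Key handling is deliberately simplified; in practice the key
  # would be managed by a separate trusted party.
  PSEUDONYMISATION_KEY = secrets.token_bytes(32)  # held separately from the data

  def pseudonymise(patient_id: str) -> str:
      # A stable pseudonym; without the key, the mapping cannot be rebuilt
      # by hashing candidate identifiers.
      return hmac.new(PSEUDONYMISATION_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

  record = {"patient_id": "NL-123456", "icd10": "E11.9", "year_of_birth": 1967}
  pseudonymised_record = {**record, "patient_id": pseudonymise(record["patient_id"])}

Because the pseudonym is derived with a secret key rather than a plain hash, a party that obtains the dataset but not the key cannot reconstruct the mapping by hashing lists of candidate identifiers. The data nonetheless remains personal data in the hands of anyone who can obtain that key.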

Anonymisation: The Gold Standard but a High Bar

True anonymisation, where the data is no longer personal data because the individual is no longer identifiable, is the ultimate goal for many data-sharing initiatives. If data is genuinely anonymous, the GDPR does not apply. However, the bar for achieving this is exceptionally high. The concept of “identifiability” must be interpreted broadly. A data subject is identifiable if they can be identified “directly or indirectly, by reference to an identifier such as a name, an identification number, location data, an online identifier, or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

The risk of re-identification is not static; it evolves with technology and the availability of other datasets. This is the concept of the “motivated intruder”. A controller must consider whether a reasonably motivated third party, with access to publicly available information or other datasets, could re-identify individuals. Simply removing names and addresses is insufficient. Combinations of quasi-identifiers (e.g., postcode, date of birth, rare disease code) can often lead to re-identification. Techniques like k-anonymity, l-diversity, and differential privacy are used to mitigate these risks, but no technique is foolproof. The EDPB has issued guidance stating that anonymisation should be assessed on a case-by-case basis and that the controller must document the process and the residual risk of re-identification. For RWD in health, where datasets are rich and often linked to other sources, achieving true anonymisation while retaining sufficient data utility for AI model training is a significant technical and legal challenge.
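
To illustrate why quasi-identifiers matter, the following sketch computes the k-anonymity of a toy dataset with pandas: the size of the smallest group of records sharing the same quasi-identifier values. The column names and values are invented for illustration; real RWD would carry far more attributes and correspondingly greater linkage risk.

  import pandas as pd

  # Toy dataset; columns and values are invented for illustration only.
  df = pd.DataFrame({
      "postcode":   ["1011", "1011", "1012", "1012", "1012"],
      "birth_year": [1950, 1950, 1980, 1980, 1980],
      "diagnosis":  ["I10", "I10", "E11", "E11", "E11"],
  })

  QUASI_IDENTIFIERS = ["postcode", "birth_year"]

  def k_anonymity(frame: pd.DataFrame, qi_cols: list[str]) -> int:
      # The dataset's k is the size of the smallest equivalence class over
      # the quasi-identifiers; k == 1 means at least one record is unique.
      return int(frame.groupby(qi_cols).size().min())

  print(k_anonymity(df, QUASI_IDENTIFIERS))  # 2 for this toy example

A high k on its own does not prove anonymity, and a low k does not automatically prove identifiability; it is one input into the documented, case-by-case assessment described above.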

Myth: “We can use the data for AI if we just pseudonymise it”

Reality: Pseudonymisation is a vital security control, but it does not create a new legal basis for processing. You still need to satisfy one of the conditions in Article 6 and, for health data, one of the conditions in Article 9. The use of pseudonymised data for AI training does not absolve the controller of its GDPR obligations.

Myth: “If we remove direct identifiers, the data is anonymous”

Reality: Anonymisation is about the practical possibility of re-identification, not just the removal of obvious identifiers. The richness of health data (e.g., multiple diagnoses, timestamps, treatment pathways) creates a high risk of re-identification, especially when combined with other datasets. The burden of proof for anonymity lies with the controller.

Cross-Border Data Realities: From Schrems II to the EHDS

Health AI development is often a collaborative, cross-border effort. Data may be collected in one Member State, processed by a cloud provider with servers in another, and used to train a model by a team in a third. This creates significant legal complexity under GDPR.

International Transfers: A Persistent Challenge

GDPR Chapter V (Articles 44-50) governs the transfer of personal data to countries outside the European Economic Area (EEA). The core principle is that a transfer can only take place if the destination country ensures an adequate level of data protection. The EU’s “adequacy decisions” (e.g., for Japan, the UK, and recently, the US under the EU-US Data Privacy Framework) provide a legal pathway. However, for many countries, no adequacy decision exists.

In the absence of adequacy, controllers must rely on appropriate safeguards, most commonly the Standard Contractual Clauses (SCCs) adopted by the European Commission. The use of SCCs is not a simple box-ticking exercise. Following the landmark Schrems II judgment of the Court of Justice of the EU (CJEU), controllers must conduct a Transfer Impact Assessment (TIA). This involves:

  1. Mapping the data flow to understand what data is being transferred, where it is going, and who will have access to it.
  2. Assessing the laws and practices of the destination country, particularly regarding government access to data (e.g., for surveillance purposes).
  3. Identifying any gaps between the protection offered by the destination country’s laws and the protection required by GDPR.
  4. Implementing supplementary measures to bridge any identified gaps.

For health AI, this is particularly challenging. If a US-based cloud provider is processing pseudonymised EU health data, the TIA must consider the potential for US intelligence agencies (under FISA 702) to request access to that data and compel the provider to disclose the “additional information” needed for re-identification. Supplementary measures might include strong encryption (with keys held exclusively within the EEA), contractual clauses prohibiting the provider from complying with unlawful access requests, and technical measures to ensure data is processed in a way that minimises the risk of access. The EDPB has issued recommendations on these measures, but their practical application requires deep technical and legal expertise.
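
As one concrete illustration of the “keys held exclusively within the EEA” measure, the sketch below encrypts a pseudonymised extract client-side before it is handed to a non-EEA provider, using the Fernet primitive from the Python cryptography library. The file names and key handling are simplified assumptions; in practice the key would be generated and stored in an EEA-based key management service and never shared with the data importer.

  from cryptography.fernet import Fernet

  # Client-side encryption inside the EEA before transfer to a non-EEA provider.
  # File names are illustrative; key management is deliberately simplified.
  eea_held_key = Fernet.generate_key()   # generated and retained within the EEA
  cipher = Fernet(eea_held_key)

  with open("pseudonymised_extract.csv", "rb") as infile:
      ciphertext = cipher.encrypt(infile.read())

  with open("extract_for_transfer.bin", "wb") as outfile:
      outfile.write(ciphertext)          # only ciphertext leaves the controller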

The European Health Data Space (EHDS): A Paradigm Shift

The proposed regulation on the European Health Data Space (EHDS) aims to create a common framework for the use of electronic health data for both primary use (the delivery of care) and secondary use (research, innovation, and policy-making). If adopted, the EHDS will fundamentally change the landscape for RWD in health AI.

The EHDS establishes a system of “health data access bodies” in each Member State, which will act as one-stop-shops for authorising requests to use anonymised or pseudonymised electronic health data for secondary purposes. It introduces the concept of a “secondary use” legal basis, which would allow for the processing of electronic health data for research, innovation, and public health purposes, subject to strict conditions. This could potentially streamline the process for accessing RWD for AI development, reducing the need to navigate complex national laws and seek multiple consents. The EHDS also proposes the creation of a “single market for data” where data can be shared across borders under a harmonised framework, simplifying the cross-border transfer rules for data used within the EHDS ecosystem. While the EHDS is still in the legislative process, it signals a clear direction of travel: towards more structured, secure, and harmonised access to health data for AI and research, but under the close supervision of public authorities.

National Implementations and Divergences

While GDPR is a regulation, meaning it is directly applicable in all Member States, it contains numerous opening clauses that allow national governments to legislate on specific aspects. This has led to a patchwork of national laws that can create complexity for pan-European health AI projects.

For instance, the processing of health data for scientific research is governed by national law under Article 9(2)(j) and Article 89 of the GDPR. This means the conditions under which research can be conducted without explicit consent vary significantly. In some countries, like the Netherlands, a system of “opt-out” for research purposes is permitted under certain conditions, provided the research is of public interest. In other countries, such as Germany, national law (the Federal Data Protection Act) sets a very high bar, often requiring explicit consent or a clear statutory basis for research, and may require approval from an ethics committee. Similarly, the rules for processing data concerning deceased persons vary, which is relevant for retrospective studies.

Controllers operating across Europe must therefore not only comply with the GDPR but also understand the specific national legislation in each country where they process data. This requires a detailed legal analysis and a flexible data governance framework that can adapt to different national requirements. For example, a hospital consortium spanning France, Germany, and Italy would need to map out the different legal bases and procedural requirements (e.g., ethics committee approvals, data protection authority notifications) in each jurisdiction before launching a joint AI training project using RWD from all three sites.

Practical Safeguards and Accountability

Compliance with GDPR is not just about selecting the right legal basis; it is about demonstrating accountability through robust technical and organisational measures. For health AI, this means integrating data protection principles throughout the entire AI lifecycle.

Data Protection by Design and by Default

This principle requires that data protection measures are embedded into the design of the AI system from the outset. For RWD, this includes:

  • Minimisation: Only collect and process the data that is strictly necessary for the intended purpose. For AI training, this may involve using feature selection techniques to exclude highly sensitive or unnecessary data points.
  • Pseudonymisation: As a default, process RWD in a pseudonymised form. The link to the data subject’s identity should be held separately and only accessed when absolutely necessary.
  • Secure Processing Environments: Use secure, encrypted environments for data processing and model training. Access controls must be strict and auditable. Techniques like federated learning, where the model is trained on data that remains at its source location without being centralised, can be a powerful tool for privacy-preserving AI development (see the sketch following this list).
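
To illustrate the federated learning point above, here is a minimal sketch of federated averaging (FedAvg) for a simple linear model in plain NumPy: each site computes a local update on data that never leaves its premises, and only model weights are aggregated. The data, model, and hyperparameters are toy assumptions; a real deployment would add secure aggregation, authentication, and governance controls.

  import numpy as np

  def local_update(weights, X, y, lr=0.01, epochs=5):
      # Gradient descent on a least-squares objective, run on-premise at one site.
      w = weights.copy()
      for _ in range(epochs):
          grad = X.T @ (X @ w - y) / len(y)
          w -= lr * grad
      return w

  def federated_round(global_w, site_datasets):
      # Each site returns only its updated weights; raw records never move.
      updates = [local_update(global_w, X, y) for X, y in site_datasets]
      sizes = np.array([len(y) for _, y in site_datasets], dtype=float)
      return np.average(updates, axis=0, weights=sizes)  # weight by sample count

  # Synthetic stand-ins for three hospitals' local extracts.
  rng = np.random.default_rng(0)
  sites = [(rng.normal(size=(100, 5)), rng.normal(size=100)) for _ in range(3)]
  weights = np.zeros(5)
  for _ in range(10):
      weights = federated_round(weights, sites)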

Transparency and Explainability

GDPR grants data subjects a right to information about the processing of their data (Article 13/14). For health AI, this means providing clear and accessible information about:

  • What RWD is being used.
  • For what specific purpose (e.g., “to train an AI model to detect diabetic retinopathy”).
  • What the lawful basis is.
  • Who the data is being shared with (e.g., a research partner, a cloud provider).
  • How long the data will be retained.
  • The existence of any automated decision-making and the logic involved, as well as the consequences of such processing.

This is particularly challenging for complex AI models. While a full technical explanation of a deep learning model may be impossible, the information provided must be meaningful. It should explain the purpose and logic in a way that an average data subject can understand, without necessarily revealing the proprietary algorithm.
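
One practical way to keep these information duties consistent is to capture them as a single structured record that can feed both the patient-facing notice and the controller’s record of processing activities. The sketch below is a hypothetical Python data structure; the field names and example values are assumptions, not prescribed GDPR terminology.

  from dataclasses import dataclass

  # Hypothetical structured record of the Article 13/14 information items listed
  # above; field names and example values are illustrative, not prescribed terms.
  @dataclass
  class ProcessingNotice:
      data_categories: list[str]
      purpose: str
      lawful_basis: str
      recipients: list[str]
      retention_period: str
      automated_decision_making: str

  notice = ProcessingNotice(
      data_categories=["retinal images", "HbA1c results"],
      purpose="Training an AI model to detect diabetic retinopathy",
      lawful_basis="Article 9(2)(j) GDPR with national research-law safeguards",
      recipients=["research consortium partner", "EEA-based cloud provider"],
      retention_period="10 years after study completion",
      automated_decision_making="Outputs reviewed by a clinician; no solely automated decisions",
  )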

Conducting a Data Protection Impact Assessment (DPIA)

Processing RWD for health AI is likely to be “high-risk” under GDPR, triggering the requirement for a DPIA (Article 35). A DPIA is a formal process to identify and mitigate risks to the rights and freedoms of individuals. For a health AI project, a DPIA should cover:

  • A systematic description of the processing, including the sources of RWD, the types of data, the AI models used, and the intended outcomes.
  • An assessment of the necessity and proportionality of the processing in relation to its purpose.
  • An assessment of the risks to data subjects’ rights (e.g., re-identification, discrimination based on algorithmic bias, lack of transparency in automated decision-making).
  • The measures envisaged to address those risks, including safeguards, security measures, and mechanisms to demonstrate compliance.