
Clinical Evaluation for AI: What Evidence Means in Practice

Artificial intelligence systems intended for use in healthcare occupy a unique position at the intersection of rapid technological innovation and stringent patient safety requirements. For developers, clinical institutions, and regulatory professionals, the pathway to placing a medical device powered by AI on the European market is governed by a complex interplay of legal frameworks. The core of this regulation is not merely about proving that an algorithm functions as intended in a controlled environment; it is about demonstrating, through robust and continuous evidence, that the system is safe, effective, and performs as expected across the diverse and dynamic realities of clinical practice. The transition from the Medical Device Directive (MDD) to the Medical Device Regulation (MDR) and the introduction of the AI Act have significantly raised the bar for what constitutes acceptable clinical evidence. This article examines the evidentiary requirements for the clinical evaluation of AI systems, focusing on dataset quality, validation design, generalization, and the critical role of post-market surveillance.

The Regulatory Landscape: MDR, IVDR, and the AI Act

The clinical evaluation of any medical device in the European Union is anchored in the Medical Device Regulation (EU) 2017/745 (MDR) and the In Vitro Diagnostic Medical Device Regulation (EU) 2017/746 (IVDR). These regulations mandate a lifecycle approach to safety and performance, beginning with pre-market assessment and extending through post-market surveillance. For AI and machine learning-enabled devices, the evidentiary burden is particularly high due to their potential for adaptive behavior and their reliance on data-driven training. The MDR and IVDR require that clinical evidence be sufficient to demonstrate conformity with the relevant General Safety and Performance Requirements (GSPRs). This evidence must be generated through a systematic and planned process, documented in the Clinical Evaluation Report (CER), and continuously updated throughout the device’s lifecycle.

The more recent Artificial Intelligence Act (AI Act) introduces a horizontal layer of regulation for high-risk AI systems, a category that covers most AI-enabled medical devices. While the MDR/IVDR focus on clinical safety and performance, the AI Act addresses risks related to fundamental rights, non-discrimination, and robustness from a data governance and algorithmic transparency perspective. For medical AI, the two frameworks are complementary: a manufacturer must satisfy the clinical evidence requirements of the MDR/IVDR while also meeting the data quality, risk management, and post-market monitoring obligations of the AI Act. This creates a dual compliance burden in which the evidence generated for one regulation often serves to meet the requirements of the other, even though the focus and specific criteria can differ.

Defining Clinical Evidence and Clinical Evaluation

It is essential to distinguish between clinical evidence and clinical evaluation. Clinical evidence is the collection of data and information regarding the safety and performance of a device. This includes data from clinical investigations, literature reviews, and real-world performance monitoring. The clinical evaluation is the systematic and structured process of assessing this evidence to verify the device’s safety, performance, and benefit-risk profile. The output of this process is the Clinical Evaluation Report (CER), a foundational document for regulatory submission.

Under the MDR, clinical evaluation must be conducted for all devices, regardless of class, and must follow a defined plan. For higher-risk devices (Class IIa, IIb, and III), the evaluation must be more rigorous and may require clinical investigation data. The regulation emphasizes that clinical evaluation must be based on current state-of-the-art knowledge and must consider the intended purpose, technical features, and the clinical context in which the device will be used. For AI systems, this means the evaluation must address not only the algorithm’s output but also its interaction with users, its integration into clinical workflows, and its potential for automation bias.

Evidence from Clinical Investigations: In-Silico, In-Vitro, and In-Vivo

The traditional model of clinical evidence relies heavily on prospective, randomized controlled trials (RCTs). While this remains the gold standard for many interventions, it is often impractical or ethically challenging for AI systems, particularly those that evolve over time or are intended to support complex diagnostic decisions. The MDR and IVDR acknowledge this by allowing for a broader range of data sources to be included in the clinical evaluation. This includes data from scientific literature, performance evaluation studies (for IVDR), and well-designed clinical investigations that may not be RCTs.

For AI, a particularly important category of evidence is in-silico validation. This involves using computational models and synthetic data to test an algorithm’s performance under a wide range of simulated conditions. While in-silico data alone is rarely sufficient to demonstrate clinical benefit, it is invaluable for exploring the boundaries of an algorithm’s performance, identifying failure modes, and assessing robustness against rare but critical scenarios. Regulatory bodies are increasingly open to in-silico evidence, provided it is part of a wider evidence generation strategy that includes real-world data. The key is that the simulation models must be validated against real-world phenomena, and the assumptions underlying the synthetic data must be transparently documented.
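
To illustrate the approach, the sketch below sweeps a synthetic case generator across simulated disease prevalence and acquisition-noise settings and records the sensitivity and specificity of a stand-in classifier. The generator, the `run_model` placeholder, and the parameter grid are assumptions chosen purely for illustration; a real in-silico study would use validated simulation models and the manufacturer’s actual algorithm.

```python
# Illustrative in-silico stress test (not a regulatory method): sweep simulated
# conditions and record how a stand-in classifier behaves at the extremes.
import numpy as np

rng = np.random.default_rng(42)

def run_model(signal):
    """Hypothetical stand-in for the AI system: a fixed-threshold classifier."""
    return (signal > 0.5).astype(int)

def simulate_cases(n, prevalence, noise_sd):
    """Generate synthetic cases: a latent disease state plus acquisition noise."""
    disease = (rng.random(n) < prevalence).astype(int)
    signal = disease * 0.8 + rng.normal(0.0, noise_sd, n)
    return disease, signal

# Sweep simulated prevalence and noise levels to map potential failure modes.
for prevalence in (0.05, 0.20, 0.50):
    for noise_sd in (0.05, 0.20, 0.40):
        y_true, x = simulate_cases(10_000, prevalence, noise_sd)
        y_pred = run_model(x)
        sensitivity = (y_pred[y_true == 1] == 1).mean()
        specificity = (y_pred[y_true == 0] == 0).mean()
        print(f"prevalence={prevalence:.2f} noise={noise_sd:.2f} "
              f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```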

In-vitro data, particularly for diagnostic AI, involves testing the algorithm against a curated set of reference samples or datasets. This is a crucial step in establishing analytical validity—the ability of the system to accurately and reliably detect the target signal. However, the MDR and IVDR require that analytical performance be linked to clinical performance. An algorithm that perfectly identifies a biomarker in a lab setting is not necessarily clinically useful if it cannot do so in the noisy, variable environment of a clinical laboratory or point-of-care setting.

In-vivo clinical investigations remain the cornerstone of evidence for high-risk AI systems. These studies collect data from patients in a real-world or simulated clinical setting. For AI, the design of these studies is critical. A key question is whether to study the AI as an isolated intervention or as part of a clinical pathway. Increasingly, regulators expect evidence of the AI’s impact on the overall clinical decision-making process, including its effect on clinician behavior, diagnostic accuracy, and patient outcomes. This requires study designs that can measure net clinical benefit, not just the algorithm’s standalone accuracy.

Dataset Quality as a Pillar of Evidence

In the context of AI, the quality of the training and validation datasets is not merely a technical prerequisite; it is a fundamental component of the clinical evidence. A model is only as good as the data it learns from. The MDR and AI Act both place significant emphasis on data quality, relevance, and representativeness. A dataset that is technically perfect but clinically biased will produce an algorithm that is unsafe or ineffective for certain patient populations.

Representativeness is the most critical attribute of a dataset for clinical evaluation. The data used to train and validate an AI system must reflect the population and clinical conditions for which the device is intended. This includes demographic diversity (age, sex, ethnicity), clinical diversity (disease severity, comorbidities), and technical diversity (different imaging equipment, laboratory analyzers, or data sources). For example, a dermatology AI trained primarily on images from light-skinned individuals will perform demonstrably worse on darker skin tones and could lead to misdiagnosis in those patients. Regulators will scrutinize the demographic composition of the dataset and require justification for any underrepresented groups. The burden of proof is on the manufacturer to demonstrate that the algorithm’s performance is equitable across the intended population.
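
As a concrete illustration, the sketch below stratifies validation results by a hypothetical skin-tone grouping and flags any subgroup whose sensitivity falls below an assumed acceptance criterion. The column names, the grouping, and the 0.80 threshold are illustrative assumptions, not figures taken from the MDR or the AI Act.

```python
# Illustrative subgroup performance check on hypothetical validation results.
import pandas as pd

# One row per validated case; the miss pattern for "V-VI" is deliberate to show a flag.
results = pd.DataFrame({
    "skin_tone": ["I-II", "I-II", "III-IV", "III-IV", "V-VI", "V-VI"] * 50,
    "y_true":    [1, 0, 1, 0, 1, 0] * 50,
    "y_pred":    [1, 0, 1, 0, 0, 0] * 50,   # positives on darker skin tones are missed
})

MIN_SENSITIVITY = 0.80  # assumed per-subgroup acceptance criterion

for group, df in results.groupby("skin_tone"):
    positives = df[df["y_true"] == 1]
    sensitivity = (positives["y_pred"] == 1).mean()
    status = "OK" if sensitivity >= MIN_SENSITIVITY else "REVIEW"
    print(f"{group}: n={len(df)} sensitivity={sensitivity:.2f} [{status}]")
```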

Beyond representativeness, the quality of the data itself is under scrutiny. This encompasses the accuracy, completeness, and consistency of the data. For medical imaging, it means images free from confounding artifacts, high-quality expert annotations, and consistent labeling protocols. For clinical data, it means accurate and complete electronic health records (EHR) or case report forms. The MDR requires that clinical data be of sufficient quality and quantity to support the claims being made; this is both a qualitative and a quantitative assessment. A dataset of 100,000 images of poor diagnostic quality may be less valuable than a dataset of 5,000 high-quality, expertly annotated images.
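
Such a data-quality review typically begins with simple automated audits before any expert re-reading. The sketch below, which assumes a hypothetical annotation table with one row per image, checks the completeness of required fields, conflicting labels for the same image, and duplicate identifiers.

```python
# Minimal dataset audit sketch; the table, column names, and checks are
# illustrative assumptions rather than a prescribed curation procedure.
import pandas as pd

annotations = pd.DataFrame({
    "image_id":  ["a1", "a2", "a3", "a3", "a4"],
    "label":     ["malignant", "benign", "malignant", "benign", None],
    "annotator": ["r1", "r1", "r2", "r3", "r2"],
    "scanner":   ["vendor_A", "vendor_A", "vendor_B", "vendor_B", None],
})

# Completeness: fields that must be populated for every case.
missing_rate = annotations[["label", "scanner"]].isna().mean()
print("missing rate per required field:\n", missing_rate)

# Consistency: the same image should not carry conflicting labels.
labels_per_image = (annotations.dropna(subset=["label"])
                    .groupby("image_id")["label"].nunique())
print("images with conflicting labels:", int((labels_per_image > 1).sum()))

# Integrity: duplicated identifiers can silently inflate the dataset size.
print("duplicate image_id rows:", int(annotations["image_id"].duplicated().sum()))
```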

Finally, the provenance and integrity of the data are essential. The manufacturer must be able to trace the origin of the data, demonstrate that it was collected ethically and with appropriate consent, and show that it has been handled in a way that preserves its integrity. This is particularly important when using retrospective data from multiple sources, as is common in AI development. The AI Act introduces specific requirements for the curation of training data, including the mitigation of biases and the documentation of data sources and preprocessing steps. This documentation becomes part of the technical file and is subject to regulatory review.

Validation Design: Establishing Generalization and Robustness

Validation is the process of providing objective evidence that an AI system meets its specified requirements and is fit for its intended purpose. For medical AI, this goes far beyond a simple accuracy score. The validation design must be structured to prove that the algorithm will perform safely and effectively in the real world—a concept known as generalization. A model that performs well on the data it was trained on (or even on a hold-out test set from the same source) has not yet proven its generalizability.

A robust validation strategy typically involves multiple stages and data sources:

Internal Validation

This is the initial validation performed on data from the development environment, often using techniques like cross-validation. Internal validation is necessary to confirm that the model has learned the underlying patterns in the data, but it is insufficient for regulatory approval on its own.
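
A minimal sketch of this stage, using synthetic data and scikit-learn’s cross-validation utilities, is shown below; in practice the estimator, the metric, and the fold strategy would be the manufacturer’s own pre-specified choices.

```python
# Internal validation sketch: stratified k-fold cross-validation on development data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the development dataset.
X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1_000), X, y,
                         cv=cv, scoring="roc_auc")
print(f"AUC per fold: {scores.round(3)}  mean={scores.mean():.3f}")
```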

External Validation

This is a critical step where the algorithm is tested on data from a different source than the training data. This could be data from a different hospital, a different country, or a different time period. External validation provides the first real evidence of generalizability. Regulators will look for performance metrics that are consistent across different validation sites. Significant performance drops are a major red flag, indicating that the model may be overfitted to the training environment.
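
The sketch below illustrates the idea under simplifying assumptions: the model is trained once on development data, frozen, and then scored on data simulating external sites with shifted covariate distributions; any site whose AUC drops by more than an assumed tolerance is flagged for review. The site simulation and the 0.05 tolerance are illustrative choices, not regulatory thresholds.

```python
# External validation sketch: train once, freeze the model, then score it on
# data simulating external sites. Site shifts and the tolerance are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# One underlying task; the first 3,000 cases act as the development site and
# the remainder as two external sites with site-specific covariate shifts.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_dev, y_dev = X[:3_000], y[:3_000]
sites = {
    "site_B": (X[3_000:4_000] + rng.normal(0.0, 0.3, (1_000, 20)), y[3_000:4_000]),
    "site_C": (X[4_000:] + rng.normal(0.0, 1.5, (1_000, 20)), y[4_000:]),
}

# Train once on development data and keep the model fixed from here on.
X_tr, X_te, y_tr, y_te = train_test_split(X_dev, y_dev, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
auc_internal = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(f"internal hold-out AUC: {auc_internal:.3f}")

# Score the frozen model at each external site and flag large performance drops.
for site, (X_ext, y_ext) in sites.items():
    auc = roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1])
    status = "REVIEW" if auc_internal - auc > 0.05 else "OK"
    print(f"{site}: AUC={auc:.3f} [{status}]")
```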

Prospective Validation

For many high-risk AI systems, regulators will require a prospective validation study. This involves testing the algorithm in a real clinical setting, in real-time, to assess its performance and its impact on clinical workflows and patient outcomes. This type of study is the strongest form of evidence for generalization, as it exposes the algorithm to the full complexity and unpredictability of the clinical environment, including user variability, workflow interruptions, and data quality issues that are not present in curated datasets.

An essential aspect of validation is assessing robustness. A robust AI system maintains its performance even when faced with small, unexpected perturbations in its input data. For an imaging AI, this could mean changes in image contrast, the presence of artifacts, or different imaging protocols. For a natural language processing AI, it could mean variations in clinical terminology or reporting styles. The validation process must intentionally stress-test the algorithm against these variations to identify potential failure points. The AI Act explicitly requires that high-risk AI systems be robust against such errors and that their level of accuracy be maintained throughout their lifecycle.
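
One way to operationalize this stress-testing is to re-score a fixed validation set under controlled perturbations and compare the results against the unperturbed baseline, as in the sketch below. The stand-in model, the synthetic images, and the particular perturbations are assumptions chosen for illustration only.

```python
# Robustness sketch: re-score the same synthetic validation set under controlled
# perturbations. The stand-in model and perturbations are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def model_score(images):
    """Hypothetical stand-in for the AI system: mean-intensity threshold per image."""
    return (images.mean(axis=(1, 2)) > 0.5).astype(int)

# Synthetic validation set: positive cases are brighter on average.
labels = rng.integers(0, 2, size=200)
images = rng.normal(loc=0.3 + 0.4 * labels[:, None, None], scale=0.05,
                    size=(200, 64, 64))

perturbations = {
    "baseline":        lambda x: x,
    "gaussian_noise":  lambda x: x + rng.normal(0.0, 0.10, x.shape),
    "contrast_drop":   lambda x: 0.5 + 0.5 * (x - 0.5),
    "intensity_shift": lambda x: x + 0.25,
}

for name, perturb in perturbations.items():
    accuracy = (model_score(perturb(images)) == labels).mean()
    print(f"{name:<16} accuracy={accuracy:.3f}")
```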

Regulatory Interpretation: The concept of “state of the art” is dynamic. For AI, this means that validation methods considered acceptable today may be superseded by new best practices tomorrow. Manufacturers must continuously monitor the scientific literature and regulatory guidance to ensure their validation approaches remain aligned with the state of the art.

Post-Market Surveillance and the Living Evidence Base

The evidentiary requirements for AI systems do not end once the device is placed on the market. The MDR and the AI Act both mandate a continuous, proactive monitoring process known as Post-Market Surveillance (PMS). For AI, this is not just about identifying device failures in the traditional sense; it is about monitoring for model drift, performance degradation, and emergent biases in a real-world population.

Model drift occurs when the statistical properties of the real-world data change over time, causing the model’s performance to degrade. This can happen for many reasons: changes in patient demographics, new disease variants, updates to clinical equipment, or shifts in clinical practice. A PMS plan for an AI system must include a strategy for detecting this drift. This often involves setting up a continuous monitoring pipeline where the algorithm’s predictions are compared against clinical outcomes on an ongoing basis. Key Performance Indicators (KPIs) must be defined not just for accuracy but also for fairness and consistency across different subgroups.
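
A common building block for such a monitoring pipeline is a distribution-shift statistic computed between the validation baseline and recent live data. The sketch below applies the Population Stability Index (PSI) to model output scores; the bin count and the 0.2 alert threshold are widely used rules of thumb rather than regulatory requirements, and a real deployment would also track outcome-linked KPIs per subgroup.

```python
# Drift-monitoring sketch: compare live model scores against the validation baseline.
import numpy as np

def psi(baseline, live, bins=10):
    """Population Stability Index between two samples of a scalar quantity."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf   # catch values outside the baseline range
    expected = np.histogram(baseline, edges)[0] / len(baseline)
    observed = np.histogram(live, edges)[0] / len(live)
    expected = np.clip(expected, 1e-6, None)
    observed = np.clip(observed, 1e-6, None)
    return float(np.sum((observed - expected) * np.log(observed / expected)))

rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.0, 1.0, 5_000)   # model scores at initial validation
live_scores = rng.normal(0.6, 1.3, 5_000)       # scores after a population shift

value = psi(baseline_scores, live_scores)
print(f"PSI={value:.3f} -> {'ALERT: investigate drift' if value > 0.2 else 'stable'}")
```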

The PMS process feeds directly into the Periodic Safety Update Report (PSUR) required for MDR Class IIa, IIb, and III devices and for IVDR Class C and D devices, and its findings flow back into the Clinical Evaluation Report or, for IVDR devices, the Performance Evaluation Report (PER). These reports must be updated regularly and, for the higher risk classes, reviewed by the Notified Body. For AI, the updated reports should include an analysis of real-world performance data, a summary of any corrective actions taken (e.g., retraining the model), and an assessment of whether the device’s benefit-risk profile remains favorable. This creates a “living” evidence base in which the initial clinical evaluation is continuously updated with new real-world data.

Furthermore, the AI Act introduces the concept of Post-Market Monitoring (PMM) systems, which are specific to AI. These systems are designed to collect and analyze data on the AI system’s performance and any incidents of misuse. The manufacturer is obligated to actively monitor for risks related to the AI system’s interaction with the environment and to report serious incidents to the national competent authorities. This places a significant operational burden on manufacturers to build the necessary data infrastructure for continuous monitoring and reporting.

The Role of Notified Bodies and National Competent Authorities

Throughout the clinical evaluation process, the Notified Body acts as the independent auditor. For high-risk AI devices, the Notified Body will have a team of experts, including clinical specialists and data scientists, who will scrutinize the clinical evaluation file. They will assess the adequacy of the clinical evidence, the validity of the validation study designs, and the robustness of the PMS plan. The Notified Body will pay close attention to the justification for the chosen data sources and the steps taken to mitigate bias. They will also review the technical documentation to ensure it aligns with the clinical claims being made.

National Competent Authorities (NCAs), such as the BfArM in Germany or the ANSM in France, are responsible for market surveillance. They can conduct audits of manufacturers and healthcare institutions to ensure compliance. In the context of AI, NCAs are increasingly developing their own guidance on what constitutes good clinical practice for AI studies and what evidence they expect to see for specific types of algorithms. This can lead to a degree of divergence between Member States, although the MDR and AI Act aim to harmonize standards across the EU. Manufacturers must be aware of any specific national guidance that may apply to their device, especially if they plan to conduct clinical investigations in a particular country.

Practical Challenges and Future Directions

The evidentiary requirements for clinical evaluation of AI are still evolving. One of the most significant practical challenges is the black box nature of many advanced AI models. Regulators require a degree of explainability to understand how a device arrives at its conclusions, especially in high-stakes clinical decisions. While the MDR does not explicitly demand “full explainability,” the need to justify clinical decisions and manage risks means that manufacturers must provide sufficient information to allow clinical users to understand the device’s limitations. This is an area of active research and regulatory discussion.

Another challenge is the use of real-world data (RWD) and real-world evidence (RWE). While regulators encourage the use of RWD to support clinical evaluations, there are significant hurdles related to data quality, standardization, and privacy (GDPR). Accessing and using data from different healthcare systems across Europe is complex. Initiatives like the European Health Data Space (EHDS) aim to facilitate this, but for now, leveraging RWD for regulatory purposes remains a major undertaking that requires careful planning and collaboration with clinical partners.

Ultimately, the regulatory framework for AI in healthcare is designed to foster innovation while ensuring patient safety. The evidentiary requirements are not a barrier but a roadmap for developing robust, reliable, and equitable AI systems. By focusing on high-quality data, rigorous and diverse validation, and a commitment to continuous post-market monitoring, manufacturers can build the body of evidence needed to gain regulatory approval and, more importantly, to earn the trust of clinicians and patients. The process is demanding, but it is the necessary foundation for the responsible integration of AI into European healthcare.
