Testing for Discrimination: A Practical Audit Plan

Testing for discrimination in artificial intelligence systems is not primarily a statistical challenge; it is a governance challenge that requires a structured, auditable process. European regulators and courts are moving beyond high-level principles toward verifiable evidence of fairness, and organisations must be prepared to demonstrate that their systems do not produce unlawful differential treatment. The following plan provides a practical, end-to-end approach to bias and discrimination testing, integrating legal obligations, technical methods, and documentation practices that withstand scrutiny under the EU AI Act, the GDPR, and national anti-discrimination frameworks. It is written for professionals who build, procure, or oversee AI and automated decision systems in both public and private sectors.

Establishing the Legal and Regulatory Frame

Before designing tests, it is essential to anchor the effort in the legal landscape. The EU AI Act introduces a risk-based compliance regime in which high-risk AI systems must be designed for accuracy and robustness, with data governance measures that minimise discriminatory risk; the Act also mandates human oversight and expects continuous monitoring throughout the system lifecycle. The GDPR complements this: Article 22 gives individuals the right not to be subject to decisions based solely on automated processing that produce legal or similarly significant effects, subject to narrow exceptions, and imposes additional safeguards where such decisions rely on special category data. National anti-discrimination laws, such as those implementing the EU’s equal treatment directives, prohibit discrimination on grounds including sex, race, ethnic origin, age, disability, religion or belief, and sexual orientation. In practice, this means that a system may satisfy the AI Act’s process requirements yet still violate anti-discrimination law if its outputs produce unjustified disparate impacts.

Organisations often conflate “bias” as a technical metric with “discrimination” as a legal concept. Bias is a measurable deviation in model performance or outcomes across groups; discrimination is a legal determination about whether differential treatment is justified, proportionate, and lawful. The two are related but not equivalent. European law distinguishes between direct discrimination (treating a person less favourably because of a protected characteristic) and indirect discrimination (a neutral rule that puts persons with a protected characteristic at a particular disadvantage unless it is a proportionate means of achieving a legitimate aim). In AI, indirect discrimination is the more common risk: a feature set that appears neutral may correlate with protected characteristics, leading to systematically less favourable outcomes for certain groups.

The EU AI Act does not define specific numerical thresholds for “fairness.” Instead, it sets obligations to identify, reduce, and document risks of discrimination arising from data, design choices, and operating context. The Act’s emphasis on data governance, robustness, and post-market monitoring implies that organisations must be able to show a coherent chain of evidence: from the definition of the use case and the selection of fairness metrics, through to testing results, mitigation actions, and ongoing oversight. National regulators may supplement these obligations with sector-specific guidance. For example, financial services authorities in several Member States require explainability and fairness testing for credit scoring, while public sector bodies may be subject to additional transparency obligations under national administrative law.

Scope of Testing: What Is Being Assessed

Testing for discrimination should begin with a clear articulation of the system’s purpose and the decisions it supports. The scope must include the full pipeline: data inputs, feature engineering, model training, decision thresholds, and the presentation of outputs to end users. It should also consider downstream effects, such as how a recommendation system influences access to opportunities or how a risk score informs resource allocation. The test plan must define the protected characteristics relevant to the use case and the legal basis for processing any personal data that could reveal those characteristics, directly or indirectly.

It is critical to distinguish between direct uses of protected attributes and proxies. Many systems do not include explicit fields like “sex” or “ethnicity,” but variables such as postcode, educational institution, or transaction patterns can act as strong proxies. European data protection authorities have emphasised that using proxies to infer protected characteristics can be unlawful if it results in discriminatory outcomes or lacks a valid legal basis. Therefore, the testing plan must include an analysis of proxy variables and their correlation with protected attributes, even when those attributes are not explicitly used.

Roles and Accountability

Effective testing requires a cross-functional team. Legal counsel interprets the applicability of anti-discrimination law and the AI Act’s obligations. Data scientists and machine learning engineers design and execute tests, generate metrics, and implement mitigations. Domain experts ensure that the fairness definitions reflect the real-world context and legitimate aims of the system. Compliance officers or a Data Protection Officer (DPO) oversee documentation and alignment with GDPR requirements, particularly when personal data is processed. In public sector contexts, an ethics board or oversight body may need to review findings. The AI Act’s emphasis on human oversight implies that these roles must be empowered to intervene, not merely to observe.

Defining the Fairness Objective

There is no single, universally accepted definition of fairness. The choice of fairness metric is a normative decision that must be justified in light of the system’s purpose and legal constraints. The test plan should explicitly select and document the fairness definitions used, along with the rationale. Common families of fairness metrics include:

  • Group fairness: Measures whether outcomes are similar across protected groups. Examples include equalised odds (similar true positive and false positive rates) and demographic parity (similar selection rates). These metrics are useful for detecting systematic disparities but may conflict with accuracy or calibration.
  • Individual fairness: Requires that similar individuals receive similar outcomes. This is legally intuitive but challenging to implement, as it depends on defining a meaningful similarity metric.
  • Counterfactual fairness: Assesses whether the outcome would change if a protected attribute were different, holding all else constant. This aligns with legal reasoning about causation but requires careful construction of counterfactuals.

European regulators are unlikely to mandate a specific metric. Instead, they will expect a reasoned choice that reflects the context. For example, in hiring, equalised odds may be appropriate to ensure that qualified candidates from all groups have similar chances of progressing. In credit scoring, calibration (ensuring that predicted probabilities reflect actual outcomes) may be essential to avoid mispricing risk, while demographic parity may be inappropriate if groups genuinely differ in creditworthiness for non-discriminatory reasons. The test plan must articulate why a particular metric is fit for purpose and how it aligns with the principle of proportionality.
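
To make these definitions concrete, the following is a minimal sketch, in Python with numpy, of how group fairness gaps such as demographic parity and equalised odds differences could be computed from labelled evaluation data. The synthetic data, array names, and the 0.5 decision threshold are illustrative assumptions rather than recommendations.

    import numpy as np

    def group_rates(y_true, y_pred, group):
        """Selection rate, true positive rate and false positive rate per group."""
        rates = {}
        for g in np.unique(group):
            m = group == g
            yt, yp = y_true[m], y_pred[m]
            rates[g] = {
                "selection_rate": float(yp.mean()),
                "tpr": float(yp[yt == 1].mean()) if (yt == 1).any() else float("nan"),
                "fpr": float(yp[yt == 0].mean()) if (yt == 0).any() else float("nan"),
            }
        return rates

    def largest_gaps(rates):
        """Largest between-group gap for each rate (0 means parity on that metric)."""
        metrics = next(iter(rates.values()))
        return {k: max(r[k] for r in rates.values()) - min(r[k] for r in rates.values())
                for k in metrics}

    # Illustrative synthetic data and a hypothetical 0.5 decision threshold.
    # Demographic parity compares selection_rate across groups;
    # equalised odds compares tpr and fpr across groups.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, 1000)
    group = rng.choice(["A", "B"], 1000)
    y_pred = (rng.random(1000) >= 0.5).astype(int)
    print(largest_gaps(group_rates(y_true, y_pred, group)))

In practice, the resulting gaps would be reviewed against the tolerances defined in the test plan and alongside the other metric families, not treated as pass/fail values on their own.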

Legitimate Aim and Proportionality

Under EU law, even a discriminatory effect can be lawful if it is a proportionate means of achieving a legitimate aim. This requires a documented justification that includes:

  • The legitimate aim of the system (e.g., fraud prevention, safety, credit risk management).
  • Why the chosen approach is necessary and why less discriminatory alternatives are not feasible or would undermine the aim.
  • Evidence that the impact is proportionate, considering the severity of the disadvantage and the benefits achieved.

This justification should be part of the audit documentation and revisited when monitoring reveals disparities. The AI Act’s risk management framework supports this by requiring continuous evaluation and mitigation, which includes revisiting proportionality as new evidence emerges.

Data Inventory and Proxy Analysis

Testing begins with data. The plan should include an inventory of all data sources, their provenance, and the legal basis for processing. For each dataset, document:

  • Which protected characteristics could be inferred, directly or indirectly.
  • Which fields are used as features and which are excluded.
  • How missing values and underrepresented groups are handled.
  • Any sampling biases or historical biases embedded in the data.

Proxy analysis is essential. Postcodes can correlate with ethnicity or socioeconomic status; educational institutions can correlate with age or socioeconomic background; transaction histories can correlate with disability or family status. The test plan should quantify these correlations and assess whether using a proxy is necessary and proportionate. If a proxy is strongly correlated with a protected attribute and leads to disadvantage, the organisation must either remove it, mitigate its impact, or provide a robust justification.
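
As an illustration of how such correlations could be quantified, the sketch below computes Cramér’s V between a candidate proxy and a protected attribute from a contingency table; pandas and numpy are assumed, and the column names and toy data are hypothetical.

    import numpy as np
    import pandas as pd

    def cramers_v(x, y):
        """Cramér's V between two categorical variables (0 = no association, 1 = perfect)."""
        table = pd.crosstab(x, y).to_numpy()
        n = table.sum()
        expected = table.sum(axis=1, keepdims=True) @ table.sum(axis=0, keepdims=True) / n
        chi2 = ((table - expected) ** 2 / expected).sum()
        k = min(table.shape) - 1
        return float(np.sqrt(chi2 / (n * k))) if k > 0 else 0.0

    # Hypothetical example: how strongly does postcode area track ethnic origin?
    df = pd.DataFrame({
        "postcode_area": ["N1", "N1", "E2", "E2", "W3", "W3", "N1", "E2"],
        "ethnic_origin": ["a",  "a",  "b",  "b",  "a",  "b",  "a",  "b"],
    })
    print(cramers_v(df["postcode_area"], df["ethnic_origin"]))

Continuous proxies can be screened in a similar spirit, for example by checking how well the proxy alone predicts the protected attribute.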

It is also important to assess data quality across groups. If a dataset contains more errors for certain groups, the model may learn to penalise those groups unfairly. Document data quality metrics by group and correct imbalances before training where feasible. The AI Act’s data governance requirements emphasise relevance, representativeness, and freedom from errors that could lead to discrimination.

Special Category Data and Inference Risks

Under GDPR, processing special category data (e.g., health, biometric, ethnic origin) is prohibited unless a specific exception applies. Even when such data is not directly processed, models may infer it. The test plan should assess the risk of inference and ensure that any processing of inferred special category data is lawful and necessary. If inference cannot be prevented, the organisation must document why it is unavoidable and what safeguards are in place to prevent discriminatory use.
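
One practical way to assess inference risk is to probe how accurately the model’s input features can reconstruct a protected attribute that is not itself a feature: if a simple classifier trained for that purpose performs well above chance on held-out data, the feature set encodes the attribute. A minimal sketch, assuming scikit-learn and purely synthetic data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import roc_auc_score

    # Hypothetical feature matrix X and a protected attribute that is NOT a model input.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 10))
    protected = (X[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)  # leaks via feature 0

    X_tr, X_te, p_tr, p_te = train_test_split(X, protected, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, p_tr)
    auc = roc_auc_score(p_te, probe.predict_proba(X_te)[:, 1])
    print(f"Protected attribute recoverable from features with AUC = {auc:.2f}")
    # An AUC well above 0.5 signals that the feature set encodes the protected attribute.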

Test Design and Methodology

The audit plan should include a mix of pre-deployment and post-deployment tests. Pre-deployment tests assess the model before it is released; post-deployment tests monitor ongoing performance as data drifts and user populations change.

Pre-Deployment Tests

Start with descriptive analysis of the dataset to understand baseline distributions across protected groups. Then evaluate model performance using the chosen fairness metrics. It is advisable to test across multiple metrics to avoid over-optimising for a single measure. For example, a model may achieve demographic parity but do so by systematically misclassifying a protected group, which would be unacceptable. Use confusion matrices, calibration plots, and threshold analysis to understand how decisions vary across groups.
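
The sketch below illustrates one such threshold analysis: how the false negative rate differs between groups as the decision threshold moves. The threshold grid and the synthetic scores are assumptions for illustration only.

    import numpy as np

    def fnr_by_threshold(y_true, scores, group, thresholds):
        """False negative rate per group at each candidate decision threshold."""
        out = {}
        for g in np.unique(group):
            m = group == g
            positives = scores[m][y_true[m] == 1]   # scores of truly positive cases in group g
            out[g] = [round(float((positives < t).mean()), 3) for t in thresholds]
        return out

    # Synthetic scores where group "B" is scored systematically lower.
    rng = np.random.default_rng(2)
    group = rng.choice(["A", "B"], 5000)
    y_true = rng.integers(0, 2, 5000)
    scores = rng.random(5000) - 0.1 * (group == "B")
    print(fnr_by_threshold(y_true, scores, group, thresholds=[0.3, 0.5, 0.7]))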

Conduct stress tests that simulate edge cases and adversarial conditions. For instance, test how the system behaves when input data is incomplete for certain groups or when the prevalence of a protected characteristic shifts. This helps identify vulnerabilities that could lead to discriminatory outcomes under real-world variation.
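
A minimal sketch of such a stress test, assuming a stand-in scoring function and zero-imputation of a feature that goes missing for one group; the feature index, decision threshold, and data are illustrative.

    import numpy as np

    def score(X):
        """Stand-in for the deployed model's scoring function (hypothetical weights)."""
        return 1 / (1 + np.exp(-(X @ np.array([0.8, -0.5, 0.3]))))

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 3))
    group = rng.choice(["A", "B"], 1000)

    # Stress scenario: feature 2 is missing for group "B" and gets zero-imputed.
    X_stressed = X.copy()
    X_stressed[group == "B", 2] = 0.0

    for label, data in [("baseline", X), ("stressed", X_stressed)]:
        rates = {g: round(float((score(data[group == g]) >= 0.5).mean()), 3) for g in ["A", "B"]}
        print(label, rates)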

Perform counterfactual tests by creating synthetic records that differ only in a protected attribute or a strong proxy. If the outcome changes significantly, the model may be relying on discriminatory signals. Document the magnitude of the change and whether it exceeds a predefined tolerance.
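
The sketch below shows one way such a counterfactual test could be run, again with a hypothetical stand-in scoring function: each record is duplicated with the proxy flipped and the resulting score shift is compared against a predefined tolerance. The feature layout, tolerance value, and data are assumptions.

    import numpy as np

    def score(X):
        """Stand-in for the deployed model's scoring function (hypothetical weights)."""
        return 1 / (1 + np.exp(-(X @ np.array([0.9, 0.4, -0.2]))))

    rng = np.random.default_rng(4)
    X = rng.normal(size=(500, 3))
    X[:, 0] = rng.integers(0, 2, 500)        # column 0: binary proxy for a protected attribute

    X_flipped = X.copy()
    X_flipped[:, 0] = 1 - X_flipped[:, 0]    # counterfactual: flip the proxy, hold all else constant

    shift = np.abs(score(X_flipped) - score(X))
    tolerance = 0.05                          # predefined tolerance from the test plan (assumed)
    print({"mean_shift": round(float(shift.mean()), 3),
           "share_exceeding_tolerance": round(float((shift > tolerance).mean()), 3)})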

Post-Deployment Monitoring

Once the system is live, implement ongoing monitoring to detect drift and emergent disparities. Track fairness metrics alongside accuracy and calibration. Set thresholds that trigger review when disparities exceed a defined level. The AI Act’s post-market monitoring plan should include these fairness indicators as part of the overall performance monitoring. Document any incidents where the system produces discriminatory outcomes, the remedial actions taken, and the timeline for resolution.
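
A minimal sketch of such a monitoring check, which recomputes the selection-rate gap over a window of recent decisions and flags a review when it exceeds the agreed threshold; the threshold value, field names, and use of the standard logging module are illustrative assumptions.

    import logging
    import numpy as np

    logging.basicConfig(level=logging.INFO)

    def check_selection_rate_gap(decisions, groups, alert_threshold=0.1):
        """Flag for review when the between-group selection-rate gap exceeds the threshold."""
        rates = {g: float(decisions[groups == g].mean()) for g in np.unique(groups)}
        gap = max(rates.values()) - min(rates.values())
        if gap > alert_threshold:
            logging.warning("Selection-rate gap %.3f exceeds threshold %.3f: %s",
                            gap, alert_threshold, rates)
        return gap, rates

    # Illustrative window of recent live decisions.
    rng = np.random.default_rng(5)
    groups = rng.choice(["A", "B"], 2000)
    decisions = (rng.random(2000) < np.where(groups == "A", 0.5, 0.35)).astype(int)
    print(check_selection_rate_gap(decisions, groups))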

Include user feedback mechanisms. Individuals who believe they have been unfairly treated should be able to challenge decisions. The test plan should specify how such challenges are investigated and how findings feed back into the model lifecycle. Under GDPR, individuals have the right to object to profiling and to request human review of automated decisions; the AI Act reinforces the need for accessible redress.

Cross-Country Considerations

While the AI Act harmonises core obligations, national anti-discrimination laws and enforcement practices vary. Some countries impose stricter thresholds for indirect discrimination or require proactive equality assessments in the public sector. For example, the UK’s Public Sector Equality Duty requires public bodies to have due regard to equality impacts (with a comparable duty in Northern Ireland under section 75 of the Northern Ireland Act 1998) and, although the UK is no longer a Member State, it continues to inform practice elsewhere. In France, the CNIL has issued guidance on algorithmic discrimination and may require impact assessments for certain high-risk uses. In Germany, sectoral regulators and data protection authorities may impose additional transparency obligations. The test plan should be adaptable to national requirements, particularly for organisations deploying systems across multiple Member States.

Interpreting Results: From Metrics to Legal Judgement

Numbers alone do not determine legality. Interpreting fairness metrics requires contextual judgement. A disparity in selection rates may be acceptable if it reflects a legitimate aim and is proportionate. Conversely, a small disparity may be unlawful if it results from a discriminatory rule or a proxy that cannot be justified. The interpretation process should follow a structured framework:

  • Identify the disparity: Quantify the difference in outcomes across groups using the chosen metrics.
  • Assess materiality: Determine whether the disparity is practically and legally significant, not just statistically significant (a sketch of this step follows the list).
  • Investigate causes: Analyse whether the disparity arises from data quality issues, feature selection, model design, or threshold choices.
  • Consider justification: Evaluate whether the disparity serves a legitimate aim and is proportionate.
  • Document the reasoning: Record the decision-making process, including any trade-offs between fairness and accuracy.
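
As an example of the materiality step, the sketch below estimates the selection-rate difference between two groups together with a simple bootstrap confidence interval, so that a disparity is reported with its uncertainty rather than as a bare point estimate; the data and the number of resamples are illustrative.

    import numpy as np

    def selection_rate_diff_ci(decisions, groups, g1, g2, n_boot=2000, alpha=0.05, seed=0):
        """Bootstrap confidence interval for the selection-rate difference between two groups."""
        rng = np.random.default_rng(seed)
        d1, d2 = decisions[groups == g1], decisions[groups == g2]
        diffs = []
        for _ in range(n_boot):
            s1 = rng.choice(d1, size=d1.size, replace=True)
            s2 = rng.choice(d2, size=d2.size, replace=True)
            diffs.append(s1.mean() - s2.mean())
        lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
        return float(d1.mean() - d2.mean()), (float(lo), float(hi))

    # Illustrative data: group "B" selected less often than group "A".
    rng = np.random.default_rng(6)
    groups = rng.choice(["A", "B"], 3000)
    decisions = (rng.random(3000) < np.where(groups == "A", 0.40, 0.32)).astype(int)
    print(selection_rate_diff_ci(decisions, groups, "A", "B"))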

It is important to avoid “fairness gerrymandering,” where a model is tuned to pass a specific metric while still producing discriminatory outcomes in practice. Using multiple metrics and cross-checking with qualitative expert review helps prevent this. The AI Act’s emphasis on robustness implies that models should not be brittle to fairness checks; they should maintain performance under scrutiny and across different subgroups.

When to Halt or Modify a System

The audit plan should define clear thresholds for action. If a system produces significant, unjustified disparities for a protected group, it should not be deployed or should be taken offline until mitigated. Mitigations may include:

  • Removing or transforming problematic features.
  • Adjusting decision thresholds to equalise error rates.
  • Rebalancing training data or using reweighting techniques.
  • Introducing human-in-the-loop review for borderline cases.
  • Adding transparency measures that allow users to understand and contest decisions.

Document the rationale for any mitigation and retest to confirm effectiveness. If no acceptable mitigation exists, the organisation may need to reconsider the use case or restrict its scope to avoid discriminatory impacts.
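
To illustrate the threshold-adjustment mitigation listed above, the sketch below derives per-group thresholds aimed at a common true positive rate; the target rate and synthetic scores are assumptions.

    import numpy as np

    def threshold_for_target_tpr(y_true, scores, target_tpr):
        """Threshold at which approximately target_tpr of true positives score at or above it."""
        positives = np.sort(scores[y_true == 1])
        k = int(np.floor((1 - target_tpr) * positives.size))
        return positives[min(k, positives.size - 1)]

    # Illustrative data: group "B" is scored lower, so a single global threshold
    # would give it a lower true positive rate.
    rng = np.random.default_rng(7)
    group = rng.choice(["A", "B"], 4000)
    y_true = rng.integers(0, 2, 4000)
    scores = np.clip(rng.random(4000) - 0.1 * (group == "B"), 0, 1)

    thresholds = {g: float(threshold_for_target_tpr(y_true[group == g], scores[group == g], 0.8))
                  for g in ["A", "B"]}
    print(thresholds)   # per-group thresholds aimed at a common 80% true positive rate

Whether group-specific thresholds are lawful and proportionate depends on the jurisdiction and the legitimate aim, so any such adjustment must be tied back to the legal analysis documented earlier.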

Documentation and Evidence Management

Documentation is a compliance deliverable, not an afterthought. Under the AI Act, high-risk systems require technical documentation that includes data governance, testing methods, and risk management measures. Under GDPR, records of processing activities and data protection impact assessments (DPIAs) must capture risks to rights and freedoms and the measures taken to address them. For discrimination testing, the documentation should include:

  • System description: Purpose, stakeholders, deployment context, and decision types.
  • Legal analysis: Applicable laws, legitimate aims, and proportionality justification.
  • Data inventory: Sources, features, proxies, and data quality assessments by group.
  • Fairness definitions: Chosen metrics, thresholds, and rationale.
  • Test methodology: Datasets used, sampling strategies, test scenarios, and stress tests.
  • Results: Quantitative metrics, visualisations, counterfactual analyses, and incident logs.
  • Mitigation actions: Changes made, retesting outcomes, and residual risks.
  • Monitoring plan: Ongoing metrics, alert thresholds, and governance processes.
  • Redress mechanisms: How individuals can challenge decisions and how challenges are handled.

Version control is essential. Each update to the model or data pipeline should trigger a new documentation entry, including the date, author, and summary of changes. This supports traceability and enables regulators to reconstruct the compliance journey. For public sector bodies, documentation may need to be disclosed under transparency laws, so it should be written in clear, accessible language.
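
One lightweight way to keep such entries consistent and reviewable is to store them as structured records in version control alongside the model code. The fields below mirror the documentation items listed above; the structure, names, and example values are assumptions about how an organisation might choose to record them, not a prescribed format.

    from dataclasses import dataclass, field, asdict
    from datetime import date
    import json

    @dataclass
    class FairnessAuditEntry:
        """One versioned documentation entry for a model or data pipeline change."""
        entry_date: str
        author: str
        change_summary: str
        model_version: str
        fairness_metrics: dict          # e.g. {"selection_rate_gap": 0.04, "tpr_gap": 0.02}
        thresholds: dict                # alert/tolerance thresholds in force at this version
        mitigations: list = field(default_factory=list)
        residual_risks: list = field(default_factory=list)

    # Hypothetical example entry.
    entry = FairnessAuditEntry(
        entry_date=str(date.today()),
        author="jane.doe",
        change_summary="Retrained on Q3 data; postcode feature removed after proxy analysis.",
        model_version="2.4.0",
        fairness_metrics={"selection_rate_gap": 0.04, "tpr_gap": 0.02},
        thresholds={"selection_rate_gap": 0.10},
        mitigations=["Removed postcode feature"],
        residual_risks=["Underrepresentation of applicants over 60 in training data"],
    )
    print(json.dumps(asdict(entry), indent=2))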

Privacy-Preserving Approaches

Testing often requires access to sensitive data. Organisations should adopt privacy-preserving techniques to minimise risk. Differential privacy can protect individual records while allowing aggregate analysis. Federated learning can enable testing across datasets without centralising personal data. Synthetic data, if properly validated, can be used for stress tests and counterfactuals. Document the methods used and the privacy safeguards applied. This demonstrates compliance with GDPR’s data minimisation and security principles while supporting robust fairness testing.
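
As a concrete example, the Laplace mechanism can be applied to the per-group counts behind a selection-rate comparison so that published aggregate figures carry a formal differential privacy guarantee; the epsilon value, budget split, and data below are illustrative assumptions.

    import numpy as np

    def dp_count(true_count, epsilon, rng):
        """Laplace mechanism for a counting query (sensitivity 1)."""
        return true_count + rng.laplace(scale=1.0 / epsilon)

    def dp_selection_rates(decisions, groups, epsilon, seed=0):
        """Differentially private per-group selection rates built from noisy counts."""
        rng = np.random.default_rng(seed)
        rates = {}
        for g in np.unique(groups):
            d = decisions[groups == g]
            # Each of the two counts for a group consumes half of that group's privacy budget.
            selected = dp_count(float(d.sum()), epsilon / 2, rng)
            total = dp_count(float(d.size), epsilon / 2, rng)
            rates[str(g)] = max(0.0, min(1.0, selected / total))
        return rates

    rng = np.random.default_rng(8)
    groups = rng.choice(["A", "B"], 5000)
    decisions = (rng.random(5000) < np.where(groups == "A", 0.45, 0.38)).astype(int)
    print(dp_selection_rates(decisions, groups, epsilon=1.0))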

Communication of Findings

Communicating fairness findings requires tailored messages for different audiences. Internal stakeholders (executives, product managers, legal, and engineering) need actionable insights and clear recommendations. External stakeholders (regulators, auditors, affected individuals) need transparency and accessible explanations. The communication plan should include:

  • Executive summary: High-level findings, key risks, and decisions required.
  • Technical report: Detailed metrics, methodologies, and code or configuration details sufficient for replication.
  • Public summary: A non-technical explanation of the system’s purpose, fairness approach, and known limitations, suitable for transparency obligations.
  • Individual notices: Where automated decisions have significant effects, provide meaningful information about the logic involved and the right to human review, as required by GDPR.

When communicating disparities, avoid euphemisms that obscure impact. Clearly state the nature of the disparity, the affected group, and the steps taken to address it. If trade-offs were made (e.g., between accuracy and equalised odds), explain why the chosen balance is proportionate. The AI Act’s emphasis on human oversight implies that decision-makers must understand the system well enough to intervene; clear communication is a prerequisite.

Handling Sensitive Disclosures

Public disclosure of fairness findings can be sensitive. The test plan should specify how to handle findings that reveal discriminatory impacts without exposing individuals to harm. Where possible, publish aggregate results and mitigation measures. If the system is operated by a public authority, be mindful of national transparency laws and any exemptions that protect commercial confidentiality or security. In all cases, ensure that affected individuals can access information relevant to their own decisions.

Operationalising the Audit Plan

To make the plan operational, embed fairness testing into the organisation’s standard processes. This includes integrating it into the software development lifecycle, change management, and risk management frameworks. The following steps provide a practical roadmap:

  1. Initiate: Define the use case, stakeholders, and legal context. Identify protected characteristics and potential proxies.
  2. Assess: Inventory data, perform proxy analysis, and conduct a DPIA where required. Document legitimate aims and proportionality.
  3. Design: Select fairness metrics, thresholds, and test scenarios. Plan pre-deployment and post-deployment tests.