
Proxy Variables and Indirect Discrimination

European legal frameworks and technical systems are increasingly intertwined, creating a complex landscape where statistical correlations can trigger legal liabilities. In the context of Artificial Intelligence (AI) and automated decision-making, few concepts are as critical—or as misunderstood—as indirect discrimination and the role of proxy variables. As organizations deploy predictive models in recruitment, credit scoring, insurance, and public services, the technical reality is that algorithms often learn discriminatory patterns not because they are explicitly programmed to do so, but because they identify seemingly neutral variables that function as stand-ins for protected characteristics. This article explores the legal definition of indirect discrimination under EU law, examines the mechanics of proxy variables in machine learning, and analyzes why the common industry practice of simply removing sensitive attributes is insufficient to ensure compliance.

The Legal Foundation of Indirect Discrimination in EU Law

To understand the risk profile of AI systems, one must first grasp the legal architecture of non-discrimination in Europe. The primary source of this framework is Council Directive 2000/43/EC (the “Racial Equality Directive”) and Council Directive 2000/78/EC (the “Employment Equality Directive”). These directives prohibit both direct and indirect discrimination. While direct discrimination is relatively straightforward—treating a person less favorably on grounds of race, religion, age, or disability—indirect discrimination is more subtle and far more relevant to algorithmic governance.

Under Article 2(2)(b) of Directive 2000/43/EC, indirect discrimination is defined as follows:

“where an apparently neutral provision, criterion or practice would put persons of a racial or ethnic origin at a particular disadvantage compared with other persons, unless that provision, criterion or practice is objectively justified by a legitimate aim and the means of achieving that aim are appropriate and necessary.”

Directive 2000/78/EC contains an equivalent definition covering religion or belief, disability, age and sexual orientation.

This definition contains three essential elements that an AI system must navigate:

  1. The Apparent Neutrality: The rule (here, an algorithmic model or a specific feature it uses) must look neutral on the surface. It cannot explicitly mention race, gender, or age.
  2. The Disparate Impact: It must result in a particular disadvantage for a protected group. In the EU, the “group comparison” is usually assessed by comparing the proportion of persons in the protected group who are disadvantaged against the proportion of persons not in that group.
  3. The Lack of Objective Justification: The provision, criterion or practice must not be objectively justified by a legitimate aim pursued through appropriate and necessary means.

Crucially, EU law, building on the case law of the Court of Justice of the European Union (CJEU), provides that the burden of proof shifts in indirect discrimination cases. Once a claimant establishes facts from which a disparate impact may be presumed, the burden shifts to the defendant (the organization using the AI) to prove that the criterion is objectively justified by a legitimate aim and that the means of achieving it are appropriate and necessary. This legal dynamic places a heavy evidentiary burden on organizations using automated systems.

Scope of Application: The General Data Protection Regulation (GDPR)

While non-discrimination directives provide the substantive law, the General Data Protection Regulation (GDPR) provides the procedural and technical safeguards. Article 22 GDPR specifically addresses automated decision-making, granting data subjects the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects or similarly significantly affects them.

Although the GDPR does not explicitly define “discrimination,” it recognizes the risk. Recital 71 calls on controllers to use appropriate technical and organizational measures that prevent, inter alia, discriminatory effects on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, and states that automated decision-making based on special categories of data should be allowed only under specific conditions. Furthermore, Article 9 prohibits processing of “special categories of personal data” (sensitive data) unless specific exceptions apply. However, the intersection of the GDPR and non-discrimination law creates a paradox: to check for discrimination (compliance), an organization might need to process sensitive data (which the GDPR restricts). European regulators are currently grappling with how to reconcile the “right to non-discrimination” with the “right to data protection.”

The Mechanics of Proxy Variables in AI Systems

In machine learning, a proxy variable is an observable variable that is not the variable of interest but is correlated with it. In the context of discrimination, a proxy variable is a feature in a dataset that is statistically correlated with a protected characteristic (e.g., race, gender, age) and serves as a substitute for it within a predictive model.

AI models, particularly complex ensembles like Random Forests or Deep Neural Networks, are designed to maximize predictive accuracy. They do not possess an inherent understanding of social context or legal boundaries. If a variable correlates with the target outcome (e.g., loan repayment, job success), the model will utilize it, regardless of whether that variable is “sensitive” or “neutral.”
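The mechanism can be illustrated with a minimal synthetic sketch in Python (all variable names, distributions, and correlations below are invented for illustration, not taken from any real dataset): a model that never sees the protected attribute still scores the two groups very differently, because a correlated “neutral” feature carries the same information.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical protected attribute (never shown to the model).
group = rng.integers(0, 2, n)

# Apparently neutral feature (e.g., a district indicator) that is strongly
# correlated with group membership due to residential segregation.
district = (group + (rng.random(n) < 0.2)) % 2

# Historical outcome that reflects past disadvantage of group 1.
outcome = (rng.random(n) < np.where(group == 1, 0.3, 0.6)).astype(int)

# Train only on the apparently neutral feature.
model = LogisticRegression().fit(district.reshape(-1, 1), outcome)
scores = model.predict_proba(district.reshape(-1, 1))[:, 1]

# The model never saw `group`, yet its scores differ sharply by group.
print("mean score, group 0:", scores[group == 0].mean())
print("mean score, group 1:", scores[group == 1].mean())
```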

How Proxies Emerge

Proxies emerge because human societies are structured by historical inequalities. Variables that seem innocuous are often deeply embedded in social stratification.

Geographic Proxies

Perhaps the most pervasive proxy in European AI systems is location. While an algorithm is explicitly forbidden from using “race” as an input, it is often permitted to use “postal code” or “neighborhood.” In many European cities, residential segregation means that specific postal codes have high concentrations of specific ethnic or immigrant populations. Furthermore, the CJEU has long held that apparently neutral residence-based criteria can constitute indirect discrimination where they place migrant or non-national workers at a particular disadvantage (see, for example, O’Flynn v Adjudication Officer, C-237/94). An AI model predicting creditworthiness might penalize applicants from a specific district, effectively discriminating based on ethnicity without ever seeing an ethnicity variable.

Behavioral and Digital Proxies

Digital footprints offer a rich ground for proxies. For example:

  • Time of activity: The time of day at which a user interacts with a system can correlate with age or employment status.
  • Device type: The specific model of smartphone used can be a strong proxy for socioeconomic status.
  • Language patterns: Linguistic analysis can inadvertently target specific dialects associated with minority groups.

These variables are neither protected characteristics under non-discrimination law nor special categories of data under the GDPR, yet they allow the model to reconstruct a protected characteristic with high probability.

The “Fairness” Trade-off

From a technical standpoint, removing sensitive attributes and their strongest proxies often degrades the predictive accuracy of the model. Engineers may argue that using a proxy (like a postal code) improves the overall predictive power of the system. However, from a legal standpoint, statistical accuracy is not a defense against discrimination. If the “improved” accuracy comes at the cost of systematically disadvantaging a protected group, the system is legally flawed.

Why Removing Sensitive Attributes is Insufficient

The most common misconception among data scientists and compliance officers is that “fairness” can be achieved simply by dropping protected columns (e.g., `gender`, `race`, `religion`) from the training dataset. This practice, known as “fairness through unawareness,” is widely discredited in both legal theory and machine learning research.

The Persistence of Information

Even if the column `gender` is deleted, the information contained within it is rarely deleted from the dataset. Other variables act as correlates. For instance, in a dataset of historical hiring decisions, variables like “years of experience,” “hobby,” or “university attended” may be heavily gendered due to historical societal trends. The model will learn that “playing rugby” correlates with “male” and “high hire probability,” or “taking a career break” correlates with “female” and “lower promotion probability.”

Research has shown that even when all identifiable demographic information is removed, it is often possible to infer the protected attribute with high accuracy from the remaining data. Therefore, the model retains the discriminatory logic, merely encoded in the weights of the non-sensitive variables.
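This can be tested directly on one’s own data. The following is a minimal sketch, assuming the protected attribute has been collected lawfully for auditing purposes and that `X` holds only the remaining, apparently neutral features (both names are placeholders): if a classifier can reconstruct the protected attribute from `X` with an AUC well above 0.5, the information is still present.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def proxy_leakage_auc(X, protected):
    """Estimate how well the apparently neutral features predict the protected attribute.

    An AUC near 0.5 suggests little leakage; values approaching 1.0 mean the
    protected attribute is effectively still encoded in the data.
    """
    clf = GradientBoostingClassifier(random_state=0)
    return cross_val_score(clf, X, protected, cv=5, scoring="roc_auc").mean()

# Usage (X: feature matrix without sensitive columns, protected: binary vector):
# print(proxy_leakage_auc(X, protected))
```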

The Legal View on “Neutral” Criteria

European courts have long recognized that a criterion can be neutral on its face but discriminatory in its effect. The landmark case Bilka-Kaufhaus GmbH v. Weber von Hartz (1986) established the test for indirect discrimination regarding pay. The court focused on the actual impact of the rule, not the intent or the label of the variables used.

Applying this to AI: If an algorithm uses “commuting distance” to predict job retention, and women (who disproportionately bear caregiving responsibilities) tend to live closer to work, the algorithm might penalize those living further away. By removing “gender,” the company has not removed the discrimination; it has merely hidden the mechanism. The legal liability remains.

The Risk of “Redlining”

Without sensitive attributes, models can engage in digital redlining. This occurs when the algorithm draws boundaries around neighborhoods or groups of users based on proxies, effectively denying services to specific demographics. In the insurance sector, for example, telematics data (driving behavior) has been shown to correlate with age and gender. An insurer using telematics might claim neutrality, but if the resulting premiums systematically disadvantage older drivers, it may constitute indirect discrimination.

Regulatory Obligations and Technical Mitigation

Organizations deploying AI in Europe must move beyond simple data hygiene and adopt a robust governance framework that addresses the reality of proxy variables. This involves a combination of legal compliance and technical auditing.

The Requirement for Data Protection Impact Assessments (DPIAs)

Under GDPR Article 35, a Data Protection Impact Assessment is mandatory when processing is likely to result in a high risk to the rights and freedoms of natural persons. The use of new technologies (such as AI) and the systematic and extensive evaluation of personal aspects (such as performance at work or economic situation) based on automated processing trigger this requirement.

A compliant DPIA must:

  • Systematically describe the processing operations and the purposes.
  • Assess the necessity and proportionality.
  • Identify and assess risks.

Crucially, the DPIA must explicitly look for risks of discrimination. This means the assessment cannot simply ask “Did we use sensitive data?” It must ask, “Are there variables in our dataset that function as proxies for sensitive data?” and “Does the output of the model disproportionately impact protected groups?”

Algorithmic Impact Assessments (AIAs)

The EU AI Act (Regulation (EU) 2024/1689) has now entered into force, with its obligations being phased in over the coming years. High-risk AI systems (such as those used in recruitment or credit scoring) will be subject to strict obligations. Annex IV of the AI Act requires technical documentation covering, among other things, a general description of the AI system, a detailed description of its elements and development process, and the risk-management measures applied to protect health, safety and fundamental rights.

In addition, Article 27 of the AI Act requires certain deployers of high-risk AI systems (notably public bodies, private entities providing public services, and deployers of systems used for credit scoring or for life and health insurance pricing) to conduct a fundamental rights impact assessment before putting the system into use. This assessment must identify the specific risks to fundamental rights, including the risk of discrimination arising from the use of proxies.

Measuring Disparate Impact

To satisfy regulators, organizations must quantify the impact of their models. EU law does not prescribe a fixed statistical threshold, but the “four-fifths rule” (borrowed from US jurisprudence) is widely used as a practical benchmark in Europe, alongside similar statistical tests. This rule suggests that the selection rate for a protected group should not be less than 80% of the rate for the group with the highest rate.

For example, if an AI recruitment tool hires 20% of male applicants, it should hire at least 16% of female applicants. If it does not, there is a presumption of disparate impact. Organizations must run these tests on their training, validation, and live production data.
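This check is easy to automate. The sketch below (column names are illustrative) computes each group’s selection rate, compares it with the best-performing group, and flags ratios below the four-fifths benchmark.

```python
import pandas as pd

def four_fifths_check(df: pd.DataFrame, group_col: str, selected_col: str) -> pd.DataFrame:
    """Compare each group's selection rate to the highest-rate group.

    A ratio below 0.8 is commonly treated as raising a presumption of
    disparate impact that the organization must then justify.
    """
    rates = df.groupby(group_col)[selected_col].mean()
    ratio = rates / rates.max()
    return pd.DataFrame({
        "selection_rate": rates,
        "ratio_to_best": ratio,
        "below_four_fifths": ratio < 0.8,
    })

# Illustration: 20% of male applicants hired but only 15% of female applicants,
# giving a ratio of 0.75, which falls below the 0.8 benchmark.
applicants = pd.DataFrame({
    "gender": ["m"] * 100 + ["f"] * 100,
    "hired": [1] * 20 + [0] * 80 + [1] * 15 + [0] * 85,
})
print(four_fifths_check(applicants, "gender", "hired"))
```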

Strategies for Managing Proxy Discrimination

Addressing proxy discrimination requires a shift from “data exclusion” to “algorithmic correction.” There are several technical and procedural strategies available to European practitioners.

1. Causal Inference and DAGs

The most robust approach is to move from correlation-based modeling to causal modeling. By constructing Causal Graphs (Directed Acyclic Graphs or DAGs), data scientists can map the relationships between variables. This allows them to distinguish between “confounders” (variables that cause both the input and the output) and “mediators” (variables that are part of the causal pathway).

For example, in a hiring model, “education level” might be a mediator if the job genuinely requires it, but it might be a proxy for socioeconomic background if the educational system is biased. Causal inference helps identify which variables to include and which to exclude to avoid biasing the model.
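One lightweight way to make these causal assumptions explicit is to write the DAG down in code and enumerate every path through which a protected attribute (or its upstream causes) can reach the model’s output. The sketch below uses the networkx library; the graph encodes hypothetical domain assumptions for a hiring model, not relationships derived from data.

```python
import networkx as nx

# Hypothetical causal assumptions for a hiring model.
dag = nx.DiGraph([
    ("socioeconomic_background", "education_level"),
    ("gender", "career_break"),
    ("education_level", "hiring_score"),
    ("career_break", "hiring_score"),
    ("job_skill", "hiring_score"),
])

# Every directed path from a protected attribute (or its upstream cause) to the
# score is a candidate proxy pathway that must be blocked or objectively justified.
for source in ("socioeconomic_background", "gender"):
    for path in nx.all_simple_paths(dag, source, "hiring_score"):
        print(" -> ".join(path))
```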

2. Fairness Constraints in Optimization

Modern machine learning libraries allow for the inclusion of fairness constraints directly into the loss function of the model. This technique, often called “fairness-aware machine learning,” penalizes the model during training if it produces disparate impacts.

For instance, a developer can specify a constraint that the False Negative Rate must be equal across protected groups. The model then learns to optimize accuracy while satisfying this legal constraint. This is technically superior to post-processing (fixing the results after the model runs) because it addresses the bias at the source.
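As an illustration of what this looks like in practice, the sketch below uses the open-source fairlearn library (one option among several, not referenced in the original text) to wrap an ordinary classifier in an equalized-odds constraint, which bounds the difference in true- and false-positive rates, and hence false-negative rates, across groups. `X`, `y`, and `sensitive` are placeholders for the feature matrix, labels, and protected-attribute vector used only to enforce the constraint.

```python
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds
from sklearn.linear_model import LogisticRegression

def train_with_fairness_constraint(X, y, sensitive):
    """Train a classifier subject to an equalized-odds constraint across groups."""
    base = LogisticRegression(max_iter=1000)
    # The reduction repeatedly reweights the training data so that the final
    # classifier approximately satisfies the fairness constraint.
    mitigator = ExponentiatedGradient(base, constraints=EqualizedOdds())
    mitigator.fit(X, y, sensitive_features=sensitive)
    return mitigator

# predictions = train_with_fairness_constraint(X, y, sensitive).predict(X_new)
```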

3. Synthetic Data and Counterfactual Testing

Organizations should utilize synthetic data generation to test how models behave under different demographic scenarios. By creating “counterfactual” versions of a dataset—where the only change is a protected attribute (imputed via proxies)—analysts can isolate the bias introduced by the model. This is particularly useful when real data is scarce or when privacy laws prevent the use of actual sensitive data.
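A simple version of such a counterfactual test is sketched below, assuming a fitted `model` exposing `predict_proba` and a DataFrame `X` that contains a binary protected-attribute column (real, inferred, or synthetic); all names are placeholders. Flipping only the attribute column tests direct dependence; a fuller test would also regenerate the proxy features that depend on it.

```python
import numpy as np
import pandas as pd

def counterfactual_shift(model, X: pd.DataFrame, attr: str) -> float:
    """Average absolute change in score when only the protected attribute is flipped.

    A large shift means the model's output depends on the attribute itself;
    a near-zero shift does not rule out dependence via remaining proxies.
    """
    original = model.predict_proba(X)[:, 1]
    flipped = X.copy()
    flipped[attr] = 1 - flipped[attr]  # assumes a 0/1 encoded attribute
    counterfactual = model.predict_proba(flipped)[:, 1]
    return float(np.mean(np.abs(counterfactual - original)))

# print(counterfactual_shift(model, X_test, "group"))
```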

4. Human-in-the-Loop and Explainability

Under the GDPR, data subjects are entitled to transparency about automated decisions. Article 13(2)(f) requires that the data controller provide information about the existence of automated decision-making, including “meaningful information about the logic involved,” and Recital 71 refers to a right to obtain an explanation of a decision reached after such automated assessment.

If a model relies on complex proxies, it becomes difficult to explain the logic in a way that is “meaningful” to a human. If an applicant is rejected by an AI, and the reason given is “based on your digital footprint,” this is likely insufficient under EU law. Organizations must be able to trace the decision back to specific, non-proxy variables or accept that the model is too opaque to be used in high-stakes scenarios.
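For models simple enough to be used in high-stakes settings, the “meaningful information about the logic involved” can be generated mechanically for each decision. The sketch below assumes a fitted scikit-learn LogisticRegression and a list of human-readable feature names (both placeholders) and lists the features that pushed an individual applicant’s score down the most.

```python
import numpy as np

def top_negative_reasons(model, feature_names, x, k=3):
    """Return the k features that lowered this applicant's score the most.

    For a linear model, coefficient * value is the per-feature contribution
    to the log-odds of a positive decision.
    """
    contributions = model.coef_[0] * x
    order = np.argsort(contributions)  # most negative contributions first
    return [(feature_names[i], float(contributions[i])) for i in order[:k]]

# Example usage for a rejected applicant x_rejected (a 1-D numpy array):
# for name, contribution in top_negative_reasons(model, feature_names, x_rejected):
#     print(f"{name}: {contribution:+.2f}")
```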

National Implementations and Cross-Border Nuances

While EU directives set the minimum standards, member states implement them differently. This creates a fragmented regulatory landscape for organizations operating across borders.

The United Kingdom

Post-Brexit, the UK retains the Equality Act 2010, which mirrors the EU directives on indirect discrimination. However, the UK’s approach to AI regulation is diverging. The UK government has favored a principles-based, pro-innovation approach, delegating oversight to existing regulators (such as the Equality and Human Rights Commission) rather than creating a single AI regulator. In the UK, the burden of proof in indirect discrimination cases shifts to the employer once a prima facie case is made, making the defense “the AI used a proxy, not a sensitive attribute” highly risky.

Germany

Germany has some of the strictest data protection and labor laws in Europe. The Federal Data Protection Act (BDSG) supplements the GDPR with specific restrictions. In German labor law, the use of algorithms to monitor employees or make hiring decisions is subject to strict co-determination rights of the Works Council (Betriebsrat). A company cannot simply deploy an AI hiring tool in Germany without involving the Works Council, which has a co-determination right over technical systems capable of monitoring employees’ behavior or performance (Section 87(1) No. 6 of the Works Constitution Act). This provides a built-in check against proxy discrimination before the system even goes live.

France

French regulators, particularly the CNIL (Commission Nationale de l’Informatique et des Libertés), have been very active in defining “privacy-preserving” methods for checking discrimination. They emphasize that the use of sensitive data for the purpose of auditing algorithms for discrimination can be compatible with the GDPR, provided strict safeguards are in place.

The Netherlands

Dutch policymakers have been developing algorithmic-accountability measures, including a public register of government algorithms and proposals for mandatory risk assessments for high-risk AI systems. The Dutch approach focuses heavily on the transparency of the “logic” of the system. This puts pressure on organizations to document exactly why they chose specific variables and to prove that those variables are not proxies for protected characteristics.

Practical Steps for Compliance

For professionals managing AI systems in Europe, the path forward requires a multidisciplinary approach. Legal teams cannot simply hand off compliance to data scientists, and data scientists cannot rely solely on legal guidelines to build models.

Step 1: Variable Audit

Before model training begins, conduct a variable audit. Every input feature should be scrutinized for its potential correlation with protected characteristics. This involves domain expertise. For example, a credit risk analyst knows that “number of credit cards” might correlate with age. A hiring manager knows that “membership in a specific student society” might correlate with gender or class.

Step 2: Proxy Detection Analysis

Run statistical tests to measure the correlation between non-sensitive features and protected attributes (if available for auditing purposes) or proxies. Techniques like variable importance analysis combined with demographic inference can reveal whether the model is relying too heavily on a specific postal code or university.
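This step can be operationalized with a per-feature screen. The sketch below (again assuming a numeric feature frame `X` and a protected-attribute vector `protected` collected for auditing only) ranks each feature by the mutual information it shares with the protected attribute; high-scoring features are candidate proxies that should feed into the documentation described in Step 3 below.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_proxy_candidates(X: pd.DataFrame, protected) -> pd.Series:
    """Rank features by mutual information with the protected attribute.

    Categorical features are assumed to be numerically encoded; high scores
    flag candidate proxies for closer review and justification.
    """
    mi = mutual_info_classif(X, protected, random_state=0)
    return pd.Series(mi, index=X.columns).sort_values(ascending=False)

# print(rank_proxy_candidates(X, protected).head(10))
```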

Step 3: Documentation of “Legitimate Aim”

If a proxy variable is deemed necessary (e.g., “commuting distance” is relevant for a logistics job), the organization must document the objective justification. This requires proving that the aim is legitimate (e.g., ensuring employees arrive on time) and that the means are appropriate and necessary (e.g., there is no less discriminatory way to ensure punctuality). This documentation is the core of the defense the organization will rely on if the system is later challenged, because once a disparate impact is shown, the burden of proof shifts to the organization.
