
Generative AI Worldwide: Training Data, Copyright, Transparency, and Safety Controls

The global regulatory landscape for generative artificial intelligence is not a monolith; it is a fragmented mosaic of competing philosophies, legal traditions, and strategic economic priorities. For professionals deploying or developing foundation models and generative systems, understanding these divergences is no longer an academic exercise—it is a prerequisite for operational viability. The European Union has chosen a path of comprehensive, horizontal legislation embedded within the AI Act, imposing specific obligations on General Purpose AI (GPAI) providers. The United States relies on a patchwork of sectoral enforcement, executive orders, and high-stakes copyright litigation to shape boundaries. China utilizes a stringent, state-centric approach focused on content security and ideological alignment. Meanwhile, the United Kingdom, Japan, and Singapore are attempting to carve out innovation-friendly niches, prioritizing flexibility over rigid prescriptive rules. This analysis dissects these approaches through the critical lenses of training data legality, disclosure expectations, watermarking, and safety evaluation practices.

The European Union: A Comprehensive Framework for GPAI

The European approach is defined by the AI Act (Regulation (EU) 2024/1689), which establishes a risk-based framework. While the Act regulates various AI systems, it introduces a distinct category for General Purpose AI (GPAI) models—the technical backbone of modern generative AI. The EU’s philosophy is rooted in the concept of “trustworthy AI,” mandating compliance with fundamental rights, democracy, and safety standards before market entry.

Training Data Legality and Copyright

Under the AI Act, providers of GPAI models must adhere to obligations regarding the data used for training. While the Act does not explicitly ban the use of copyrighted material for training, it requires that providers put in place a policy to comply with EU copyright and related rights law. Specifically, this relates to the text and data mining exceptions.

Crucially, the Act obliges providers to identify and comply with rights reservations made under the text and data mining exception. If a rightsholder opts out of text and data mining in an appropriate manner, the provider must respect this reservation. This creates a dynamic tension: providers must demonstrate that they have either obtained permission or that their use falls within the permissible scope of the Directive on Copyright in the Digital Single Market (2019/790). In practice, robust documentation of data sources and the implementation of “opt-out” recognition mechanisms (honoring machine-readable reservations like those signaled via robots.txt or specific metadata) are becoming a compliance baseline, as sketched below.
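
Where such opt-out recognition is implemented in code, a minimal sketch might look like the following (Python). It assumes a hypothetical crawler identity (“ExampleAIBot”) and treats robots.txt as the only reservation signal; a production pipeline would also check TDM-specific metadata, HTTP headers, and licensing records.

    # Minimal opt-out check before adding a page to a training corpus.
    # "ExampleAIBot" is a hypothetical crawler name; robots.txt is treated here
    # as the only reservation signal, which is a simplifying assumption.
    from urllib import robotparser
    from urllib.parse import urlparse

    CRAWLER_USER_AGENT = "ExampleAIBot"

    def may_use_for_training(url: str) -> bool:
        """Return True only if the site's robots.txt does not reserve the URL."""
        parts = urlparse(url)
        parser = robotparser.RobotFileParser()
        parser.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            parser.read()  # fetch and parse the reservation file
        except OSError:
            return False   # be conservative if the signal cannot be read
        return parser.can_fetch(CRAWLER_USER_AGENT, url)

    # Usage: filter candidate documents before ingestion.
    candidates = ["https://example.com/articles/1", "https://example.com/private/2"]
    allowed = [u for u in candidates if may_use_for_training(u)]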

The obligation to respect copyright reservations is not merely a civil liability risk; under the AI Act, it is a regulatory requirement that can trigger market surveillance action.

Transparency Obligations: Disclosure and Watermarking

The EU places a heavy emphasis on the “accountability” of generative systems. For all GPAI models, providers must:

  • Draw up and publish a summary of the content used for training.
  • Ensure outputs are marked in a machine-readable format so that they are detectable as AI-generated.

The requirement to publish a training data summary is a novel regulatory burden. It is intended to allow copyright holders to exercise their rights effectively. However, the granularity of this summary is a subject of ongoing debate. Does it require a list of every dataset, or is a high-level description sufficient? The European Commission’s guidelines will likely push for a level of detail that allows for the identification of specific works, which poses significant technical and administrative challenges for model developers.

Regarding watermarking, the AI Act mandates that providers ensure AI-generated content is marked as such. This must be achieved in a way that is “durable” and “machine-readable.” This moves beyond simple visual disclaimers; it requires embedding metadata or cryptographic watermarks that survive editing and republishing. This aligns with the EU’s broader push for transparency in the digital ecosystem, distinct from the more voluntary approaches seen elsewhere.
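
As an illustration of what machine-readable marking can mean at the implementation level, the sketch below (Python, assuming the Pillow library) embeds a small JSON provenance manifest into a generated PNG. This is only one possible, simplified approach: plain metadata chunks do not survive re-encoding or screenshots, so durable compliance typically layers robust watermarks and signed C2PA-style manifests on top.

    # Sketch: embed a machine-readable provenance manifest in a generated PNG.
    # Assumes Pillow is installed; the manifest fields are illustrative only.
    import json
    from datetime import datetime, timezone
    from PIL import Image, PngImagePlugin

    def mark_as_ai_generated(image: Image.Image, model_name: str, out_path: str) -> None:
        manifest = {
            "ai_generated": True,
            "generator": model_name,  # hypothetical model identifier
            "created": datetime.now(timezone.utc).isoformat(),
        }
        info = PngImagePlugin.PngInfo()
        info.add_text("ai_provenance", json.dumps(manifest))  # machine-readable marker
        image.save(out_path, pnginfo=info)

    # Usage (the generation step is a placeholder):
    # img = generate_image(prompt)
    # mark_as_ai_generated(img, "example-model-v1", "output.png")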

Safety Evaluation and Systemic Risk

Not all GPAI models are treated equally. Those deemed to present systemic risk face stricter obligations; the Act presumes such risk for models trained with more than 10^25 floating-point operations of cumulative compute, a threshold that currently captures only the most capable (GPT-4 class) models. These obligations include:

  • Conducting model evaluations and adversarial testing (red-teaming).
  • Assessing and mitigating potential systemic risks (including cybersecurity and bio risks).
  • Reporting serious incidents to the European AI Office.

Unlike in the US, where safety testing is often guided by voluntary commitments, in the EU this is a legal obligation with defined reporting timelines (15 days after identification of a serious incident).
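
Because the systemic-risk presumption turns on cumulative training compute, a rough self-assessment can be made with the widely used approximation of roughly 6 FLOPs per parameter per training token. The sketch below uses purely illustrative model sizes; it is a back-of-the-envelope estimate, not an official calculation method.

    # Back-of-the-envelope check against the 10^25 FLOP systemic-risk presumption,
    # using the common approximation: total FLOPs ~= 6 * parameters * training tokens.
    # Parameter counts and token budgets are illustrative assumptions only.
    SYSTEMIC_RISK_THRESHOLD_FLOPS = 1e25

    def training_flops(n_parameters: float, n_tokens: float) -> float:
        """Approximate total training compute for a dense transformer."""
        return 6 * n_parameters * n_tokens

    for name, params, tokens in [
        ("mid-size-model", 7e9, 2e12),    # 7B parameters, 2T tokens (assumed)
        ("frontier-model", 1e12, 15e12),  # 1T parameters, 15T tokens (assumed)
    ]:
        flops = training_flops(params, tokens)
        status = "presumed systemic risk" if flops >= SYSTEMIC_RISK_THRESHOLD_FLOPS else "below threshold"
        print(f"{name}: {flops:.2e} FLOPs -> {status}")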

The United States: Litigation, Executive Power, and Sectoral Nuance

The United States lacks a horizontal, federal AI law equivalent to the AI Act. Instead, the regulatory environment is shaped by three forces: the judiciary (copyright lawsuits), the Executive Branch (Executive Orders and NIST frameworks), and sectoral regulators (FTC, SEC).

Copyright and the “Fair Use” Defense

The central battleground for training data in the US is the federal court system. Major lawsuits (e.g., Andersen v. Stability AI, New York Times v. OpenAI) hinge on the interpretation of fair use under the Copyright Act. Unlike the EU’s “opt-out” regime, US law relies on the flexible fair use doctrine (17 U.S.C. § 107), which permits unlicensed use of copyrighted material for purposes such as criticism, comment, news reporting, teaching, scholarship, or research, assessed case by case against four statutory factors.

AI companies argue that training models on vast datasets constitutes “transformative use”—a highly debated legal theory. The outcome of these cases will define the global availability of training data. If US courts rule that training is not fair use, the industry may face a massive retroactive liability crisis, forcing a shift toward licensed data or synthetic data.

Executive Order 14110 and Safety Reporting

In October 2023, President Biden issued Executive Order 14110 on Safe, Secure, and Trustworthy Artificial Intelligence. While not legislation, it directs federal agencies to use existing powers to regulate AI. Key impacts include:

  • Reporting Requirements: Developers of dual-use foundation models above a compute threshold must report red-team safety test results to the Department of Commerce under Defense Production Act authority; the US AI Safety Institute (housed in NIST) develops the accompanying evaluation guidance.
  • Watermarking: The order directs the development of standards for watermarking AI-generated content.

However, these are directives to agencies, not direct statutory obligations on companies. The durability of these requirements depends on future political administrations, creating a volatile regulatory environment compared to the static nature of EU law.

Agency Enforcement

The Federal Trade Commission (FTC) has been aggressive in using Section 5 of the FTC Act to police “unfair or deceptive” practices related to AI. This includes the failure to disclose AI-generated content or the deployment of biased algorithms. The focus is on consumer protection and competition, rather than the comprehensive rights-based approach of the EU.

China: Security, Ideology, and Pre-Approval

China’s approach is characterized by state control and the prioritization of “social stability.” The regulatory framework has evolved rapidly through the Interim Measures for the Management of Generative Artificial Intelligence Services, the Deep Synthesis Provisions, and the Algorithmic Recommendation Management Provisions.

Content Governance and Training Data

Unlike the EU’s focus on copyright or the US’s focus on fair use, China’s primary concern with training data is ideological and security alignment. Providers must ensure that training data does not contain content that:

  • Endangers national security or national unity.
  • Incites subversion or terrorism.
  • Propagates “historical nihilism” or obscenity.

Furthermore, generated content must uphold “Core Socialist Values,” and training data must meet legality and quality requirements. This requires rigorous pre-training filtering and active monitoring of outputs. The burden of censorship is placed directly on the provider, with strict liability for non-compliance.

Real-Name Verification and Watermarking

China mandates that users undergo real-name verification to access generative AI services. Watermarking requirements are strict, aimed at preventing the spread of deepfakes and misinformation. The regulations explicitly require the labeling of AI-generated content to maintain “social order.”

Safety Evaluation

Before a generative AI service can be made available to the public in China, it must undergo a security assessment and filing with the Cyberspace Administration of China (CAC). This is a form of pre-market approval that is absent in the US and distinct from the EU’s conformity assessment for high-risk systems. It effectively acts as a gatekeeper, slowing deployment but ensuring alignment with state requirements.

The United Kingdom: The Pro-Innovation Principles Approach

The UK has deliberately chosen not to pass AI-specific legislation, diverging from the EU. Instead, the government relies on five cross-sectoral principles, applied by existing regulators (such as the ICO, CMA, and Ofcom) in a context-specific manner.

Copyright and the “Output” Focus

The UK’s approach to copyright is nuanced. Under the Copyright, Designs and Patents Act 1988, a text and data mining exception exists for non-commercial research. The UK government has consulted on expanding this to a broader exception covering commercial purposes, subject to an “opt-out” for rightsholders. This aligns more closely with the EU than with the US, but the lack of finalized legislation creates uncertainty.

Currently, the UK Intellectual Property Office (IPO) encourages transparency. However, there is no statutory requirement for a “training data summary” as seen in the AI Act. The focus is on the output: ensuring that AI models do not infringe copyright by reproducing substantial parts of protected works.

Safety and Evaluation

The UK’s safety evaluation regime is voluntary but highly influential. The AI Safety Institute (UK AISI) conducts voluntary testing of frontier models, often in collaboration with developers. The government has proposed a “duty of care” for developers regarding the safety of their models, but this remains a policy aspiration rather than codified law. This “wait and see” approach aims to avoid stifling innovation but relies heavily on the goodwill and cooperation of major AI labs.

Japan and Singapore: Innovation-Friendly Governance

Both Japan and Singapore position themselves as neutral, business-friendly hubs for AI development, offering regulatory clarity without the heavy compliance burdens of the EU.

Japan: The “Social Principles” and Copyright Flexibility

Japan operates under the Social Principles of Human-Centric AI, which are non-binding guidelines. The Japanese government has signaled a very permissive stance on copyright for AI training. Article 30-4 of the Copyright Act permits the use of copyrighted works for information analysis, and 2024 guidance from the Agency for Cultural Affairs clarified that using copyrighted data for AI training, even for commercial purposes, is generally permissible and does not infringe copyright, provided the use does not unreasonably prejudice rightsholders and the output does not reproduce protected expression.

Regarding watermarking and safety, Japan relies on the G7 Hiroshima AI Process international guidelines. These encourage voluntary adherence to safety standards, watermarking, and risk management rather than imposing strict legal mandates. This creates a low-friction environment for developers to train models.

Singapore: The Model AI Governance Framework

Singapore’s approach is pragmatic and voluntary. The Model AI Governance Framework provides detailed guidance on responsible AI deployment but does not impose legal sanctions for non-compliance (unless the AI application violates existing laws, such as consumer protection or data privacy).

  • Training Data: Singapore emphasizes data quality and lineage but does not mandate specific copyright clearance protocols.
  • Transparency: The framework encourages disclosure to users that they are interacting with AI, but this is a best-practice recommendation, not a legal requirement.
  • Safety: The focus is on “human-in-the-loop” and rigorous testing prior to deployment, managed through internal governance rather than external regulatory reporting.

Singapore has recently launched the AI Verify testing toolkit, allowing companies to voluntarily demonstrate their AI system’s performance against safety benchmarks, providing a pragmatic path to building trust without heavy regulation.

Comparative Analysis: The Friction Points

For multinational organizations, the divergence in these regimes creates significant friction points.

Training Data: The Copyright Chasm

The most significant operational risk lies in the difference between the US “fair use” defense and the EU’s “opt-out” compliance. A model trained in the US under the assumption of fair use may be illegal to deploy in the EU if the provider has not respected machine-readable opt-outs. Conversely, a model strictly adhering to EU opt-outs may be at a competitive disadvantage in the US if US competitors use broader datasets. Japan’s permissive stance offers a potential third way, but models trained there must still comply with the laws of the markets where they are deployed.

Transparency: Disclosure vs. Detection

The EU mandates the publication of training data summaries and the implementation of machine-readable watermarking. The US and UK focus more on the output and user experience (e.g., preventing deception). China mandates watermarking for state security. For a global product, the EU’s machine-readable watermarking standard is likely to become the de facto global baseline due to the “Brussels Effect”—where EU standards become global standards because companies find it easier to comply with the strictest rule everywhere.

Safety: Pre-Market vs. Post-Market

China and the EU (for systemic risk models) introduce elements of pre-market scrutiny—either through filing requirements (China) or conformity assessments (EU). The US and UK rely more on post-market enforcement and voluntary commitments. This means that entering the Chinese or European markets requires a higher degree of upfront safety documentation and testing compared to the US/UK.

Strategic Implications for European Professionals

For professionals operating within Europe, the landscape is clear but demanding. The AI Act is not just a compliance checklist; it is a fundamental restructuring of liability and responsibility for AI systems.

Operationalizing Compliance

To navigate this, organizations must:

  1. Map Data Lineage: You cannot comply with the EU’s copyright policy requirement without knowing exactly what data was used for training. This requires technical infrastructure that tracks data sources and respects opt-out signals (see the sketch after this list).
  2. Implement Technical Watermarking: Relying on “AI-generated” disclaimers in text is insufficient. Engineering teams must integrate metadata embedding (e.g., C2PA standards) into the generation pipeline.
  3. Prepare for Incident Reporting: The 15-day window for reporting serious incidents is tight. Organizations need internal governance structures that can rapidly assess an incident, determine whether it meets the threshold of a “serious incident,” and file the report with the European AI Office.
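
A minimal sketch of what a per-document lineage record might look like is shown below (Python). The field names are illustrative assumptions, not a prescribed schema, but records of this kind feed both the copyright policy and the public training data summary.

    # Sketch of a per-document lineage record supporting the three steps above.
    # Field names are illustrative assumptions, not a prescribed schema.
    from dataclasses import dataclass, field, asdict
    from datetime import datetime, timezone

    @dataclass
    class LineageRecord:
        source_url: str         # where the document was obtained
        license: str            # e.g. "CC-BY-4.0", "licensed", "unknown"
        opt_out_checked: bool   # was a machine-readable reservation check performed?
        opt_out_reserved: bool  # did the rightsholder reserve text and data mining?
        ingested_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )

        def usable_for_training(self) -> bool:
            """Usable only if the reservation check ran and found no opt-out."""
            return self.opt_out_checked and not self.opt_out_reserved

    # Usage:
    record = LineageRecord("https://example.com/articles/1", "unknown", True, False)
    print(record.usable_for_training(), asdict(record))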

The Global Patchwork Strategy

There is no single global standard for generative AI regulation. A “lowest common denominator” approach (e.g., complying only with the most lenient jurisdiction) is risky. The extraterritorial reach of the AI Act means that systems placed on the EU market, or whose outputs are used within the EU, fall within its scope regardless of where the provider is established. Therefore, the most viable strategy is to build a “compliance core” based on the strictest requirements—likely the EU AI Act regarding transparency and copyright, and China’s requirements regarding content filtering if operating in that market—and adapt it for local nuances.

Ultimately, the era of unregulated generative AI development is closing. The regulatory “moats” are being dug, and they differ in depth and width depending on the jurisdiction. Professionals must view regulatory compliance not as a tax on innovation, but as a design constraint that shapes the architecture of the next generation of AI systems.
