
The Future of AI Behavior Standards: A Deep Dive into Web-Based AI Disclosures

As artificial intelligence transforms how we search, learn, and create online, website owners face an urgent new question:
How do you tell AI systems what they can and cannot do with your content?
From ChatGPT and Perplexity to Google Gemini and Microsoft Copilot, LLMs (large language models) are trained on and powered by data that often originates from the open web. For years, webmasters could simply rely on robots.txt to manage crawler access. But now, with AI systems doing far more than search indexing, such as training models and answering questions directly, new standards are emerging to govern how content interacts with these machines.
This blog explores the current landscape of web-based AI behavior disclosure standards, including what’s proposed, what’s being adopted, and what you, as a website owner or developer, should consider to future-proof your web presence.

Why AI Disclosure Standards Matter

LLMs like ChatGPT are changing how people consume content: instead of browsing websites, users ask questions and get direct answers synthesized from online data. As more people rely on AI tools instead of visiting sites directly, your content might be used without credit, consent, or even your awareness.
That’s a serious problem for businesses, creators, and publishers who rely on content performance. Whether you want to keep your content out of AI systems entirely or simply want AI agents to use it responsibly, you need a mechanism to signal your preferences.
The problem? There’s no one standard—yet. However, several contenders are emerging, including ai.txt, llms.txt, learners.txt, noai meta tags, and the TDM Reservation Protocol.
Let’s break them down.

1. ai.txt: The Opt-Out File for AI Training

Purpose: Similar to robots.txt, this file, placed at the root of your domain, lets you declare whether different types of content (text, images, audio, video) can be used for training AI models.

Proposed by: Spawning.ai (creators of HaveIBeenTrained.com)

Status: Experimental

Pros:
  • Granular: Differentiates between content types
  • Familiar: Based on a known format (text file at root)
  • Empowering: Gives creators a voice in training consent
Cons:
  • Not standardized or legally binding
  • Depends entirely on AI companies to respect it
  • Overlap/confusion with robots.txt and other standards
Example:

    text: disallow
    images: disallow
    code: allow

If respected, this file gives AI bots a clear signal about what to include in training data—but right now, it’s more of a wish list unless AI companies agree to honor it.

2. llms.txt: Help AI Find the Right Content

Purpose: Unlike ai.txt, which is about opt-out, llms.txt is about opting in and guiding AI agents to your most relevant and valuable content.

Proposed by: Jeremy Howard (Fast.ai, Answer.AI)

Status: Experimental but gaining attention

Pros:

  • Proactive: Helps AI find and understand your best content
  • Lightweight: Markdown format, easy to implement
  • Doesn’t interfere with search indexing
Cons:
  • Requires manual creation/curation
  • Relies on AI agents checking for it

Use Case: You create a curated index of your most useful resources, such as:

    ## Services
    - [SEO Guide](https://yourdomain.com/seo.md): How we help businesses rank on Google.
    - [Contact Us](https://yourdomain.com/contact.md): Reach out for quotes.

This is ideal for businesses, blogs, and documentation sites that want their content to be understood and retrieved accurately by AI systems.
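A fuller file, following the structure proposed at llmstxt.org, opens with an H1 title and a one-line blockquote summary before the link sections, and an "Optional" section marks links an agent can skip when context is limited. Here is a minimal sketch with placeholder names and URLs:

    # YourCompany

    > YourCompany helps small businesses grow through data-driven marketing.

    ## Services
    - [SEO Guide](https://yourdomain.com/seo.md): How we help businesses rank on Google.
    - [Contact Us](https://yourdomain.com/contact.md): Reach out for quotes.

    ## Optional
    - [Blog archive](https://yourdomain.com/blog.md): Older posts and case studies.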

3. learners.txt: Don’t Train on Me

Purpose: learners.txt is another opt-out mechanism, designed to tell machine learning agents not to use your content for training.

Proposed by: Daphne Ippolito & Yun William Yu (academic researchers)

Status: Conceptual/academic proposal

Pros:

  • Simple: Same format as robots.txt
  • Specific: Clear instructions about the training intent
Cons:
  • No adoption yet by major AI companies
  • Redundant with other proposals
Example:

    User-agent: *
    Disallow: /


4. Meta Tags: noai, noimageai, and noml

A. noai and noimageai

Purpose: HTML meta tags added to individual pages to signal AI bots not to use content for training (e.g., DeviantArt uses this).

Status: Informal, limited adoption

Example:

    <meta name="robots" content="noai, noimageai">

Pros:
  • Easy to implement per page
  • Supported by CMS plugins
Cons:
  • Not widely respected
  • Not standardized
B. noml

Purpose: A newer proposal from Mojeek and others to introduce a noml (no machine learning) meta tag.

Example:

    <meta name="robots" content="noml">

Pros:
  • Cleaner naming convention than noai
  • Growing support via open letters and GitHub repos
Cons:
  • Not yet adopted by major search engines
Meta tags are easy to deploy, but a crawler will only see them if it actually fetches and parses your HTML and looks for these directives. Still, they are a low-cost signal worth implementing for now.
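If you control your web server configuration, the same directives can also be sent as an X-Robots-Tag HTTP response header, which reaches non-HTML files such as images and PDFs. Honoring a noai directive delivered this way is just as voluntary as honoring the meta tag; the snippet below is a minimal sketch for nginx (adapt it for Apache or your CDN):

    # nginx: attach the noai/noimageai signals to every response
    add_header X-Robots-Tag "noai, noimageai" always;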

5. TDM Reservation Protocol (tdmrep.json)

Purpose: A legally motivated, structured opt-out protocol for AI training and text/data mining, especially aligned with EU copyright law.

Proposed by: STM Association and W3C Community Group

Status: Actively adopted by large publishers (Elsevier, Springer, etc.)

Format:
  • .well-known/tdmrep.json file
  • Optional meta tags and headers for page-level control
Example JSON snippet (the well-known file holds an array of path rules):

    [
      {
        "location": "/articles/*",
        "tdm-reservation": 1,
        "tdm-policy": "https://yourdomain.com/terms"
      }
    ]

Pros:
  • Legally grounded
  • Detailed path-based control
  • Widely adopted in publishing
Cons:
  • Complex to implement manually
  • Less well known outside legal and commercial publishing
If you’re a publisher or enterprise, this is the gold standard. Even Common Crawl respects it.
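For page-level control, the protocol also defines matching HTML meta tags and HTTP response headers. A minimal sketch based on the W3C Community Group report (verify the exact names against the current spec before deploying):

    <!-- On an individual page -->
    <meta name="tdm-reservation" content="1">
    <meta name="tdm-policy" content="https://yourdomain.com/terms">

    # Or as HTTP response headers
    tdm-reservation: 1
    tdm-policy: https://yourdomain.com/terms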

6. Company-Specific Robots.txt Rules (e.g., GPTBot, Google-Extended)

Purpose: AI giants like OpenAI and Google have published user-agent strings so you can block their AI-specific crawlers.

How to use: Add these to your robots.txt file.

For OpenAI (ChatGPT):

    User-agent: GPTBot
    Disallow: /


For Google (Gemini, formerly Bard):

    User-agent: Google-Extended
    Disallow: /

Pros:
  • Supported by the companies themselves
  • Simple and effective
Cons:
  • Company-specific (you need to keep track of new AI bots)
  • Doesn’t cover non-cooperative actors

This is currently the most practical and enforceable method if you want to allow or block specific AI bots from training on your site.
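If you want a single starting point, a combined robots.txt can list every AI crawler you choose to block. The sketch below uses user-agent tokens the vendors have published (GPTBot for OpenAI, Google-Extended for Google, CCBot for Common Crawl, ClaudeBot for Anthropic, PerplexityBot for Perplexity); bot names change as new crawlers launch, so confirm each one against the vendor's current documentation:

    # Block common AI training and answering crawlers site-wide
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /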

Not sure which one’s right for your business? Kuware can help you choose and implement the best AI policy for your website. Let’s connect!

Comparing the Standards

| Standard          | Purpose          | Voluntary? | Granularity | Adoption Level           |
|-------------------|------------------|------------|-------------|--------------------------|
| ai.txt            | Opt-out training | Yes        | Medium      | Low                      |
| llms.txt          | Opt-in discovery | Yes        | High        | Experimental             |
| learners.txt      | Opt-out training | Yes        | Low         | Conceptual               |
| noai / noimageai  | Opt-out training | Yes        | Per page    | Niche (e.g., DeviantArt) |
| noml              | Opt-out training | Yes        | Per page    | Proposal stage           |
| TDMRep            | Opt-out training | Yes        | High        | High (publishers)        |
| robots.txt        | Opt-out by bot   | Yes        | Medium      | High (GPTBot, Google)    |

What Should You Use?

With multiple emerging standards, deciding which approach fits your needs can be tricky. Here’s a practical breakdown based on your goals and content strategy.
If you want to…

1. Be visible to AI tools (ChatGPT, Perplexity, etc.)

  • Use llms.txt to highlight your best content
  • Allow GPTBot and Google-Extended in robots.txt
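Crawling is allowed by default, so no robots.txt entry is strictly required to opt in, but explicit rules make the policy unambiguous. A minimal sketch:

    # Explicitly permit OpenAI's and Google's AI crawlers
    User-agent: GPTBot
    Allow: /

    User-agent: Google-Extended
    Allow: /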

2. Prevent AI model training on your content

  • Block known AI bots like GPTBot and Google-Extended in robots.txt
  • Use ai.txt, noai, or noml meta tags
  • Implement tdmrep.json if you’re in publishing or the EU

3. Future-proof your content strategy

  • Monitor emerging standards and adopt the best-fit for your content
  • Join conversations on GitHub and forums where these standards evolve
There’s no universal, enforceable AI behavior standard yet. What we have is a growing patchwork of voluntary conventions. However, with AI models becoming more central to how we interact with information, this space is ripe for standardization.
In the meantime, a good strategy is to use:
  • robots.txt for immediate access control
  • llms.txt to promote your content to LLMs
  • TDMRep, if you require legal coverage
  • Meta tags (noai, noml) as additional signals
The web is becoming AI-native. Whether you’re trying to opt in or opt out, now’s the time to stake your claim.
Want Help Creating an llms.txt or AI Policy for Your Website? Get help from our industry experts at Kuware. We can craft a tailored solution based on your site structure, goals, and content strategy.
Book your FREE AI Policy Consultation with Kuware and protect your content in the age of intelligent automation.