
The Future of AI Behavior Standards: A Deep Dive into Web-Based AI Disclosures

As artificial intelligence transforms how we search, learn, and create online, website owners face an urgent new question:
How do you tell AI systems what they can and cannot do with your content?
From ChatGPT and Perplexity to Google Gemini and Microsoft Copilot, LLMs (large language models) are trained on and powered by data that often originates from the open web. For years, webmasters could simply rely on robots.txt to manage crawler access. But now, with AI systems doing far more than search indexing, such as training models and answering questions directly, new standards are emerging to govern how content interacts with these machines.
This blog explores the current landscape of web-based AI behavior disclosure standards, including what’s proposed, what’s being adopted, and what you, as a website owner or developer, should consider to future-proof your web presence.

Why AI Disclosure Standards Matter

LLMs like ChatGPT are changing how people consume content: instead of browsing websites, users ask questions and get direct answers synthesized from online data. As more people rely on AI tools instead of visiting sites directly, your content might be used without credit, consent, or even your awareness.
That’s a serious problem for businesses, creators, and publishers who rely on content performance. Whether you want to keep your content out of AI systems entirely or simply want AI agents to use it responsibly, you need a mechanism to signal your preferences.
The problem? There’s no one standard—yet. However, several contenders are emerging, including ai.txt, llms.txt, learners.txt, noai meta tags, and the TDM Reservation Protocol.
Let’s break them down.

1. ai.txt: The Opt-Out File for AI Training

Purpose: Similar to robots.txt, this file, placed at the root of your domain, lets you declare whether different types of content (text, images, audio, video) can be used for training AI models.

Proposed by: Spawning.ai (creators of HaveIBeenTrained.com)

Status: Experimental

Pros:
  • Granular: Differentiates between content types
  • Familiar: Based on a known format (text file at root)
  • Empowering: Gives creators a voice in training consent
Cons:
  • Not standardized or legally binding
  • Depends entirely on AI companies to respect it
  • Overlap/confusion with robots.txt and other standards
Example:

    text: disallow
    images: disallow
    code: allow

If respected, this file gives AI bots a clear signal about what to include in training data—but right now, it’s more of a wish list unless AI companies agree to honor it.

2. llms.txt: Help AI Find the Right Content

Purpose: Unlike ai.txt, which is about opt-out, llms.txt is about opting in and guiding AI agents to your most relevant and valuable content.

Proposed by: Jeremy Howard (Fast.ai, Answer.AI)

Status: Experimental but gaining attention

Pros:

  • Proactive: Helps AI find and understand your best content
  • Lightweight: Markdown format, easy to implement
  • Doesn’t interfere with search indexing
Cons:
  • Requires manual creation/curation
  • Relies on AI agents checking for it

Use Case: You create a curated index of your most useful resources, such as:

    ## Services
    - [SEO Guide](https://yourdomain.com/seo.md): How we help businesses rank on Google.
    - [Contact Us](https://yourdomain.com/contact.md): Reach out for quotes.

This is ideal for businesses, blogs, and documentation sites that want their content to be understood and retrieved accurately by AI systems.
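A fuller file, following the structure proposed at llmstxt.org, opens with an H1 title and a one-line blockquote summary before the link sections, and an "Optional" section marks links an agent can skip when context is limited. Here is a minimal sketch with placeholder names and URLs:

    # YourCompany

    > YourCompany helps small businesses grow through data-driven marketing.

    ## Services
    - [SEO Guide](https://yourdomain.com/seo.md): How we help businesses rank on Google.
    - [Contact Us](https://yourdomain.com/contact.md): Reach out for quotes.

    ## Optional
    - [Blog archive](https://yourdomain.com/blog.md): Older posts and case studies.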

3. learners.txt: Don’t Train on Me

Purpose: learners.txt is another opt-out mechanism, designed to tell machine learning agents not to use your content for training.

Proposed by: Daphne Ippolito & Yun William Yu (academic researchers)

Status: Conceptual/academic proposal

Pros:

  • Simple: Same format as robots.txt
  • Specific: Clear instructions about the training intent
Cons:
  • No adoption yet by major AI companies
  • Redundant with other proposals
Example:

    User-agent: *
    Disallow: /


4. Meta Tags: noai, noimageai, and noml

A. noai and noimageai

Purpose: HTML meta tags added to individual pages to signal AI bots not to use content for training (e.g., DeviantArt uses this).

Status: Informal, limited adoption

Example:

    <meta name="robots" content="noai, noimageai">

Pros:
  • Easy to implement per page
  • Supported by CMS plugins
Cons:
  • Not widely respected
  • Not standardized
B. noml

Purpose: A newer proposal from Mojeek and others to introduce a noml (no machine learning) meta tag.

Example:

    <meta name="robots" content="noml">

Pros:
  • Cleaner naming convention than noai
  • Growing support via open letters and GitHub repos
Cons:
  • Not yet adopted by major search engines
Meta tags are easy to deploy, but a crawler will only see them if it actually fetches and parses your HTML and looks for these directives. Still, they are a low-cost signal worth implementing for now.
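If you control your web server configuration, the same directives can also be sent as an X-Robots-Tag HTTP response header, which reaches non-HTML files such as images and PDFs. Honoring a noai directive delivered this way is just as voluntary as honoring the meta tag; the snippet below is a minimal sketch for nginx (adapt it for Apache or your CDN):

    # nginx: attach the noai/noimageai signals to every response
    add_header X-Robots-Tag "noai, noimageai" always;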

5. TDM Reservation Protocol (tdmrep.json)

Purpose: A legally motivated, structured opt-out protocol for AI training and text/data mining, especially aligned with EU copyright law.

Proposed by: STM Association and W3C Community Group

Status: Actively adopted by large publishers (Elsevier, Springer, etc.)

Format:
  • .well-known/tdmrep.json file
  • Optional meta tags and headers for page-level control
Example JSON snippet (the well-known file holds an array of path rules):

    [
      {
        "location": "/articles/*",
        "tdm-reservation": 1,
        "tdm-policy": "https://yourdomain.com/terms"
      }
    ]

Pros:
  • Legally grounded
  • Detailed path-based control
  • Widely adopted in publishing
Cons:
  • Complex to implement manually
  • Less well known outside legal and commercial publishing
If you’re a publisher or enterprise, this is the gold standard. Even Common Crawl respects it.
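For page-level control, the protocol also defines matching HTML meta tags and HTTP response headers. A minimal sketch based on the W3C Community Group report (verify the exact names against the current spec before deploying):

    <!-- On an individual page -->
    <meta name="tdm-reservation" content="1">
    <meta name="tdm-policy" content="https://yourdomain.com/terms">

    # Or as HTTP response headers
    tdm-reservation: 1
    tdm-policy: https://yourdomain.com/terms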

6. Company-Specific Robots.txt Rules (e.g., GPTBot, Google-Extended)

Purpose: AI giants like OpenAI and Google have published user-agent strings so you can block their AI-specific crawlers.

How to use: Add these to your robots.txt file.

For OpenAI (ChatGPT):

    User-agent: GPTBot
    Disallow: /


For Google (Gemini, formerly Bard):

    User-agent: Google-Extended
    Disallow: /

Pros:
  • Supported by the companies themselves
  • Simple and effective
Cons:
  • Company-specific (you need to keep track of new AI bots)
  • Doesn’t cover non-cooperative actors

This is currently the most practical and enforceable method if you want to allow or block specific AI bots from training on your site.
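If you want a single starting point, a combined robots.txt can list every AI crawler you choose to block. The sketch below uses user-agent tokens the vendors have published (GPTBot for OpenAI, Google-Extended for Google, CCBot for Common Crawl, ClaudeBot for Anthropic, PerplexityBot for Perplexity); bot names change as new crawlers launch, so confirm each one against the vendor's current documentation:

    # Block common AI training and answering crawlers site-wide
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: PerplexityBot
    Disallow: /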

Not sure which one’s right for your business? Kuware can help you choose and implement the best AI policy for your website. Let’s connect!

Comparing the Standards

| Standard          | Purpose          | Voluntary? | Granularity | Adoption Level           |
|-------------------|------------------|------------|-------------|--------------------------|
| ai.txt            | Opt-out training | Yes        | Medium      | Low                      |
| llms.txt          | Opt-in discovery | Yes        | High        | Experimental             |
| learners.txt      | Opt-out training | Yes        | Low         | Conceptual               |
| noai / noimageai  | Opt-out training | Yes        | Per page    | Niche (e.g., DeviantArt) |
| noml              | Opt-out training | Yes        | Per page    | Proposal stage           |
| TDMRep            | Opt-out training | Yes        | High        | High (publishers)        |
| robots.txt        | Opt-out by bot   | Yes        | Medium      | High (GPTBot, Google)    |

What Should You Use?

With multiple emerging standards, deciding which approach fits your needs can be tricky. Here’s a practical breakdown based on your goals and content strategy.
If you want to…

1. Be visible to AI tools (ChatGPT, Perplexity, etc.)

  • Use llms.txt to highlight your best content
  • Allow GPTBot and Google-Extended in robots.txt
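Crawling is allowed by default, so no robots.txt entry is strictly required to opt in, but explicit rules make the policy unambiguous. A minimal sketch:

    # Explicitly permit OpenAI's and Google's AI crawlers
    User-agent: GPTBot
    Allow: /

    User-agent: Google-Extended
    Allow: /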

2. Prevent AI model training on your content

  • Block known AI bots like GPTBot and Google-Extended in robots.txt
  • Use ai.txt, noai, or noml meta tags
  • Implement tdmrep.json if you’re in publishing or the EU

3. Future-proof your content strategy

  • Monitor emerging standards and adopt the best-fit for your content
  • Join conversations on GitHub and forums where these standards evolve
There’s no universal, enforceable AI behavior standard yet. What we have is a growing patchwork of voluntary conventions. However, with AI models becoming more central to how we interact with information, this space is ripe for standardization.
In the meantime, a good strategy is to use:
  • robots.txt for immediate access control
  • llms.txt to promote your content to LLMs
  • TDMRep, if you require legal coverage
  • Meta tags (noai, noml) as additional signals
The web is becoming AI-native. Whether you’re trying to opt in or opt out, now’s the time to stake your claim.
Want Help Creating an llms.txt or AI Policy for Your Website? Get help from our industry experts at Kuware. We can craft a tailored solution based on your site structure, goals, and content strategy.
Book your FREE AI Policy Consultation with Kuware and protect your content in the age of intelligent automation.