Why AI Disclosure Standards Matter
1. ai.txt: The Opt-Out File for AI Training
Proposed by: Spawning.ai (creators of HaveIBeenTrained.com)
Status: Experimental
Pros:
- Granular: Differentiates between content types
- Familiar: Based on a known format (a plain text file at the site root)
- Empowering: Gives creators a voice in training consent
Cons:
- Not standardized or legally binding
- Depends entirely on AI companies choosing to respect it
- Overlaps with robots.txt and other standards, inviting confusion
Example:
```
text: disallow
images: disallow
code: allow
```
2. llms.txt: Help AI Find the Right Content
Purpose: Unlike ai.txt, which is about opt-out, llms.txt is about opting in and guiding AI agents to your most relevant and valuable content.
Proposed by: Jeremy Howard (Fast.ai, Answer.AI)
Status: Experimental but gaining attention
Pros:
- Proactive: Helps AI find and understand your best content
- Lightweight: Markdown format, easy to implement
- Doesn’t interfere with search indexing
Cons:
- Requires manual creation and curation
- Relies on AI agents actually checking for it
Use Case: You publish a curated index of your most useful resources, for example:
```
## Services
- [SEO Guide](https://yourdomain.com/seo.md): How we help businesses rank on Google.
- [Contact Us](https://yourdomain.com/contact.md): Reach out for quotes.
```
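Because llms.txt is just Markdown in a conventional shape (an H1 site name, a blockquote summary, then H2 sections of links), it is easy to generate from a page list. A minimal Python sketch; the site name, URLs, and descriptions are placeholders:

```python
# Sketch: assemble an llms.txt body in the proposed Markdown shape.
# All titles, URLs, and descriptions below are placeholders.
pages = [
    ("SEO Guide", "https://yourdomain.com/seo.md",
     "How we help businesses rank on Google."),
    ("Contact Us", "https://yourdomain.com/contact.md",
     "Reach out for quotes."),
]

def build_llms_txt(site_name, summary, section, entries):
    """Render: H1 site name, blockquote summary, one H2 section of links."""
    lines = [f"# {site_name}", "", f"> {summary}", "", f"## {section}"]
    for title, url, desc in entries:
        lines.append(f"- [{title}]({url}): {desc}")
    return "\n".join(lines) + "\n"

print(build_llms_txt("Your Company", "Short description of the site.",
                     "Services", pages))
```

Write the result to /llms.txt at your site root so agents that look for the file can find it.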
3. learners.txt: Don’t Train on Me
Purpose: learners.txt is another opt-out mechanism, designed to tell machine learning agents not to use your content for training.
Proposed by: Daphne Ippolito & Yun William Yu (academic researchers)
Status: Conceptual/academic proposal
Pros:
- Simple: Same format as robots.txt
- Specific: Clear instructions about the training intent
Cons:
- No adoption yet by major AI companies
- Redundant with other proposals
Example (identical in form to robots.txt):
```
User-agent: *
Disallow: /
```
4. Meta Tags: noai, noimageai, and noml
A. noai and noimageai
Purpose: HTML meta tags added to individual pages to signal AI bots not to use content for training (e.g., DeviantArt uses this).
Status: Informal, limited adoption
Pros:
- Easy to implement per page
- Supported by CMS plugins
Cons:
- Not widely respected
- Not standardized
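The DeviantArt-style usage places the directives in the standard robots meta tag in the page head; since nothing here is standardized, treat the exact values as informal signals:

```html
<head>
  <!-- Informal signal: do not use this page's text or images for AI training -->
  <meta name="robots" content="noai, noimageai">
</head>
```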
B. noml
Purpose: A newer proposal from Mojeek and others to introduce a noml (no machine learning) meta tag.
Example:
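The proposal piggybacks on the existing robots meta tag; a likely form (illustrative only, since the directive is not yet standardized):

```html
<meta name="robots" content="noml">
```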
Pros:
- Cleaner naming convention than noai
- Growing support via open letters and GitHub repos
Cons:
- Not yet adopted by major search engines
5. TDM Reservation Protocol (tdmrep.json)
Purpose: A legally motivated, structured opt-out protocol for AI training and text/data mining, especially aligned with EU copyright law.
Proposed by: STM Association and W3C Community Group
Status: Actively adopted by large publishers (Elsevier, Springer, etc.)
How it works:
- A .well-known/tdmrep.json file at the site root
- Optional meta tags and HTTP headers for page-level control
Example:
```json
{
  "policies": [
    {
      "path": "/articles/*",
      "tdm-reservation": true,
      "tdm-policy": "https://yourdomain.com/terms"
    }
  ]
}
```
Pros:
- Legally grounded in EU copyright law's text-and-data-mining opt-out
- Detailed path-based control
- Widely adopted in publishing
Cons:
- Complex to implement manually
- Little known outside legal and commercial publishing
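To illustrate the path-based matching, here is a small Python sketch that checks a URL path against a tdmrep.json-style policy list. The field names mirror the example above; consult the W3C TDMRep specification for the normative schema:

```python
import fnmatch

# Policy document mirroring the tdmrep.json example above.
tdmrep = {
    "policies": [
        {
            "path": "/articles/*",
            "tdm-reservation": True,
            "tdm-policy": "https://yourdomain.com/terms",
        }
    ]
}

def tdm_reserved(path, doc):
    """Return True if any policy whose path glob matches reserves TDM rights."""
    for policy in doc.get("policies", []):
        if policy.get("tdm-reservation") and fnmatch.fnmatch(path, policy.get("path", "")):
            return True
    return False

print(tdm_reserved("/articles/2024/ai-act.html", tdmrep))  # inside /articles/*
print(tdm_reserved("/about", tdmrep))                      # outside any policy
```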
6. Company-Specific Robots.txt Rules (e.g., GPTBot, Google-Extended)
Purpose: AI giants like OpenAI and Google have published user-agent strings so you can block their AI-specific crawlers.
How to use: Add these to your robots.txt file.
For OpenAI (ChatGPT):
```
User-agent: GPTBot
Disallow: /
```
For Google (Bard, SGE):
```
User-agent: Google-Extended
Disallow: /
```
Pros:
- Supported by the companies themselves
- Simple and effective
Cons:
- Company-specific (you need to keep track of new AI bots)
- Doesn’t cover non-cooperative actors
This is currently the most practical and enforceable method if you want to allow or block specific AI bots from training on your site.
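You can sanity-check your rules before deploying them with Python's standard-library robots.txt parser; the domain below is a placeholder:

```python
from urllib.robotparser import RobotFileParser

# robots.txt content mirroring the snippets above: block both AI crawlers.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The AI training bots are blocked; ordinary search crawlers are unaffected.
for agent in ("GPTBot", "Google-Extended", "Googlebot"):
    ok = parser.can_fetch(agent, "https://yourdomain.com/blog/post")
    print(agent, "allowed" if ok else "blocked")
```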
Comparing the Standards
| Standard | Purpose | Voluntary? | Granularity | Adoption Level |
| --- | --- | --- | --- | --- |
| ai.txt | Opt-out of training | Yes | Medium | Low |
| llms.txt | Opt-in discovery | Yes | High | Experimental |
| learners.txt | Opt-out of training | Yes | Low | Conceptual |
| noai / noimageai meta tags | Opt-out of training | Yes | Per page | Niche (e.g., DeviantArt) |
| noml meta tag | Opt-out of training | Yes | Per page | Proposal stage |
| TDMRep | Opt-out of training | Yes | High | High (publishers) |
| robots.txt bot rules | Opt-out by bot | Yes | Medium | High (GPTBot, Google-Extended) |
What Should You Use?
1. Be visible to AI tools (ChatGPT, Perplexity, etc.)
- Use llms.txt to highlight your best content
- Allow GPTBot and Google-Extended in robots.txt
2. Prevent AI model training on your content
- Block known AI bots like GPTBot and Google-Extended in robots.txt
- Use ai.txt, noai, or noml meta tags
- Implement tdmrep.json if you’re in publishing or the EU
3. Future-proof your content strategy
- Monitor emerging standards and adopt the best-fit for your content
- Join conversations on GitHub and forums where these standards evolve
In practice, a layered setup works best:
- robots.txt for immediate, bot-level access control
- llms.txt to promote your content to LLMs
- TDMRep if you require legal coverage
- Meta tags (noai, noml) as additional signals