llms.txt & AI Crawlers: How It Differs from robots.txt, Sitemaps, and Canonicals

Prompt Architect Editorial Team · 2026-06-19 · 9 min

TL;DR — AI crawlers like GPTBot and ClaudeBot already visit your site. Here's what llms.txt actually is, how it compares to robots.txt and sitemaps, and what to do first.

A new wave of crawlers is reading your website right now — not just Googlebot, but bots that feed AI training sets and power answers inside ChatGPT, Claude, Perplexity, and Gemini. Naturally, a new file has appeared in the conversation: llms.txt. It is easy to assume llms.txt is the "robots.txt for AI" and that adding it will get your content into AI answers. That assumption is mostly wrong.

This guide separates what is real and standardized from what is experimental. We will cover the AI crawlers that actually exist, how to control them with robots.txt (which works today), what llms.txt is and where its limits are, and how it differs from sitemaps and canonical tags. The goal is a practical plan you can act on without overselling unproven tactics.

Not all AI bots want the same thing, and the distinction matters when you decide what to allow or block.

  • Training crawlers collect content to build or update model training datasets. Examples: OpenAI's GPTBot and Anthropic's ClaudeBot. Google uses a robots.txt token called Google-Extended to let you opt out of having content used for Gemini/Vertex AI training, and Apple offers Applebot-Extended for similar opt-out control.
  • Search/answer crawlers fetch pages to ground real-time answers and cite sources. Examples: OpenAI's OAI-SearchBot (ChatGPT search) and PerplexityBot.
  • User-action fetchers retrieve a page because a person asked the assistant to look at it. Examples: OpenAI's ChatGPT-User and Anthropic's Claude-User.

The practical takeaway: blocking a training crawler does not necessarily block a search crawler, and blocking a search crawler removes you from that assistant's cited answers. These are different decisions with different trade-offs, which is why a single "block all AI" instinct usually backfires.

These crawlers all identify themselves by a User-Agent string and are controlled through robots.txt rules, the same mechanism the web has used for decades. That is the lever that actually works today.
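
As a rough illustration of the three categories above, the following Python sketch maps the User-Agent tokens named in this section to a purpose, for example when sorting server log entries. The `CRAWLER_PURPOSE` table and the `classify_user_agent` helper are hypothetical names for this example, and the case-insensitive substring match is an assumption that a real log pipeline may need to tighten.

```python
from typing import Optional

# Tokens taken from the categories above; the mapping itself is illustrative.
# The robots.txt opt-out tokens (Google-Extended, Applebot-Extended) are left
# out on the assumption that they are policy tokens rather than agents you
# would see in request headers.
CRAWLER_PURPOSE = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "OAI-SearchBot": "search",
    "PerplexityBot": "search",
    "ChatGPT-User": "user-action",
    "Claude-User": "user-action",
}


def classify_user_agent(user_agent: str) -> Optional[str]:
    """Return the purpose of a known AI crawler, or None if unrecognized."""
    ua = user_agent.lower()
    for token, purpose in CRAWLER_PURPOSE.items():
        if token.lower() in ua:
            return purpose
    return None


if __name__ == "__main__":
    for sample in ["Mozilla/5.0 (compatible; GPTBot/1.2)", "PerplexityBot/1.0", "curl/8.0"]:
        print(sample, "->", classify_user_agent(sample))
```

Keeping the purpose next to the token makes it harder to treat "AI bots" as one group when writing rules.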

Controlling AI crawlers via robots.txt

robots.txt lives at your site root (/robots.txt) and is the real, supported way to allow or disallow specific crawlers. Each bot reads the rules under its own User-agent line. Here is a realistic example that allows search/citation bots while opting out of training:

# Default: allow any crawler not matched by a more specific group below
User-agent: *
Allow: /

# Opt OUT of Google's AI training (Gemini/Vertex), keep normal Search
User-agent: Google-Extended
Disallow: /

# Opt OUT of Apple's AI training, keep normal Applebot
User-agent: Applebot-Extended
Disallow: /

# Block OpenAI training crawler entirely
User-agent: GPTBot
Disallow: /

# Keep OpenAI's search/citation bot so you can appear in ChatGPT answers
User-agent: OAI-SearchBot
Allow: /

# Block Anthropic's training crawler
User-agent: ClaudeBot
Disallow: /

# Allow Anthropic's user-initiated fetches
User-agent: Claude-User
Allow: /

# Allow Perplexity to cite you
User-agent: PerplexityBot
Allow: /

Sitemap: https://example.com/sitemap.xml

Two things to remember. First, Google-Extended is a training-only token — disallowing it does not remove you from Google Search or from AI Overviews built on the regular index. Second, robots.txt is a voluntary protocol. Well-known crawlers from major providers honor it, but it is a request, not an enforced firewall. For hard enforcement you need server-side or WAF-level blocking.
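
One way to sanity-check a policy like the example above before deploying it is to run it through Python's standard-library `urllib.robotparser`. This is a minimal sketch: `ROBOTS_TXT` is a trimmed copy of the example, `check_policy` is a hypothetical helper, and the standard-library parser's matching rules may differ in detail from the parser each crawler actually uses.

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the example policy, used only for this check.
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-User
Allow: /
"""


def check_policy(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent token to whether the policy lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Googlebot", "Google-Extended", "GPTBot",
              "OAI-SearchBot", "ClaudeBot", "Claude-User"]
    result = check_policy(ROBOTS_TXT, agents, "https://example.com/blog/post")
    for agent, allowed in result.items():
        print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
```

A check like this only tells you what the file asks for. It says nothing about whether a given crawler complies.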

What llms.txt is — and its limits

llms.txt is a proposal by Jeremy Howard (Answer.AI) introduced in 2024 (documented at llmstxt.org). The idea: place a Markdown file at your root that gives language models a clean, curated map of your most important content — a short description plus links to key pages — so a model working with limited context can find the good stuff without wading through navigation and ads.

A minimal llms.txt looks like this:

# Prompt Architect

> Practical guides on prompt engineering and AI search optimization.

## Core guides
- [SEO, AEO & GEO guide](https://example.com/blog/seo-aeo-geo-ai-search-optimization-guide): the pillar overview
- [Measuring AI search performance](https://example.com/blog/ai-search-measurement-search-console-ga4): GA4 + Search Console

## Optional
- [Changelog](https://example.com/changelog)

Now the honest part. llms.txt is not an official web standard. It is a community convention, and there is no clear evidence that major search or AI engines actually read and act on it. Google has explicitly said that no special file or markup is required to appear in AI features — standard SEO plus structured data is the foundation. So treat llms.txt as an experimental, low-cost supplement, not a ranking lever. It is cheap to add and harmless, but do not expect measurable traffic from it, and do not let it crowd out the basics that demonstrably work.
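
If you do decide to try it, the file is simple enough to generate from a list of pages. The sketch below renders the same shape as the minimal example above (title, one-line summary, then sections of links). `build_llms_txt` and its parameters are hypothetical names for this illustration, and the output covers only the parts of the proposal shown in this article.

```python
def build_llms_txt(
    title: str,
    summary: str,
    sections: dict[str, list[tuple[str, str, str]]],
) -> str:
    """Render a minimal llms.txt: H1 title, blockquote summary, H2 link lists.

    Each link is a (name, url, note) tuple; an empty note is omitted.
    """
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}"]
        for name, url, note in links:
            suffix = f": {note}" if note else ""
            lines.append(f"- [{name}]({url}){suffix}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Prompt Architect",
        "Practical guides on prompt engineering and AI search optimization.",
        {
            "Core guides": [
                ("Measuring AI search performance",
                 "https://example.com/blog/ai-search-measurement-search-console-ga4",
                 "GA4 + Search Console"),
            ],
            "Optional": [("Changelog", "https://example.com/changelog", "")],
        },
    ))
```

Generating the file from the same source as your sitemap keeps the two from drifting apart, which matters more than the file's exact wording.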

robots.txt vs. sitemap.xml vs. canonical vs. llms.txt

These four are often lumped together, but they solve different problems. Confusing them leads to wasted effort.

| File / tag | Purpose | Standardized? | Honored by major engines? |
| --- | --- | --- | --- |
| robots.txt | Tell crawlers what they may or may not fetch (including AI bots) | Yes (long-standing) | Yes, widely honored |
| sitemap.xml | List URLs you want discovered/indexed, with metadata | Yes (sitemaps.org) | Yes |
| rel="canonical" | Tell engines the preferred version of duplicate/similar pages | Yes (supported signal) | Yes (a hint, not a command) |
| llms.txt | Curated Markdown map of key content for LLMs | No (2024 proposal) | Unclear / unproven |

The key distinction: robots.txt, sitemap.xml, and canonical tags are established, engine-supported mechanisms. llms.txt is an aspirational convention without confirmed adoption. Spend your time on the first three; add the fourth only as a bonus.

Training opt-out vs. search visibility: the trade-off

This is the decision most teams get wrong. Blocking training crawlers (via GPTBot, ClaudeBot, Google-Extended, Applebot-Extended) keeps your content out of future model training. That can protect proprietary or premium material. But there is a separate cost to consider on the search side.

If you block the answer/search crawlers (OAI-SearchBot, PerplexityBot, and the user-fetch agents), you remove yourself from those assistants' cited answers entirely. For a content site that wants referral traffic and brand visibility inside AI answers, that is usually the wrong move.

A balanced default for most publishers: opt out of training where you care about it, but keep search/citation bots allowed. And keep one sober expectation in mind — Pew Research reported in 2025 that users click source links less often when an AI summary is present. Being cited is not the same as getting traffic. Visibility in AI answers is real, but it does not convert like a classic blue-link result.
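
To make the two decisions explicit, a policy can be expressed as two separate switches and rendered into robots.txt. This is a sketch under stated assumptions: `build_robots_txt`, `TRAINING_TOKENS`, and `SEARCH_TOKENS` are hypothetical names, and the token lists only cover the agents named in this article.

```python
# Tokens grouped as in this article; the grouping is illustrative, not exhaustive.
TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended"]
SEARCH_TOKENS = ["OAI-SearchBot", "PerplexityBot", "ChatGPT-User", "Claude-User"]


def build_robots_txt(block_training: bool, allow_search: bool, sitemap_url: str = "") -> str:
    """Render a robots.txt that treats training and search tokens separately."""
    groups = ["User-agent: *\nAllow: /"]
    for token in TRAINING_TOKENS:
        rule = "Disallow: /" if block_training else "Allow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    for token in SEARCH_TOKENS:
        rule = "Allow: /" if allow_search else "Disallow: /"
        groups.append(f"User-agent: {token}\n{rule}")
    text = "\n\n".join(groups) + "\n"
    if sitemap_url:
        text += f"\nSitemap: {sitemap_url}\n"
    return text


if __name__ == "__main__":
    # The balanced default: opt out of training, stay visible in answers.
    print(build_robots_txt(block_training=True, allow_search=True,
                           sitemap_url="https://example.com/sitemap.xml"))
```

Keeping the switches independent means a later "block AI training" request cannot silently remove you from cited answers.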

Practical recommendation: what to do first

Work from proven to experimental, in this order:

  1. Get the fundamentals right. Clean robots.txt, a current sitemap.xml, correct canonical tags, and basic structured data (e.g., Article). This is what Google says actually matters for AI features.
  2. Decide your AI crawler policy deliberately. Use the robots.txt example above. Separate training opt-out from search visibility — don't accidentally block the bots that put you in answers.
  3. Write answer-first, citation-friendly content. This is where measurable gains live. See our companion guide on SEO, AEO & GEO for the full framework.
  4. Add llms.txt only after the above — as a cheap, optional experiment, not a priority.
  5. Measure. Track AI-engine referrals in GA4 and watch Search Console. Our guide on measuring AI search performance walks through the setup, including why AI Overviews are hard to isolate.
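
For the measurement step, a rough first pass is to tally referrer hostnames from an analytics export or server log. The sketch below is illustrative only: the hostnames in `AI_REFERRER_HOSTS` are assumptions about where each assistant sends traffic from, and `ai_source` and `count_ai_referrals` are hypothetical helpers. Verify the hostnames against your own data before relying on the counts.

```python
from collections import Counter
from typing import Optional
from urllib.parse import urlparse

# Assumed referrer hostnames for each assistant; verify against your own data.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "claude.ai": "Claude",
    "gemini.google.com": "Gemini",
}


def ai_source(referrer: str) -> Optional[str]:
    """Return the assistant name for a referrer URL, or None if not matched."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def count_ai_referrals(referrers: list[str]) -> Counter:
    """Count visits per assistant, ignoring referrers that do not match."""
    counts: Counter = Counter()
    for referrer in referrers:
        name = ai_source(referrer)
        if name is not None:
            counts[name] += 1
    return counts
```

Many AI-driven visits arrive with no referrer at all, so treat these counts as a lower bound rather than a full picture.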

Common misconceptions

  • "llms.txt is the robots.txt for AI." No. robots.txt is a standard that AI crawlers actually obey. llms.txt is an unproven 2024 proposal with no confirmed engine adoption.
  • "Adding llms.txt gets me into AI answers." There is no evidence of that. Google states no special file is needed for AI features.
  • "Blocking Google-Extended removes me from Google." No. It only opts out of AI training. Regular Search and index-based AI Overviews are unaffected.
  • "Blocking GPTBot blocks ChatGPT search." No. GPTBot is the training crawler; OAI-SearchBot handles search, and ChatGPT-User handles user-initiated fetches. They are separate User-Agents.
  • "Being cited equals traffic." Not reliably — users often read the AI summary and skip the link.
  • "I need FAQ/HowTo schema to show up in AI." Google retired HowTo rich results in 2023 and limited FAQ rich results to well-known, authoritative government and health sites. The schema can still help machine readability, but don't expect rich results from it.

Ask AI the right way (prompt tips)

Use these prompts to audit and plan your own crawler policy. Replace the bracketed parts with your real domain and content.

Review this robots.txt for an AI crawler policy. My goal: opt OUT of
AI training (OpenAI, Anthropic, Google, Apple) but stay VISIBLE in
ChatGPT search and Perplexity citations. List any rule that conflicts
with that goal and rewrite the file. Here is my current robots.txt:
[paste robots.txt]

I run a [topic] content site that wants referral traffic from AI
answers. Given that llms.txt is an unproven 2024 proposal and Google
says no special file is required for AI features, give me a prioritized
5-step plan, from proven tactics to experimental ones. Be explicit
about what likely won't move the needle.

Want feedback on a prompt before you run it? Try Prompt Architect's prompt analyzer.