'Block all AI bots' is now actively costing you visibility
The robots.txt advice you copied two years ago is now working against you. Every major AI vendor has split their crawlers in two — one bot that trains on your content, another that powers live search answers. Block everything and you vanish from ChatGPT search results, Claude's retrieval answers, and Perplexity citations. The training bot and the search bot are not the same thing, and treating them as one is quietly costing you visibility.
One Disallow line, two very different bots
When most sites added AI bot blocks, they were thinking about training data, and that was
reasonable. But the crawlers have since multiplied. OpenAI now runs three distinct bots.
Anthropic followed in early 2026. A blanket Disallow: / covering all of these names
catches both the training crawler and the search/retrieval crawler, and only blocking the
latter costs you visibility.
OpenAI's documentation is unambiguous: "Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers." The same logic applies to Claude-SearchBot and Claude-User. Block them and you opt yourself out of citation in real-time AI answers — not just future model training.
| Vendor | Training bot | Search / retrieval bot | Blocking search bot removes you from… |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot & ChatGPT-User | ChatGPT search answers |
| Anthropic | ClaudeBot | Claude-SearchBot & Claude-User | Claude search & live retrieval answers |
| Google | Google-Extended | Googlebot (same as web search) | Blocking Google-Extended does not remove you from AI Overviews |
| ByteDance | Bytespider | — | Training only; no confirmed search product |
The Google row deserves a note of its own. Google's AI Overviews are bundled with its core
search crawler — blocking Google-Extended opts you out of Gemini training, but
does nothing to remove you from AI Overviews. The distinction matters when
you are crafting a strategy.
GPTBot leads the blocked list — and keeps growing
TechnologyChecker.io scanned 29.9 million active domains in Q1 2026 and ranked AI crawlers
by share of DISALLOW rules. The result: GPTBot sits at the top, already
outpacing the older CCBot. Most of these blocks were written before the training/search
split was widely understood, which means a large share of them are silently opting
those sites out of AI search citation.
A BuzzStream analysis of 100 top US and UK news publishers (January 2026) found 79% block at least one training bot — yet Google-Extended was the least-blocked training crawler at 46% overall, with US publishers (58%) far ahead of UK publishers (29%). The appetite is there; the precision is not.
416 billion requests blocked in five months
When Cloudflare declared "Content Independence Day" on 1 July 2025 and began blocking AI crawlers by default on new domains, it gave us the first large-scale picture of just how much AI crawling is happening. Cloudflare CEO Matthew Prince disclosed the total at WIRED's Big Interview in December 2025: 416 billion AI bot requests fended off in roughly five months — around 2.8 billion per day. That is not a niche concern; that is a full-scale industrial extraction operation.
Prince also put Google's structural advantage in plain numbers (January 2026, filed alongside the UK CMA consultation): "Google leverages their search monopoly to see 3.2x as much of the web as OpenAI, 4.8x as much as Microsoft, and more than 6x as much as nearly everyone else." That asymmetry means blocking Google crawlers costs far more visibility than blocking any other vendor's equivalent.
Pay-per-crawl: a middle path is forming
Rather than a binary allow/block choice, publishers are beginning to charge for access. Cloudflare's
Pay Per Crawl marketplace (launched in private beta in July 2025) lets publishers return an
HTTP 402 Payment Required response with a crawler-price header; Cloudflare acts
as Merchant of Record. Over a billion such responses are already going out per day on
Cloudflare's network.
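To make the shape of that response concrete, here is a minimal sketch of a decision function that answers training crawlers with a 402 and a crawler-price header. It is an illustration only, not Cloudflare's implementation: the function name, the `PAID_BOTS` set, the price, and the `USD 0.01` value format are all assumptions made for the example.

```python
from http import HTTPStatus

# Hypothetical set of user-agent tokens to charge; adjust to your own policy.
PAID_BOTS = {"gptbot", "claudebot", "bytespider"}


def respond_to_crawler(user_agent: str, price_usd: str = "0.01"):
    """Return (status_code, headers) for a request with the given User-Agent.

    Training crawlers get 402 Payment Required plus a crawler-price header.
    The "USD <amount>" value format is an assumption for illustration.
    """
    ua = user_agent.lower()
    if any(token in ua for token in PAID_BOTS):
        headers = {"crawler-price": f"USD {price_usd}"}
        return HTTPStatus.PAYMENT_REQUIRED.value, headers
    # Everything else (search bots, browsers) is served normally.
    return HTTPStatus.OK.value, {}


if __name__ == "__main__":
    print(respond_to_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(respond_to_crawler("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
```

Matching on the user-agent string alone is easy to spoof, so a real deployment would pair this with verified-bot signals rather than trusting the header.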
Stack Overflow crystallised the nuanced posture in February 2026: a licensing deal with Cloudflare charges for commercial training access while keeping the community's content freely readable. Training data has value; search retrieval helps the community. Charging for one while permitting the other is now a real product decision, not a thought experiment.
Not all blocking is wrong
This post is not arguing for open-door access to everything. There are solid reasons to keep training crawlers out, and you should weigh them:
- Training data has real value. If your content is original and authoritative, training crawlers are extracting that value for free. The Stack Overflow deal sets a precedent for charging rather than gifting.
- Training crawlers give no attribution and no referral traffic. Unlike search-bot citations, training passes leave no footprint — your content improves a model with no acknowledgement and no link back.
- IP and competitive risk. If your content is proprietary or domain-specific enough to constitute a moat, you may not want it baked into a public LLM that any competitor can query.
The argument here is narrow: block training crawlers deliberately if you choose to, but do it by name — not with a sweeping rule that also catches the search/retrieval bots that drive citations and visibility.
One more caveat: robots.txt enforcement is voluntary. Bytespider has a documented history of ignoring it. Spoofers routinely use Chrome user-agents to avoid detection. A robots.txt entry signals your preference and gives you a legal basis for complaint; it does not technically prevent access. Real control requires server-side rules, WAF configuration, and verified-bot signals (Cloudflare's bot management can verify legitimate crawlers by IP and TLS fingerprint before allowing or charging them). Treat robots.txt as a first layer, not the whole defence.
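As a rough illustration of what a server-side check might look like, the sketch below accepts a request as a genuine crawler only if its user-agent names a known bot and its source IP falls inside that bot's published ranges. It assumes the vendor publishes such ranges; the CIDR blocks here are placeholder documentation addresses, not real crawler ranges, and the names `PUBLISHED_RANGES` and `is_verified_crawler` are hypothetical.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation blocks), NOT real crawler IPs.
# In practice these would be loaded from each vendor's published list.
PUBLISHED_RANGES = {
    "oai-searchbot": ["192.0.2.0/24"],
    "claude-searchbot": ["198.51.100.0/24"],
}


def is_verified_crawler(user_agent: str, remote_ip: str) -> bool:
    """True only if the UA names a known bot AND the IP is in its ranges."""
    ua = user_agent.lower()
    addr = ipaddress.ip_address(remote_ip)
    for token, cidrs in PUBLISHED_RANGES.items():
        if token in ua:
            return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)
    # Unknown user-agent: not a crawler we can verify.
    return False


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"
    print(is_verified_crawler(ua, "192.0.2.10"))   # IP inside the placeholder range
    print(is_verified_crawler(ua, "203.0.113.5"))  # same UA, IP outside the range
```

A request that claims a crawler's name from an address outside its ranges is a spoofing candidate and can be blocked or challenged regardless of what robots.txt says.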
Block smart, not blanket
The first step is a robots.txt that distinguishes training from search, with a deliberate stance for each bot:
| Bot | Role | Recommended stance |
|---|---|---|
| GPTBot | OpenAI training | Block if you want to opt out of training data use |
| OAI-SearchBot | ChatGPT search index | Allow: blocking removes you from ChatGPT search answers |
| ChatGPT-User | Live user fetch | Allow: powers real-time browsing in ChatGPT |
| ClaudeBot | Anthropic training | Block if you want to opt out of training data use |
| Claude-SearchBot | Claude search index | Allow: blocking removes you from Claude search answers |
| Claude-User | Live Claude retrieval | Allow: powers live retrieval in Claude |
| Google-Extended | Gemini training | Block if you want to opt out of Gemini training (does not affect AI Overviews) |
| Bytespider | ByteDance training | Block: no search product, and a documented history of ignoring robots.txt |
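Those stances could be written as a robots.txt along the following lines. This is a sketch, assuming you want to block every training crawler listed and allow every search or retrieval crawler; the Python wrapper uses the standard library's `urllib.robotparser` to check that the rules behave as intended before you deploy them. The `crawl_decisions` helper and the example URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Training crawlers: opted out by name
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Bytespider
Disallow: /

# Search and live-retrieval crawlers: allowed
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
Allow: /

# Everyone else
User-agent: *
Allow: /
"""


def crawl_decisions(robots_txt, agents, url="https://example.com/article"):
    """Map each user-agent token to True (may fetch) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    bots = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
            "Claude-SearchBot", "Claude-User", "Google-Extended", "Bytespider"]
    for bot, allowed in crawl_decisions(ROBOTS_TXT, bots).items():
        print(f"{bot}: {'allow' if allowed else 'block'}")
```

Naming each bot explicitly is the point: no wildcard Disallow appears anywhere, so a search crawler is never caught by a rule meant for a training crawler.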
The second step is to check whether your CDN or WAF has already made this decision for you. Cloudflare began blocking AI crawlers by default on all new domains from 1 July 2025. If you set up or migrated a domain after that date, AI crawlers may already be blocked at the network layer — including the search bots you may want to allow. Check your Cloudflare Security → Bots settings before assuming your robots.txt is doing the work.
The third step is to measure your current AI visibility so you know what you have to lose or gain. That is what Baseline's brand scanner does.
Sources: OpenAI platform docs (openai.com/gptbot, openai.com/searchbot); Anthropic support docs on Claude crawlers (coverage: Search Engine Journal); TechnologyChecker.io, "robots.txt AI crawlers blocking report Q1 2026" — technologychecker.io (29.9M domains); BuzzStream, "How Publishers Are Blocking AI" (Jan 2026) — buzzstream.com; Cloudflare "Content Independence Day" press release (Jul 2025) — cloudflare.com; 416B requests stat — Matthew Prince at WIRED Big Interview, Dec 4 2025 (via Tom's Hardware); Prince on Google web access advantage (Jan 2026) via Search Engine Land; Cloudflare Pay Per Crawl — blog.cloudflare.com, AI Crawl Control.