'Block all AI bots' is now actively costing you visibility

The robots.txt advice you copied two years ago is now working against you. OpenAI and Anthropic have each split their crawlers by purpose: one bot trains on your content, while others power live search answers. Block everything and you vanish from ChatGPT search results, Claude's retrieval answers, and Perplexity citations. The training bot and the search bot are not the same thing, and treating them as one is quietly costing you visibility.

One Disallow line, two very different bots

When most sites added AI bot blocks they were thinking about training data, and that was reasonable. But the crawlers have since multiplied. OpenAI now runs three distinct bots. Anthropic followed in early 2026. A blanket Disallow: / applied to every one of these names catches both the training crawler and the search/retrieval crawlers, and only the second of those blocks costs you visibility.

OpenAI's documentation is unambiguous: "Sites that are opted out of OAI-SearchBot will not be shown in ChatGPT search answers." The same logic applies to Claude-SearchBot and Claude-User. Block them and you opt yourself out of citation in real-time AI answers — not just future model training.
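
To check whether an existing robots.txt shuts out the search and retrieval bots, one option is to parse it and ask about each bot by name. The sketch below uses Python's standard urllib.robotparser. The helper name blocked_search_bots, the two sample files, and the example.com URL are illustrative assumptions, and the standard-library parser may differ in detail from how each vendor's crawler matches rules.

```python
from urllib.robotparser import RobotFileParser

# Search/retrieval user-agent tokens named in this article.
SEARCH_BOTS = ["OAI-SearchBot", "ChatGPT-User", "Claude-SearchBot", "Claude-User"]


def blocked_search_bots(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the search/retrieval bots that this robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in SEARCH_BOTS if not parser.can_fetch(bot, url)]


# Hypothetical "block all AI bots" list of the kind many sites copied.
BLANKET = """\
User-agent: GPTBot
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-SearchBot
User-agent: Claude-User
Disallow: /
"""

# Hypothetical targeted list: training crawlers only.
TARGETED = """\
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /
"""

print(blocked_search_bots(BLANKET))   # all four search/retrieval bots
print(blocked_search_bots(TARGETED))  # empty list
```

Running the same helper against your live robots.txt gives a quick read on whether a training opt-out has also removed you from search answers.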

| Vendor | Training bot | Search / retrieval bot | Blocking search bot removes you from… |
| --- | --- | --- | --- |
| OpenAI | GPTBot | OAI-SearchBot & ChatGPT-User | ChatGPT search answers |
| Anthropic | ClaudeBot | Claude-SearchBot & Claude-User | Claude search & live retrieval answers |
| Google | Google-Extended | Googlebot (same as web search) | Blocking Google-Extended does not remove you from AI Overviews |
| ByteDance | Bytespider | n/a | Training only; no confirmed search product |

The Google row deserves a note of its own. Google's AI Overviews are bundled with its core search crawler: blocking Google-Extended opts you out of Gemini training, but does nothing to remove you from AI Overviews. In practice, Google-Extended is a training opt-out only, and any rule aimed at AI Overviews would have to target Googlebot, which also governs ordinary search.

GPTBot leads the blocked list — and keeps growing

TechnologyChecker.io scanned 29.9 million active domains in Q1 2026 and ranked AI crawlers by share of DISALLOW rules. The result: GPTBot sits at the top, already outpacing the older CCBot. Most of these blocks were written before the training/search split was widely understood, which means a large share of them are silently opting those sites out of AI search citation.

  • 5.52% of all DISALLOW rules block GPTBot, the most-blocked AI crawler
  • 5.08% block CCBot
  • 4.88% block ClaudeBot
  • 4.44% block Google-Extended

A BuzzStream analysis of 100 top US and UK news publishers (January 2026) found 79% block at least one training bot — yet Google-Extended was the least-blocked training crawler at 46% overall, with US publishers (58%) far ahead of UK publishers (29%). The appetite is there; the precision is not.

416 billion requests blocked in five months

When Cloudflare declared "Content Independence Day" on 1 July 2025 and began blocking AI crawlers by default on new domains, it gave us the first large-scale picture of just how much AI crawling is happening. Cloudflare CEO Matthew Prince disclosed the total at WIRED's Big Interview in December 2025: 416 billion AI bot requests fended off in roughly five months — around 2.8 billion per day. That is not a niche concern; that is a full-scale industrial extraction operation.

  • 416B: AI bot requests blocked by Cloudflare, Jul–Dec 2025
  • ~2.8B/day: average AI bot requests blocked on Cloudflare's network over that period
  • 1B+/day: HTTP 402 Pay Per Crawl responses already sent via Cloudflare's marketplace

Prince also put Google's structural advantage in plain numbers (January 2026, filed alongside the UK CMA consultation): "Google leverages their search monopoly to see 3.2x as much of the web as OpenAI, 4.8x as much as Microsoft, and more than 6x as much as nearly everyone else." That asymmetry means blocking Google crawlers costs far more visibility than blocking any other vendor's equivalent.

Pay-per-crawl: a middle path is forming

Rather than a binary allow/block choice, publishers are beginning to charge for access. Cloudflare's Pay Per Crawl marketplace (announced in July 2025) lets publishers return an HTTP 402 Payment Required response with a crawler-price header; Cloudflare acts as Merchant of Record. Over a billion such responses are already going out per day on Cloudflare's network.
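
The shape of that exchange can be sketched in a few lines. This is a minimal illustration of the decision, not Cloudflare's implementation. The status code and the crawler-price header name come from the description above. The CrawlResponse type, the PAID_BOTS set, the price value and its format, and the payment_accepted flag are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlResponse:
    status: int
    headers: dict[str, str] = field(default_factory=dict)


# Hypothetical policy: training crawlers are charged, everything else is served.
PAID_BOTS = {"gptbot", "claudebot"}
PRICE = "USD 0.01"  # illustrative value and format, not a real tariff


def respond(user_agent: str, payment_accepted: bool = False) -> CrawlResponse:
    """Serve the page, or quote a price with HTTP 402 if a charged bot has not paid."""
    token = user_agent.lower()
    charged = any(bot in token for bot in PAID_BOTS)
    if charged and not payment_accepted:
        return CrawlResponse(402, {"crawler-price": PRICE})
    return CrawlResponse(200)


print(respond("GPTBot/1.2"))                         # 402 with a price quote
print(respond("OAI-SearchBot/1.0"))                  # 200, search bot is not charged
print(respond("GPTBot/1.2", payment_accepted=True))  # 200 once payment is settled
```

The point of the sketch is that charging is decided per bot: a training crawler gets a price, and a search bot passes straight through.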

Stack Overflow crystallised the nuanced posture in February 2026: a licensing deal with Cloudflare charges for commercial training access while keeping the community's content freely readable. Training data has value; search retrieval helps the community. Charging for one while permitting the other is now a real product decision, not a thought experiment.

Not all blocking is wrong

This post is not arguing for open-door access to everything. There are solid reasons to keep training crawlers out, and you should weigh them:

  • Training data has real value. If your content is original and authoritative, training crawlers are extracting that value for free. The Stack Overflow deal sets a precedent for charging rather than gifting.
  • Training crawlers give no attribution and no referral traffic. Unlike search-bot citations, training passes leave no footprint — your content improves a model with no acknowledgement and no link back.
  • IP and competitive risk. If your content is proprietary or domain-specific enough to constitute a moat, you may not want it baked into a public LLM that any competitor can query.

The argument here is narrow: block training crawlers deliberately if you choose to, but do it by name — not with a sweeping rule that also catches the search/retrieval bots that drive citations and visibility.

One more caveat: robots.txt enforcement is voluntary. Bytespider has a documented history of ignoring it. Spoofers routinely use Chrome user-agents to avoid detection. A robots.txt entry signals your preference and gives you a legal basis for complaint; it does not technically prevent access. Real control requires server-side rules, WAF configuration, and verified-bot signals (Cloudflare's bot management can verify legitimate crawlers by IP and TLS fingerprint before allowing or charging them). Treat robots.txt as a first layer, not the whole defence.
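
A server-side check of the kind described above might compare the claimed user-agent against the address ranges the vendor publishes. The sketch below covers only the IP half of that verification and leaves TLS fingerprinting out. The PUBLISHED_RANGES table is hypothetical and uses reserved documentation address blocks; real ranges would have to come from each vendor's own published list.

```python
import ipaddress

# Hypothetical ranges for illustration only (RFC 5737 documentation blocks).
PUBLISHED_RANGES = {
    "gptbot": ["192.0.2.0/24"],
    "claudebot": ["198.51.100.0/24"],
}


def classify(user_agent: str, remote_ip: str) -> str:
    """Label a request 'verified', 'spoofed', or 'unclaimed' from its UA and source IP."""
    ua = user_agent.lower()
    ip = ipaddress.ip_address(remote_ip)
    for token, cidrs in PUBLISHED_RANGES.items():
        if token in ua:
            # The request claims to be this bot; trust it only if the IP matches.
            if any(ip in ipaddress.ip_network(cidr) for cidr in cidrs):
                return "verified"
            return "spoofed"
    # No known bot token in the user-agent; this check cannot say more.
    return "unclaimed"


print(classify("Mozilla/5.0 (compatible; GPTBot/1.2)", "192.0.2.10"))
print(classify("GPTBot/1.2", "203.0.113.5"))
print(classify("Mozilla/5.0 Chrome/120.0", "203.0.113.5"))
```

A crawler hiding behind a Chrome user-agent lands in the "unclaimed" bucket, which is why IP checks alone are not enough and behavioural or fingerprint signals still matter.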

Block smart, not blanket

The first step is a robots.txt policy that distinguishes training bots from search bots. Bot by bot, it looks like this:

| Bot | Role | Recommended stance |
| --- | --- | --- |
| GPTBot | OpenAI training | Block if you want to opt out of training data use |
| OAI-SearchBot | ChatGPT search index | Allow: blocking removes you from ChatGPT search answers |
| ChatGPT-User | Live user fetch | Allow: powers real-time browsing in ChatGPT |
| ClaudeBot | Anthropic training | Block if you want to opt out of training data use |
| Claude-SearchBot | Claude search index | Allow: blocking removes you from Claude search answers |
| Claude-User | Live Claude retrieval | Allow: powers live retrieval in Claude |
| Google-Extended | Gemini / SGE training | Block if you want to opt out of Gemini training (does not affect AI Overviews) |
| Bytespider | ByteDance training | Block: no search product, and a documented history of ignoring robots.txt |
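
Turned into an actual file, the stances in the table above could be generated like this. The sketch assumes you have chosen to opt out of training for every training crawler listed, which is a decision, not a requirement. The POLICY dictionary and the render_robots_txt helper are illustrative names, and each bot gets its own group so a later change to one stance cannot spill over to another.

```python
# Stance per bot, following the table above. True = allow, False = block.
# Assumption: the site opts out of all listed training crawlers.
POLICY = {
    "GPTBot": False,
    "OAI-SearchBot": True,
    "ChatGPT-User": True,
    "ClaudeBot": False,
    "Claude-SearchBot": True,
    "Claude-User": True,
    "Google-Extended": False,
    "Bytespider": False,
}


def render_robots_txt(policy: dict[str, bool]) -> str:
    """Render one User-agent group per bot, separated by blank lines."""
    groups = []
    for bot, allowed in policy.items():
        rule = "Allow: /" if allowed else "Disallow: /"
        groups.append(f"User-agent: {bot}\n{rule}")
    return "\n\n".join(groups) + "\n"


print(render_robots_txt(POLICY))
```

Naming every bot explicitly is the whole point: a training opt-out never touches a search bot unless you write that line yourself.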

The second step is to check whether your CDN or WAF has already made this decision for you. Cloudflare began blocking AI crawlers by default on all new domains from 1 July 2025. If you set up or migrated a domain after that date, AI crawlers may already be blocked at the network layer — including the search bots you may want to allow. Check your Cloudflare Security → Bots settings before assuming your robots.txt is doing the work.

The third step is to measure your current AI visibility so you know what you have to lose or gain. That is what Baseline's brand scanner does.

Sources: OpenAI platform docs (openai.com/gptbot, openai.com/searchbot); Anthropic support docs on Claude crawlers (coverage: Search Engine Journal); TechnologyChecker.io, "robots.txt AI crawlers blocking report Q1 2026" — technologychecker.io (29.9M domains); BuzzStream, "How Publishers Are Blocking AI" (Jan 2026) — buzzstream.com; Cloudflare "Content Independence Day" press release (Jul 2025) — cloudflare.com; 416B requests stat — Matthew Prince at WIRED Big Interview, Dec 4 2025 (via Tom's Hardware); Prince on Google web access advantage (Jan 2026) via Search Engine Land; Cloudflare Pay Per Crawl — blog.cloudflare.com, AI Crawl Control.
