Web Scraping for AI in 2026: The Open Web Is Closing Fast

June 20, 2026 · Abhishek Gupta
[Infographic: 51% of 2026 web traffic is AI bots, and 22.7% of the web sits behind Cloudflare, which blocks AI by default]

Half the traffic on the internet in 2026 isn't human. AI bots account for 51% of all web traffic — and the web they feed on is being walled off in real time.

That's the trap every AI product is walking into. Web scraping for AI was close to a solved problem two years ago: point a crawler at a page, get the data back. Now the page checks who's asking and, most of the time, says no.

The short version

  • Cloudflare sits in front of 22.7% of all websites and now blocks AI crawlers by default on new domains.
  • More than 2.5 million sites explicitly disallow AI training in their robots rules.
  • AI bots make up 51% of web traffic, with AI-crawler volume up roughly 400% in 2025 alone.
  • robots.txt stopped being a fence — about 13% of AI bots ignore it outright.
  • Data acquisition just turned from the easy part of building AI into the hard part.

The Web Stopped Being Free to Read

In July 2025, Cloudflare flipped the default: new domains on its network block AI crawlers unless the owner opts in. It then launched Pay Per Crawl — when a crawler hits a protected page, it gets back an HTTP 402, the long-dormant "Payment Required" code, instead of the content (Search Engine Land).
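For a crawler, the practical consequence is that a 402 needs its own branch instead of being lumped in with generic failures. A minimal sketch of that branching, assuming a hypothetical `classify_response` helper and made-up action names (nothing here is Cloudflare's API, and it deliberately ignores any pricing headers a real paid-crawl response might carry):

```python
from dataclasses import dataclass
from http import HTTPStatus


@dataclass
class FetchDecision:
    action: str  # hypothetical labels: parse, needs_payment, blocked, retry_later, skip
    reason: str


def classify_response(status: int) -> FetchDecision:
    """Map an HTTP status code to what the crawler should do next."""
    if status == HTTPStatus.OK:
        return FetchDecision("parse", "content returned")
    if status == HTTPStatus.PAYMENT_REQUIRED:
        # 402: the page exists, but access is metered. Route it to a
        # licensing or budget decision instead of retrying blindly.
        return FetchDecision("needs_payment", "site asks for payment before serving content")
    if status in (HTTPStatus.UNAUTHORIZED, HTTPStatus.FORBIDDEN):
        return FetchDecision("blocked", "access denied for this client")
    if status == HTTPStatus.TOO_MANY_REQUESTS or status >= 500:
        return FetchDecision("retry_later", "rate limited or server error")
    return FetchDecision("skip", f"unhandled status {status}")


for code in (200, 402, 403, 429):
    print(code, classify_response(code).action)
```

The point of separating `needs_payment` from `blocked` is that the two call for different follow-ups: one is a commercial decision, the other is a dead end.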

This is not a fringe setting. Cloudflare sits in front of 22.7% of all websites as of May 2026, and more than a million customers have switched on the one-click AI block (Coronium).

The deals followed the infrastructure. In February 2026, Stack Overflow wired up Cloudflare to return a 402 to specific crawlers — humans still read every answer for free, while commercial AI training gets metered and billed. A decade of programming Q&A went from open training data to a paid API overnight.

Who Is Actually Crawling — and Who Gets Blocked First

The crawler mix tells you where this is heading. AI-specific bots now rival the search engines that built the open web.

[Figure: AI crawler share of bot traffic in 2026. Googlebot 38.7%, GPTBot 12.8%, Meta-ExternalAgent 11.6%, ClaudeBot 11.4%, Bingbot 9.7%]

Googlebot still leads at 38.7% of identified bot traffic. But GPTBot, Meta-ExternalAgent, and ClaudeBot together land near 35.8% — almost the same footprint, built in about three years.
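The arithmetic behind that comparison is easy to check from the shares listed above. A quick sketch (the variable names are just for illustration):

```python
# Share of identified bot traffic, in percent, as quoted in this article.
shares = {
    "Googlebot": 38.7,
    "GPTBot": 12.8,
    "Meta-ExternalAgent": 11.6,
    "ClaudeBot": 11.4,
    "Bingbot": 9.7,
}

ai_bots = ("GPTBot", "Meta-ExternalAgent", "ClaudeBot")

# Combined footprint of the three AI crawlers, rounded to one decimal place.
ai_combined = round(sum(shares[name] for name in ai_bots), 1)

# How far the three together trail Googlebot, in percentage points.
gap = round(shares["Googlebot"] - ai_combined, 1)

print(ai_combined)  # 35.8
print(gap)          # 2.9
```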

That visibility makes them targets. GPTBot is the single most-blocked crawler on the web, disallowed by roughly 19% of sites. The pattern is blunt: the better-known your bot, the faster doors close on it.
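In practice, a site "disallowing" GPTBot usually means a few lines in robots.txt naming that user agent. A small sketch using Python's standard `urllib.robotparser`, with an invented robots.txt and an example.com URL as stand-ins for a real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is shut out, every other agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str = "https://example.com/answers/1") -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


print(allowed("GPTBot"))     # False
print(allowed("Googlebot"))  # True
```

Note that this check runs on the crawler's side. Nothing in the file stops a bot that never performs it, which is why the well-behaved, well-known crawlers are the ones a rule like this actually catches.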

Blocking Bots Quietly Favors the Biggest Players

Here's the part most people miss. Walling off the web doesn't stop AI — it changes who can afford to feed it.

When data moves behind a 402, the company that can write a check to Stack Overflow, Reddit, or a news group keeps its pipeline. OpenAI and Google have signed those licensing deals. A two-person startup training a vertical model cannot, and a paywall it can't pay is the same as a wall.

So the moat shifts. It moves away from model architecture, which everyone now rents, and toward data acquisition — the unglamorous work of reaching real pages, at scale, without getting blocked, and turning them into something a model can actually use.

What "Web Scraping for AI" Has to Become

The phrase is outdated. What teams need in 2026 isn't a crawler — it's an acquisition layer that manages identity, request rate, deduplication, freshness, and compliance as one system.
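To make "one system" a little more concrete, here is a toy sketch of two of those five concerns, request rate and deduplication, sharing a single decision point. It is an illustration under simplifying assumptions, not how ScrapeOps or any real product works: the class name, method names, and return labels are all hypothetical, and identity, freshness, and compliance are left out entirely.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host and drop the fragment so trivial variants collapse."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalized, lowercased text to spot exact repeats."""
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AcquisitionQueue:
    """Decides per URL whether to fetch now, wait, or drop it as a duplicate."""

    def __init__(self, min_interval_s: float = 1.0):
        self.min_interval_s = min_interval_s
        self._seen_urls = set()
        self._seen_content = set()
        self._next_allowed = {}  # host -> earliest time the next request may go out

    def decide(self, url: str, now: float) -> str:
        key = normalize_url(url)
        host = urlsplit(key).netloc
        if key in self._seen_urls:
            return "duplicate"
        if now < self._next_allowed.get(host, 0.0):
            return "wait"  # per-host politeness delay has not elapsed yet
        self._seen_urls.add(key)
        self._next_allowed[host] = now + self.min_interval_s
        return "fetch"

    def is_new_content(self, text: str) -> bool:
        fingerprint = content_fingerprint(text)
        if fingerprint in self._seen_content:
            return False
        self._seen_content.add(fingerprint)
        return True


queue = AcquisitionQueue(min_interval_s=2.0)
print(queue.decide("https://Example.com/a#top", now=0.0))  # fetch
print(queue.decide("https://example.com/b", now=1.0))      # wait
print(queue.decide("https://example.com/a", now=5.0))      # duplicate
```

Passing `now` in explicitly keeps the sketch deterministic. A real system would read a clock and would need near-duplicate detection, not just exact hashes.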

That is the exact problem we built ScrapeOps to solve: turn one question into hundreds of relevant, deduplicated, comprehension-ready sources, instead of a folder of half-blocked HTML. The "where's the data" problem is what breaks most AI systems before the model ever runs — and it's now the part getting harder every quarter.

The open web spent thirty years as a free read. The companies that treat its closing as an infrastructure problem — not a licensing line item — are the ones who'll still have data to work with in 2027.

Frequently Asked Questions

Is web scraping for AI still legal in 2026?

Public-data scraping remains broadly legal in most jurisdictions, but the ground has shifted from "who is collecting" to "how the data is used." Purpose-based controls and per-crawl licensing now govern most large sites, so commercial AI training increasingly requires permission or payment even when the page is public.

What is Cloudflare Pay Per Crawl?

Pay Per Crawl is a Cloudflare feature that returns an HTTP 402 "Payment Required" response to AI crawlers instead of page content. Site owners can charge AI companies for access while keeping the same pages free for human visitors.

Does robots.txt still stop AI crawlers?

Not reliably. Around 13% of AI bots ignore robots.txt entirely, and the file was never enforceable: it's a request, not a lock. Site owners have moved to network-level blocking through providers like Cloudflare for control that actually holds.

Why is web data getting harder to collect for AI?

Default crawler blocking, paid-access licensing, and bot-detection systems have closed off a large share of the open web in 2026. The data still exists, but reaching it now takes an acquisition layer that handles identity, rate limits, and deduplication, not a simple crawler.


Abhishek Gupta is Co-Founder at Dekrypt Labs, building ScrapeOps, the data acquisition engine that turns any question into clean, deduplicated, comprehension-ready sources. dekryptlabs.com