Cloud services giant Fastly has released a report claiming AI crawlers are putting a heavy load on the open web, slurping up sites at a rate that accounts for 80 percent of all AI bot traffic, with the remaining 20 percent coming from AI fetchers. Both crawlers and fetchers can hit websites hard, demanding data from a single site in thousands of requests per minute.
I can only see one thing causing this to stop: the AI bubble popping
According to the report [PDF], Facebook owner Meta's AI division accounts for more than half of those crawlers, while OpenAI accounts for the overwhelming majority of on-demand fetch requests.
"AI bots are reshaping how the internet is accessed and experienced, introducing new complexities for digital platforms," Fastly senior security researcher Arun Kumar opined in a statement on the report's release. "Whether scraping for training data or delivering real-time responses, these bots create new challenges for visibility, control, and cost. You can't secure what you can't see, and without clear verification standards, AI-driven automation risks are becoming a blind spot for digital teams."
The company's report is based on analysis of Fastly's Next-Gen Web Application Firewall (NGWAF) and Bot Management services, which the company says "protect over 130,000 applications and APIs and inspect more than 6.5 trillion requests per month" – giving it plenty of data to play with. That data reveals a growing problem: an increasing share of website load comes not from human visitors but from automated crawlers and fetchers working on behalf of chatbot firms.
"Some AI bots, if not carefully engineered, can inadvertently impose an unsustainable load on webservers," Fastly's report warned, "leading to performance degradation, service disruption, and increased operational costs." Kumar separately noted to The Register: "Clearly this growth isn't sustainable, creating operational challenges while also undermining the business model of content creators. We as an industry need to do more to establish responsible norms and standards for crawling that allow AI companies to get the data they need while respecting websites' content guidelines."
That growing traffic comes from just a select few companies. Meta accounted for more than half of all AI crawler traffic on its own, at 52 percent, followed by Google and OpenAI at 23 percent and 20 percent respectively – giving the trio a combined 95 percent of all AI crawler traffic. Anthropic, by contrast, accounted for just 3.76 percent of crawler traffic. The Common Crawl Project, which slurps websites into a free public dataset intended to prevent exactly the duplication of effort and traffic multiplication at the heart of the crawler problem, accounted for a surprisingly low 0.21 percent.
The story flips when it comes to AI fetchers, which unlike crawlers are fired off on demand when a user asks a model to incorporate information newer than its training cut-off date. Here, OpenAI was by far the dominant traffic source, Fastly found, accounting for almost 98 percent of all requests. That's an indication, perhaps, of just how much of a lead OpenAI's early entry into the consumer-facing AI chatbot market with ChatGPT gave the company, or possibly just a sign that the company's bot infrastructure may be in need of optimization.
While AI fetchers make up a minority of AI bot requests – only about 20 percent, says Kumar – they can be responsible for huge bursts of traffic, with one fetcher generating over 39,000 requests per minute during the testing period. "We expect fetcher traffic to grow as AI tools become more widely adopted and as more agentic tools come into use that mediate the experience between people and websites," Kumar told The Register.
Perplexity AI, which was recently accused of using IP addresses outside its published crawler ranges and ignoring robots.txt directives from sites looking to opt out of being scraped, accounted for just 1.12 percent of AI crawler traffic and 1.53 percent of AI fetcher traffic recorded for the report – though the report noted that this share is growing.
Kumar decried the practice of ignoring robots.txt directives, telling El Reg: "At a minimum, any reputable AI company today should be honoring robots.txt. Further and even more critically, they should publish their IP address ranges and their bots should use unique names. This will empower site operators to better distinguish the bots crawling their sites and allow them to enforce granular rules with bot management solutions."
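For illustration, the granular control Kumar describes only works when each bot announces itself with a distinct, documented user-agent token that a robots.txt file can address individually. A minimal example might look like the following – the tokens shown are ones the respective companies have publicly documented for their crawlers, but operators should confirm current names in each vendor's documentation, and none of this binds a bot that simply ignores the file.

```
# Illustrative robots.txt: per-bot rules rely on crawlers honestly
# identifying themselves with unique user-agent tokens.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# All other crawlers: allowed, but kept out of costly dynamic endpoints
User-agent: *
Disallow: /search
```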
But he stopped short of calling for mandated standards, saying that industry forums are working on solutions. "We need to let those processes play out. Mandating technical standards in regulatory frameworks often does not produce a good outcome and shouldn't be our first resort."
It's a problem large enough that users have begun fighting back. In the face of bots riding roughshod over polite opt-outs like robots.txt directives, webmasters are increasingly turning to active countermeasures such as the proof-of-work checker Anubis or the gibberish-feeding tarpit Nepenthes, while Fastly rival Cloudflare has been testing a pay-per-crawl approach to put a financial burden on the bot operators. "Care must be exercised when employing these techniques," Fastly's report warned, "to avoid accidentally blocking legitimate users or downgrading their experience."
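The idea behind Anubis-style proof-of-work is straightforward, even if the real tool is considerably more sophisticated: before serving a page, the server hands the client a random challenge, and the client must burn CPU time finding a nonce whose hash clears a difficulty target. The Python sketch below is only a toy illustration of that principle – not Anubis's actual implementation – but it shows why the cost is negligible for a single human visitor and painful for a scraper making thousands of requests per minute.

```python
import hashlib
import secrets

DIFFICULTY = 18  # required leading zero bits; kept low here so the demo finishes quickly


def issue_challenge() -> str:
    """Server side: hand the client a random challenge string."""
    return secrets.token_hex(16)


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                return bits
            bits += 1
    return bits


def solve(challenge: str) -> int:
    """Client side: grind nonces until the hash clears the difficulty bar."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY:
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single cheap hash confirms the client did the work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY


if __name__ == "__main__":
    c = issue_challenge()
    n = solve(c)          # expensive for the client
    print(verify(c, n))   # cheap for the server -> True
```

In real deployments the work is done by JavaScript in the visitor's browser before the protected page is served, which is also why such challenges push scrapers toward running full headless browsers rather than blindly fetching HTML.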
Kumar noted that small site operators, especially those serving dynamic content, are likely to feel the effects most severely, and he had some recommendations. "The first and simplest step is to configure robots.txt, which immediately reduces traffic from well-behaved bots. When technical expertise is available, websites can also deploy controls such as Anubis, which can help reduce bot traffic." He warned, however, that bots are always improving and trying to find ways around countermeasures like Anubis, as code-hosting site Codeberg recently experienced. "This creates a constant cat and mouse game, similar to what we observe with other types of bots today," he said.
We spoke to Anubis developer Xe Iaso, CEO of Techaro. When we asked whether they expected the growth in crawler traffic to slow, they said: "I can only see one thing causing this to stop: the AI bubble popping.
"There is simply too much hype to give people worse versions of documents, emails, and websites otherwise. I don't know what this actually gives people, but our industry takes great pride in doing this."
However, they added: "I see no reason why it would not grow. People are using these tools to replace knowledge and gaining skills. There's no reason to assume that this attack against our cultural sense of thrift will not continue. This is the perfect attack against middle-management: unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance, and that can produce output that superficially resembles the output of human employees. I see no reason that this won't continue to grow until and unless the bubble pops. Even then, a lot of those scrapers will probably stick around until their venture capital runs out."
Regulation – we've heard of it
The Register asked Xe whether they thought broader deployment of Anubis and other active countermeasures would help.
They responded: "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming. Ironically enough, most of these AI companies rely on the communities they are destroying.
"This presents the kind of paradox that I would expect to read in a Neal Stephenson book from the '90s, not CBC's front page. Anubis helps mitigate a lot of the badness by making attacks more computationally expensive. Anubis (even in configurations that omit proof of work) makes attackers have to retool their scraping to use headless browsers instead of blindly scraping HTML."
And who is paying the piper?
"This increases the infrastructure costs of the AI companies propagating this abusive traffic. The hope is that this makes it fiscally unviable for AI companies to scrape by making them have to dedicate much more hardware to the problem. In essence: it makes the scrapers have to spend more money to do the same work."
We approached Anthropic, Google, Meta, OpenAI, and Perplexity but none provided a comment on the report by the time of publication. ®