AI crawlers are hammering sites and nearly taking them offline

Kyle Wiens recognized something was wrong in July when his staff at iFixit, a website outlining how to fix common household items, began receiving alerts about high traffic on their cellphones. The development team checked the tool that tracks the site’s web traffic (as a heavily visited site, iFixit keeps a regular eye on visitor numbers). “It became pretty clear that it was clogged,” Wiens says.

Digging deeper into the data, iFixit employees realized they had been hit with nearly a million queries to the company’s website in a little over 24 hours, a number Wiens says was “abnormally high.” They were also able to identify the cause: a web crawler sent out into the world by Anthropic, maker of the Claude chatbot, to gather training data.

Wiens is far from alone: A number of websites have begun to take action to fend off crawlers, seeking to avoid the negative impact of being bombarded with requests. An increasing number of websites are putting restrictions on AI crawlers, according to a recent analysis by the Data Provenance Initiative (DPI), a group of AI researchers. In the DPI’s analysis, around one in four tokens from the most critical web domains targeted by crawlers are now covered by such restrictions. And social media is buzzing with complaints about web crawlers increasingly driving up traffic on websites.

Edd Coates is one of those who has raised concerns online. He runs Game UI Database, a reference tool cataloging interface details taken from games. The website was relaunched in early August, drawing large volumes of visitors keen to check it out. But a few weeks later, the website’s performance declined dramatically, slowing to a crawl. “I thought that was weird, because we had about a quarter of the people visiting the website that did at the relaunch,” says Coates. “And it’s somehow running slower.”

Coates and his web developer checked the website’s server logs, which turned up the cause of the problem: an OpenAI crawler was hammering the site. “They were hitting the site so hard,” he says. “It was, like, 200 times a second.” OpenAI doesn’t dispute that its GPTBot crawler visited Game UI Database, but it does dispute how frequently the crawler was hitting the website, pointing to evidence suggesting the number of queries per second was only around three.

An OpenAI spokesperson told Fast Company: “We enable publishers to use industry-standard tools to express preferences about access to their websites. By using robots.txt publishers can set time delays and reduce load on their systems, choose to allow access to only certain pages or directories, or opt out entirely. We stopped accessing this website as soon as they updated their robots.txt directions for our bot, as our systems recognized and respected this.”
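The controls the spokesperson refers to live in a plain-text robots.txt file at a site’s root. The snippet below is an illustrative sketch rather than any real site’s file: the paths and the delay value are invented, and Crawl-delay is a non-standard directive that only some crawlers honor.

```
# Hypothetical robots.txt served from https://example.com/robots.txt
# (the paths and the 10-second delay are illustrative only)

# Ask OpenAI's crawler to pace its requests and stay out of
# expensive, dynamically generated pages, while leaving the
# rest of the site open to it. Crawl-delay is non-standard
# and is ignored by crawlers that don't support it.
User-agent: GPTBot
Crawl-delay: 10
Disallow: /search/
Allow: /
```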

Despite that, Coates felt the impact. “They were essentially siphoning off 80 [gigabytes] a day, or something crazy like that, from us,” he alleges. (Again, OpenAI disagrees with this.) Game UI Database was hosted on its own server, but Coates estimates that the traffic he says followed OpenAI’s crawler hitting the website would have cost him around 800 pounds ($1,000) a day if he were on a commercial web hosting provider.

Just as with Wiens and iFixit, Game UI Database blocked access to GPTBot. “Straight away, the website started running completely fine, smooth as butter,” says Coates.

Some would say this is simply the world we live in now, with AI companies seeking ever more data on which to train their models. Wiens is realistic about running a website in 2024. “All of these AI tools are out there crawling everybody right now,” he says. “There are polite levels of crawling, and this superseded that threshold.” With understatement, he adds: “This was quite a bit greater than that.” An Anthropic spokesperson told Fast Company: “Our crawling user agent ClaudeBot respects robots.txt, the industry accepted signal for blocking web crawling.”

Wiens believes a bug on Anthropic’s side turned an acceptable level of crawling into a far more extreme one. But it had an impact nonetheless. “It takes us off engineering work,” he says. As a result, Wiens has changed iFixit’s robots.txt file, which publishes instructions for any bot or crawler visiting the website, to block it from being crawled. “I’m looking at our logs now and every single day since then, they have hit our robots.txt file looking for permission to crawl the site,” he says. The day before we spoke, crawlers hit the website nine times asking for permission to trawl through its data.
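The fuller opt-out Wiens describes amounts to one Disallow rule per crawler. As a rough sketch, assuming a site wants to turn away only the two training bots named in this story, the file might read:

```
# Sketch of a robots.txt that refuses the AI training crawlers
# discussed above; bots not listed here are unaffected.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

As both Wiens’s logs and OpenAI’s statement underline, the file is advisory: a crawler only stops because it chooses to fetch these rules and respect them.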

Such persistent attempts to knock on the door of websites and ask to be let in, only to then pillage the site of its content for training data, are something Coates is less sanguine about than Wiens. “It shows they don’t care,” Coates says. “I think at the end of the day, they only care about themselves. They only care about lining their own pockets.”

Wiens is also worried, but believes it’s incumbent on both parties to find a solution. “We have to find a way to coexist with the AI tools,” he says. “I don’t think we’re going to, can, or should stop them, but if they take the content and then regurgitate it without providing people with the original source, it’s a real problem.”

The mounting anecdotal evidence has concerned others whose livelihoods could be affected by the rise of AI. “They think that everything is available for them to use,” says Reid Southen, a film concept artist who has been a vocal critic of AI companies on X.

“Nobody’s benefiting from this,” Coates adds. “Everyone loses eventually.”