Web-scraping bots have become an unsupportable burden for the Wikimedia community due to their insatiable appetite for online content to train AI models.
Representatives from the Wikimedia Foundation, which oversees Wikipedia and similar community-based projects, say that since January 2024, the bandwidth spent serving requests for multimedia files has increased by 50 percent.
"This increase is not coming from human readers, but largely from automated programs that scrape the Wikimedia Commons image catalog of openly licensed images to feed images to AI models," explained Birgit Mueller, Chris Danis, and Giuseppe Lavagetto, from the Wikimedia Foundation in a public post.
This increase is not coming from human readers
"Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs."
According to the Wikimedians, at least 65 percent of the traffic for the most expensive content served by Wikimedia Foundation datacenters is generated by bots, even though these software agents represent only about 35 percent of page views.
That's due to the Wikimedia Foundation's caching scheme, which distributes popular content to regional datacenters around the globe for better performance. Bots crawl pages without regard to their popularity, and their requests for less popular content mean that material has to be fetched from the core datacenter, which consumes more computing resources.
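The economics are easy to picture with a toy model. The sketch below is not Wikimedia's actual code; it just assumes a simple key-value regional cache and an "origin" fetch standing in for the core datacenter, to show why a crawler walking the long tail costs more than human traffic concentrated on popular pages.

# Toy sketch (hypothetical, not Wikimedia's stack) of the cache behavior above:
# popular pages sit in a regional cache; everything else falls back to an
# expensive fetch from the core datacenter.

REGIONAL_CACHE = {"Main_Page": "<cached html>"}  # hypothetical "hot" content

def fetch_from_core(title: str) -> str:
    # Stand-in for the costly trip back to the core datacenter.
    return f"<html for {title}>"

def serve(title: str) -> str:
    if title in REGIONAL_CACHE:
        return "regional cache hit (cheap)"
    REGIONAL_CACHE[title] = fetch_from_core(title)
    return "core datacenter fetch (expensive)"

# A human reader mostly requests popular pages; a crawler asks for everything once.
print(serve("Main_Page"))          # regional cache hit (cheap)
print(serve("Some_obscure_stub"))  # core datacenter fetch (expensive)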
The heedlessness of ill-behaved bots has been a common complaint over the past year or so among those operating computing infrastructure for open source projects, as the Wikimedians themselves noted by pointing to our recent report on the matter.
Last month, Sourcehut, a Git-hosting service, called out overly demanding web crawlers that snarf content for AI companies. Diaspora developer Dennis Schubert, repair site iFixit, and ReadTheDocs have also objected to aggressive AI crawlers, among others.
Most websites recognize the need to provide bandwidth to serve bot requests as a cost of doing business, because these scripted visits help make online content easier to discover by indexing it for search engines.
But since ChatGPT came online and generative AI took off, bots have become more willing to stripmine entire websites for content that's used to train AI models. And these models may end up as commercial competitors, offering the aggregate knowledge they've gathered for a subscription fee or for free. Either scenario has the potential to reduce the need for the source website or for search queries that generate online ad revenue.
The Wikimedia Foundation in its 2025/2026 annual planning document, as part of its Responsible Use of Infrastructure section, cites a goal to "reduce the amount of traffic generated by scrapers by 20 percent when measured in terms of request rate, and by 30 percent in terms of bandwidth."
We want to favour human consumption
Noting that Wikipedia and its multimedia repository Wikimedia Commons are invaluable for training machine learning models, the planning document says "we have to prioritize who we serve with those resources, and we want to favour human consumption, and prioritize supporting the Wikimedia projects and contributors with our scarce resources."
How that's to be achieved, beyond the targeted interventions already undertaken by site reliability engineers to block the most egregious bots, is left to the imagination.
As abusive AI content harvesting has been an issue for some time, quite a few tools have emerged to thwart aggressive crawlers. These include data poisoning projects such as Glaze, Nightshade, and ArtShield, as well as network-based tools like Kudurru, Nepenthes, AI Labyrinth, and Anubis.
Last year, when word of the web's discontent with AI crawlers reached the major patrons of AI bots – Google, OpenAI, and Anthropic, among others – there was some effort to let website operators opt out of AI crawling through robots.txt directives.
But these instructions, stored at the root of a website so arriving crawlers can read them, are not universally deployed or respected. Nor can this optional, purely advisory protocol keep up with name changes: unless a wildcard entry covers every possibility, a bot only has to rename itself to slip past a block list entry. A common claim among those operating websites is that misbehaving bots misidentify themselves as Googlebot or some other widely tolerated crawler so they don't get blocked.
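For illustration, directives aimed at the crawlers the major AI vendors document – OpenAI's GPTBot, Anthropic's ClaudeBot, and Google-Extended, the token Google uses for AI training – would look something like the snippet below. This is a hypothetical example, not any site's actual file, and compliance is entirely voluntary.

# Hypothetical robots.txt entries asking documented AI crawlers to stay away
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# A wildcard entry also catches bots that rename themselves, but it turns away
# crawlers a site may want to keep, such as search engine indexers
User-agent: *
Disallow: /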
Wikipedia.org, for example, doesn't block AI crawlers from Google, OpenAI, or Anthropic in its robots.txt file. It blocks a number of bots deemed troublesome for their penchant for slurping whole sites, but includes no entries for the major commercial AI firms' crawlers.
The Register has asked the Wikimedia Foundation why it hasn't banned AI crawlers more comprehensively. ®