arstechnica.com

AI bots strain Wikimedia as bandwidth surges 50%

Automated AI bots seeking training data threaten Wikipedia project stability, foundation says.

Purple cartoon robots superimposed over a green library photo.

Credit: Carol Yepes and Dana Neibert via Getty Images

On Tuesday, the Wikimedia Foundation announced that relentless AI scraping is putting strain on Wikipedia's servers. Automated bots harvesting training data for large language models (LLMs) have been vacuuming up terabytes of content, driving a 50 percent increase in the foundation's bandwidth used for downloading multimedia files since January 2024. It’s a scenario familiar across the free and open source software (FOSS) community, as we've previously detailed.

The Foundation hosts not only Wikipedia but also platforms like Wikimedia Commons, which offers 144 million media files under open licenses. For decades, this content has powered everything from search results to school projects. But since early 2024, AI companies have dramatically increased automated scraping through direct crawling, APIs, and bulk downloads to feed their hungry AI models. This exponential growth in non-human traffic has imposed steep technical and financial costs—often without the attribution that helps sustain Wikimedia’s volunteer ecosystem.

The impact isn’t theoretical. The foundation says that when former US President Jimmy Carter died in December 2024, his Wikipedia page predictably drew millions of views. But the real stress came when users simultaneously streamed a 1.5-hour video of a 1980 debate from Wikimedia Commons. The surge doubled Wikimedia’s normal network traffic, temporarily maxing out several of its Internet connections. Wikimedia engineers quickly rerouted traffic to reduce congestion, but the event revealed a deeper problem: The baseline bandwidth had already been consumed largely by bots scraping media at scale.

This behavior is increasingly familiar across the FOSS world. Fedora’s Pagure repository blocked all traffic from Brazil after similar scraping incidents covered by Ars Technica. GNOME’s GitLab instance implemented proof-of-work challenges to filter excessive bot access. Read the Docs dramatically cut its bandwidth costs after blocking AI crawlers.

Wikimedia’s internal data explains why this kind of traffic is so costly for open projects. Unlike humans, who tend to view popular and frequently cached articles, bots crawl obscure and less-accessed pages, forcing Wikimedia’s core datacenters to serve them directly. Caching systems designed for predictable, human browsing behavior don’t work when bots are reading the entire archive indiscriminately.
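To see why indiscriminate crawling is so much more expensive, consider a toy simulation (not Wikimedia's actual caching stack, just an illustrative sketch with made-up parameters): an LRU cache serving skewed, human-like traffic versus a bot reading the archive uniformly.

```python
# Illustrative sketch: why a cache tuned for skewed human traffic
# fails against uniform bot crawls. Parameters are invented for
# demonstration and do not reflect Wikimedia's real infrastructure.
import random
from collections import OrderedDict

def hit_rate(requests, cache_size):
    """Simulate an LRU cache; return the fraction of requests it serves."""
    cache = OrderedDict()
    hits = 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)  # mark as most recently used
        else:
            cache[page] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)

random.seed(42)
PAGES, CACHE, N = 10_000, 500, 50_000

# Human traffic: heavily skewed toward a few popular pages (Zipf-like).
human = [min(int(random.paretovariate(1.2)), PAGES) for _ in range(N)]
# Bot traffic: indiscriminate, uniform over the whole archive.
bot = [random.randrange(PAGES) for _ in range(N)]

print(f"human hit rate: {hit_rate(human, CACHE):.0%}")
print(f"bot hit rate:   {hit_rate(bot, CACHE):.0%}")
```

With these numbers the human workload is served almost entirely from cache, while the bot workload misses most of the time, and every miss is a request the origin datacenter must answer.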

As a result, Wikimedia found that bots account for 65 percent of the most expensive requests to its core infrastructure despite making up just 35 percent of total pageviews. This asymmetry is a key technical insight: The cost of a bot request is far higher than a human one, and it adds up fast.

Crawlers that evade detection

Making the situation more difficult, many AI-focused crawlers do not play by established rules. Some ignore robots.txt directives. Others spoof browser user agents to disguise themselves as human visitors. Some even rotate through residential IP addresses to avoid blocking, tactics that have become common enough to force individual developers like Xe Iaso to adopt drastic protective measures for their code repositories.
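The politeness check these crawlers skip is simple, which is part of the problem: robots.txt is purely advisory. A minimal sketch using Python's standard-library parser (the rules shown are hypothetical) demonstrates both the check and why user-agent spoofing sidesteps it.

```python
# Minimal sketch of the robots.txt check that misbehaving scrapers skip.
# The rules below are hypothetical, not Wikimedia's actual robots.txt.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A crawler identifying itself honestly is refused...
print(rp.can_fetch("GPTBot", "/wiki/Some_Article"))       # False
# ...but the same request under a spoofed browser user agent passes,
# which is why robots.txt alone cannot stop bad actors.
print(rp.can_fetch("Mozilla/5.0", "/wiki/Some_Article"))  # True
```

Nothing enforces the honest self-identification in the first call, which is why operators fall back on behavioral detection and blocking.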

This leaves Wikimedia’s Site Reliability team in a perpetual state of defense. Every hour spent rate-limiting bots or mitigating traffic surges is time not spent supporting Wikimedia’s contributors, users, or technical improvements. And it’s not just content platforms under strain. Developer infrastructure, like Wikimedia’s code review tools and bug trackers, is also frequently hit by scrapers, further diverting attention and resources.

These problems mirror others in the AI scraping ecosystem. Curl developer Daniel Stenberg has detailed how fake, AI-generated bug reports are wasting human time. SourceHut’s Drew DeVault has highlighted how bots hammer endpoints like git logs, far beyond what human developers would ever need.

Across the Internet, open platforms are experimenting with technical solutions: proof-of-work challenges, slow-response tarpits (like Nepenthes), collaborative crawler blocklists (like "ai.robots.txt"), and commercial tools like Cloudflare's AI Labyrinth. These approaches address the technical mismatch between infrastructure designed for human readers and the industrial-scale demands of AI training.
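The proof-of-work idea can be sketched in a few lines. This is a toy version of the general technique, not the actual implementation used by GNOME or Cloudflare: the server issues a random challenge, the client must find a nonce whose hash clears a difficulty target, and verification costs the server a single hash while solving costs the client roughly 2^difficulty attempts.

```python
# Toy proof-of-work sketch: asymmetric cost makes bulk scraping expensive
# while a single human page load stays cheap. Not any project's real code.
import hashlib
import os

DIFFICULTY = 12  # leading zero bits required; real deployments tune this

def verify(challenge: bytes, nonce: int) -> bool:
    """Cheap server-side check: one SHA-256 plus a bit comparison."""
    digest = hashlib.sha256(challenge + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge: bytes) -> int:
    """Client-side brute force: ~2**DIFFICULTY attempts on average."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce

challenge = os.urandom(16)  # server issues a fresh random challenge
nonce = solve(challenge)    # client burns CPU to find a valid nonce
print(f"nonce {nonce} verified: {verify(challenge, nonce)}")
```

A browser solves one such puzzle per visit and never notices; a scraper requesting millions of pages pays the cost millions of times over.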

Open commons at risk

Wikimedia acknowledges the importance of providing "knowledge as a service," and its content is indeed freely licensed. But as the Foundation states plainly, "Our content is free, our infrastructure is not."

The organization is now focusing on systemic approaches to this issue under a new initiative: WE5: Responsible Use of Infrastructure. It raises critical questions about guiding developers toward less resource-intensive access methods and establishing sustainable boundaries while preserving openness.

The challenge lies in bridging two worlds: open knowledge repositories and commercial AI development. Many companies rely on open knowledge to train commercial models but don't contribute to the infrastructure making that knowledge accessible. This creates a technical imbalance that threatens the sustainability of community-run platforms.

Better coordination between AI developers and resource providers could potentially resolve these issues through dedicated APIs, shared infrastructure funding, or more efficient access patterns. Without such practical collaboration, the platforms that have enabled AI advancement may struggle to maintain reliable service. Wikimedia's warning is clear: Freedom of access does not mean freedom from consequences.
