
The race against AI web scrapers: effective strategies to protect your data [Q&A]

A surge in artificial intelligence (AI), generative AI (GenAI), and machine learning (ML) technologies is creating a massive online appetite for data. These tools are hungry for training data, and that appetite has fueled a boom in AI web scraping, a practice that sits in a legal gray zone. Sometimes it's legal, sometimes it's not, but what's clear is that it's having ripple effects across online businesses.

We talked to Nick Rieniets, field CTO of Kasada, to learn more about the impact of web scraping and what companies can do to protect their content.

BN: What impact do web scrapers have on businesses?

NR: The rise in AI is driving a rise in web scrapers. For example, my team has observed daily activity from three of the largest AI data scrapers: Bytespider, OpenAI, and Anthropic. Operated by ByteDance (which owns TikTok), Bytespider is allegedly used to collect data to train its LLMs. This summer, across our entire customer base, there was a 20x increase in Bytespider's attempted scraping requests compared to prior months. The reason for the increase is unknown, although you might speculate that they're attempting to collect as much as possible before some inevitable anti-scraping legislation is enacted. More recently, Meta's intentions became clear when it launched its Meta External Agent bot, which crawls websites to collect data for training its AI systems.

AI scrapers aren't just a minor inconvenience; they're hitting businesses where it hurts. They're taking in data (including intellectual property and copyrighted content) and then disseminating it without proper payment or attribution. In an attempt to manage unauthorized scraping, companies like Time magazine are striking licensing deals with AI companies, while others are suing the scrapers' operators. Scraping is also driving down web traffic to publishers: in a recent study, publishers projected they'll lose 20 percent to 40 percent of their Google-generated traffic. In an industry that relies on content for its business model and suffers from shrinking budgets, scrapers could be a death knell for some publishing businesses.

The significant financial impact of scrapers is undeniable. According to Kasada's 2024 State of Bot Mitigation Report, 37 percent of companies that experienced bot attacks over the last year lost over five percent of their revenue to web scraping.

BN: What techniques do AI scrapers use to avoid detection?

NR: AI scrapers are becoming increasingly sophisticated in evading detection. One common technique involves 'harvesting' user data (e.g. copies of real user sessions and browser data) to disguise their automated activities as legitimate traffic. This helps the scrapers bypass the many detection systems that rely on client-side device fingerprinting. Services like BrightData also enable scrapers to rotate IP addresses and user agents, further complicating detection efforts. These techniques allow AI scrapers to stay ahead of many conventional security measures.
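To make that concrete, here is a minimal sketch in Python of the rotation technique described above. Everything in it is illustrative: the user-agent strings are truncated examples and the proxy gateway is a hypothetical placeholder, not any real vendor's endpoint.

```python
import random
import requests

# Illustrative browser identities (truncated); a real scraper would
# rotate through full, harvested user-agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

# Hypothetical rotating-proxy gateway; each request exits from a new IP.
PROXIES = {
    "http": "http://user:pass@rotating-proxy.example.com:8000",
    "https": "http://user:pass@rotating-proxy.example.com:8000",
}

def fetch(url: str) -> requests.Response:
    # Every request presents a different browser identity and source IP,
    # so per-fingerprint or per-IP rate limits never see a repeat client.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=10)
```

Because every request arrives from a fresh IP with a fresh browser identity, defenses keyed to a stable fingerprint or source address never see the same client twice.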

BN: Why aren't traditional bot defenses protecting against web scraping?

NR: Traditional bot defenses are having difficulty keeping up with the latest AI data scrapers. Many of these defenses rely on spotting device fingerprints or analyzing traffic patterns, but scrapers use harvested fingerprints to blend in with real users, making them hard to detect.

Furthermore, most machine learning-based threat detection is too slow for these automated threats. Unlike with other automated threats, scraping defenses must classify online behavior as originating from either a bot or a human without any chance to observe and analyze the session's interactions first. Behavioral analysis methods, including ML, are therefore often inadequate for detecting scrapers.

Another method failing to prevent web scraping? CAPTCHAs. These puzzles are nearly ubiquitous online, yet according to Kasada's latest report, 57 percent of IT leaders expressed concern over bots’ ability to bypass CAPTCHAs.

Additionally, the barrier to entry for web scraping has been lowered by readily available specialized third-party APIs and services. Those with limited technical skills, or those who don't want to do the heavy lifting of building their own scrapers, can turn to companies that sell cost-efficient scraper APIs designed to evade detection. Alternatively, GenAI companies can simply buy data outright from turnkey services and claim they didn't have a hand in AI data scraping. Ultimately, this means the volume of scraping activity is only increasing, and defenders face an ever-growing threat as the AI hype continues.

BN: How can businesses upgrade their defenses?

NR: Businesses that want to disallow AI scrapers should start by updating their websites' robots.txt file. It's the digital equivalent of a 'no trespassing' sign, and it's relatively easy to do: add scrapers such as Bytespider or Anthropic's crawler to your site's deny list to refuse them access to your content. However, scrapers don't have to obey what's in the robots.txt file, and the legal ramifications for scrapers caught trespassing vary based on a website's terms of use policies.
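As an illustration, a robots.txt that denies the crawlers named in this article might look like the following. The user-agent tokens shown are the ones these operators have published, but they should be verified against each operator's current documentation before use.

```
# Deny AI training crawlers site-wide
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: meta-externalagent
Disallow: /
```

Remember that compliance is voluntary: a crawler that ignores these directives will still receive your content unless it is refused at the server.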

Ultimately, organizations need to invest in modern anti-scraping defense solutions that are as quick and dynamic as our adversaries. Many of the approaches that companies rely on today, like CAPTCHAs and device fingerprinting, just aren't cutting it when it comes to stopping AI scrapers.
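As a floor, and not a substitute for those dynamic defenses, a site can at least refuse requests from crawlers that ignore robots.txt but still self-identify. Here is a minimal sketch assuming a Flask application; it catches only honest bots, since scrapers that spoof browser user agents require the behavioral and server-side detection discussed earlier.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Crawlers to refuse outright. These only match bots that identify
# themselves honestly; spoofed browser user agents pass straight through.
DENIED_AGENTS = ("Bytespider", "GPTBot", "ClaudeBot", "meta-externalagent")

@app.before_request
def refuse_known_scrapers():
    ua = request.headers.get("User-Agent", "")
    if any(bot.lower() in ua.lower() for bot in DENIED_AGENTS):
        abort(403)  # refuse the request before any content is served

@app.route("/")
def index():
    return "Regular content for regular visitors."
```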

Image credit: Nmedia/Dreamstime.com
