With Robots.txt, Websites Halt AI Companies’ Web Crawlers

Most people assume that generative AI will keep getting better and better; after all, that’s been the trend so far. And it may do so. But what some people don’t realize is that generative AI models are only as good as the ginormous data sets they’re trained on, and those data sets aren’t constructed from proprietary data owned by leading AI companies like OpenAI and Anthropic. Instead, they’re made up of public data that was created by all of us—anyone who’s ever written a blog post, posted a video, commented on a Reddit thread, or done basically anything else online.

A new report from the Data Provenance Initiative, a volunteer collective of AI researchers, shines a light on what’s happening with all that data. The report, “Consent in Crisis: The Rapid Decline of the AI Data Commons,” notes that a significant number of organizations that feel threatened by generative AI are taking measures to wall off their data. IEEE Spectrum spoke with Shayne Longpre, a lead researcher with the Data Provenance Initiative, about the report and its implications for AI companies.

Shayne Longpre on:

  • How websites keep out web crawlers, and why
  • Disappearing data and what it means for AI companies
  • Synthetic data, peak data, and what happens next

The technology that websites use to keep out web crawlers isn’t new—the Robots Exclusion Protocol was introduced in 1995. Can you explain what it is and why it suddenly became so relevant in the age of generative AI?

Shayne Longpre: Robots.txt is a machine-readable file that crawlers—bots that navigate the web and record what they see—use to determine whether or not to crawl certain parts of a website. It became the de facto standard in the era when websites used it primarily for directing web search. So think of Bing or Google Search; they wanted to record this information so they could improve the experience of navigating users around the web. This was a very symbiotic relationship, because web search operates by sending traffic to websites, and websites want that. Generally speaking, most websites played well with most crawlers.
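
For readers who have never looked at one, the check a cooperative crawler performs is simple. The sketch below—a minimal illustration using Python’s standard-library urllib.robotparser, not anything from the report—shows roughly how a bot consults a site’s robots.txt before fetching a page; the site URL and crawler name are hypothetical placeholders.

    # Minimal sketch: how a well-behaved crawler might consult robots.txt
    # before fetching a page. Uses only the Python standard library.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")  # hypothetical site
    robots.read()  # download and parse the file

    page = "https://example.com/articles/some-post"
    crawler_name = "ExampleCrawler"  # hypothetical user-agent token

    if robots.can_fetch(crawler_name, page):
        print("robots.txt allows", crawler_name, "to crawl", page)
    else:
        print("robots.txt asks", crawler_name, "to stay out of", page)

Note that robots.txt is purely advisory: nothing in the protocol itself prevents a crawler from ignoring it.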

Let me next talk about a chain of claims that’s important for understanding this. General-purpose AI models and their very impressive capabilities rely on the scale of data and compute that have been used to train them. Scale and data really matter, and there are very few sources that provide public scale like the web does. So many of the foundation models were trained on [data sets composed of] crawls of the web. Underlying these popular and important data sets are essentially just websites and the crawling infrastructure used to collect, package, and process that data. Our study looks at not just the data sets but the preference signals from the underlying websites. It’s the supply chain of the data itself.

But in the last year, a lot of websites have started using robots.txt to restrict bots, especially websites that are monetized with advertising…
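
As a concrete illustration (not taken from the report), a robots.txt file that welcomes conventional search crawlers while turning away AI-training crawlers might look like the following. The user-agent tokens shown are ones the respective companies have publicly documented, but any given site’s actual list will differ.

    # Allow traditional search-engine crawlers
    User-agent: Googlebot
    Allow: /

    User-agent: Bingbot
    Allow: /

    # Ask AI-training crawlers to stay out entirely
    User-agent: GPTBot            # OpenAI
    Disallow: /

    User-agent: CCBot             # Common Crawl
    Disallow: /

    User-agent: Google-Extended   # Google's AI-training opt-out token
    Disallow: /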

Read full article: With Robots.txt, Websites Halt AI Companies’ Web Crawlers

The post “With Robots.txt, Websites Halt AI Companies’ Web Crawlers” by Eliza Strickland was published on 08/31/2024 by spectrum.ieee.org