Reddit blocks the Internet Archive from crawling its data - here's why


ZDNET's key takeaways

The Internet Archive can now only crawl Reddit's homepage.

Reddit's goal is to block AI firms from scraping Reddit user data.

Publishers (and others) are suing AI companies for copyright infringement.

Reddit is moving to protect its data from AI companies that are taking roundabout approaches to scraping its content.

The social media platform, known as a place where users can post anonymously and find information on virtually any subject, will block the Internet Archive's Wayback Machine from indexing its content, according to a Monday report from The Verge. The move responds to the discovery that AI firms, unable to scrape Reddit directly because of the platform's restrictive policies, have instead been retrieving its content from pages archived by the Internet Archive and using it to train models.

The Wayback Machine will now only be able to scrape data from Reddit's homepage, according to The Verge, while access to user profiles, comments, and post detail pages will be blocked.

Launched in 1996, the Internet Archive is a non-profit that operates an enormous digital database of web content. The archive is maintained in part by the Wayback Machine, a piece of web-crawling software that gathers web pages and preserves them as they appeared when they were collected, like digital flies in amber. The archive serves as a resource for researchers studying the evolution of online culture and as a source of digital forensic evidence for law enforcement, among other uses.

What Reddit's move means
