The internet is getting harder to archive because the AI boom has triggered a storage crisis, with both NAND flash and mechanical hard drives in short supply. Large-capacity HDDs now cost up to 3x more, as production capacity has shrunk and what remains has largely been booked out by hyperscalers. These rising prices have made it difficult for organizations across the industry to preserve data at their usual rate, as reported by 404 Media.
The Internet Archive, whose mission is to provide "universal access to all knowledge," is one of the organizations affected by this crisis. It holds around 210 petabytes of archives, with another 100 terabytes added every day to collections like the Wayback Machine. Amid the AI boom, maintaining that collection has become "a very real issue costing us time and money," founder Brewster Kahle told 404 Media.
The 28-30TB hard drives ideal for the job are either out of stock or available only at grossly inflated prices. Fortunately, the Internet Archive has active donors and a passionate community of bit-rot fighters who help alleviate some of these concerns, though only through workarounds. The organization is also trying to source drives directly from manufacturers, but they are likely busy fulfilling backorders.
The Wikimedia Foundation, the non-profit behind Wikipedia, shares similar concerns: maintaining over 65 million articles already requires careful budget allocation, and the current turbulence has only made that harder. A spokesperson told 404 Media that it sees "the primary impact in the purchase of memory and hard drives but also in terms of lead times on server deliveries and our capacity to place future orders."
Beyond the shortage, the AI boom has managed to affect archival efforts in another way that's likely not reversible: scraping. LLMs are trained on huge chunks of data often acquired from the internet, sometimes even illegally. As you'd expect, a lot of sites don't appreciate being randomly scraped to become part of some AI's learning material, so they've put up countermeasures that prevent companies from doing so.
Archiving the internet begins with that same first step: extracting information in order to preserve it. But website operators have been increasingly blocking such efforts. Bots that scrape a site only to produce a snapshot for educational purposes are now, whether intentionally or not, treated the same way as bots gathering training data for artificial intelligence.
People in the community who contribute to preservation efforts are also having to think twice about what to preserve. With hard drives so expensive now, even enthusiasts on the r/DataHoarders subreddit are doom-posting about how they've stopped archiving entirely while they wait for prices to level out. You can occasionally find deals, but seeing a large-capacity drive at MSRP has become nearly impossible.
Those are regular individuals struggling to keep up with rising costs, while the larger non-profits are still managing to scrape by (pun intended). But what about the players in the middle? The End of Term Archive, dedicated to preserving government websites between administrations, is holding onto hope that prices will settle down by the time it needs to upgrade.