Gone but Not Forgotten: Recovering the Dead Web

TL;DR: A Pew Research Center study found that 38% of webpages from a decade ago, and about 25% of pages sampled across the decade, are now inaccessible; our analysis shows that the Wayback Machine has rescued roughly 15% of those otherwise dead pages.

In 2024, the Pew Research Center published a link-rot study, “When Online Content Disappears”. They stated, “38% of webpages that existed in 2013 are no longer accessible a decade later”. They further noted, “a quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible”. This is not an isolated report that quantified the rate of loss of the online information. Numerous other link-rot studies in the last two decades have reported similar numbers or worse, depending on the context and samples. For example, Ahrefs, an SEO company, reported in the same year, “At Least 66.5% of Links to Sites in the Last 9 Years Are Dead”. In 2021, Jonathan Zittrain published an article in the Atlantic, “The Internet Is Rotting”, in which his team analyzed about 2 million external links from New York Times (NYTimes) articles and reported that 25% of deep links have rotted. They further noted that 72% of the older links from 1998 were dead. A recent longitudinal study on link-rot from the Old Dominion University (ODU), “Some URLs Are Immortal, Most Are Ephemeral”, analyzed 27.3 million URL samples from the Wayback Machine since 1996 and reported that about 65% of the sampled URLs were found dead on the live Web, when checked in 2023. Brewster Kahle, the founder of the Internet Archive, has been citing numbers from the early days of the Web and stating the average life of webpages to be anywhere from 40 to 100 days. A 2026 book, “Vanishing Culture: A Report on Our Fragile Cultural Record”, by Messarra et al. highlights underlying causes of numerous recent cultural digital losses while emphasizing the critical roles libraries and archives must play to maintain our cultural history for the future. Different studies have looked at the problem from different perspectives and contexts, hence it is often difficult to compare them side-by-side, but they all agree that an increasing number of links are rotting with the passage of time. However, some of these studies (not all) have failed to acknowledge the existence of Web archives, such as the Wayback Machine, where a portion of the dead Web might be preserved and can be used as a fallback when a reference leads to a broken link.

In this post we go through some of the link-rot studies and look at them from the perspective of the Wayback Machine to see how much of the dead Web can be rescued. Table 1 shows the status of the dead and rescued Web at a glance as sampled by a few different studies.

Study Year Period Samples Dead Rescued Pew (All) 2024 2013-2023 5.4M 26% 16% Pew (General) 2024 2013-2023 1M 27% 13% Zittrain NYT* 2021 2013-2013 88K 40% 38% ODU NYPW 2024 1996-2021 27.3M 65% 65% Table 1: Dead links from various link-rot studies rescued by the Wayback Machine.

* The NYT numbers are based on our recreated dataset.

Let us begin by looking at the study from Pew Research Center. They have generously shared their dataset with us so it was rather trivial for us (after performing some transformations and extractions, as the original dataset was stored in Parquet files) to check the URLs against the Wayback Machine to see if and when each of those were archived the first time. Their dataset contains 5.4 million unique URLs in general, news, government, and Wikipedia references categories sampled from the Common Crawl archive and Wikipedia pages. They also reported on Tweets in their post, but that dataset was not shared with us due to the restrictions posed by the usage policies.

Before we dive into our findings, below are brief descriptions of some terminologies that we will use frequently:

Alive : URLs that return 200 OK HTTP status code when resolved

: URLs that return 200 OK HTTP status code when resolved Dead : URLs that return HTTP error status codes, TCP connection errors, or DNS failures when resolved

: URLs that return HTTP error status codes, TCP connection errors, or DNS failures when resolved Preserved : URLs that are Alive on the live Web as well as present in a Web archive

... continue reading