Thus, a regular web browser loads what seems to be a normal HTML file, and all assets download only when they need to. In this way, a static HTML page can inline anything—such as gigabyte-size media files—but those will not be downloaded until necessary, even while the server sees just a single large HTML file it serves as normal. And because it is self-contained in this way, it is forwards-compatible: no future user or host of a Gwtar file needs to treat it specially, as all functionality required is old standardized web browser/server functionality.
We introduce a new format, Gwtar ( logo ; pronounced “guitar”, .gwtar.html extension), which achieves all 3 properties simultaneously. A Gwtar is a classic fully-inlined HTML file, which is then processed into a self-extracting concatenated file of an HTML + JavaScript header followed by a tarball of the original HTML and assets. The HTML header’s JS stops web browsers from loading the rest of the file, loads just the original HTML, and then hooks requests and turns them into range requests into the tarball part of the file.
Archiving HTML files faces a trilemma: it is easy to create an archival format which is any two of static (self-contained ie. all assets included, no special software or server support), a single file (when stored on disk), and efficient (lazy-loads assets only as necessary to display to a user), but no known format allows all 3 simultaneously.
There are 3 major properties we would like of an HTML archive format, beyond the basics of actually capturing a page in the first place: it should not depend in any way on the original web page, because then it is not an archive and will inevitably break; it should be easy to manage and store, so you can scalably create them and store them for the long run; and it should be efficient, which for HTML largely means that readers should be able to download only the parts they need in order to view the current page.
Linkrot is one of the biggest challenges for long-term websites. Gwern.net makes heavy use of web page archiving to solve this; and due to quality problems and long-term reliability concerns , simply linking to the Internet Archive is not enough, so I try to create & host my own web page archives of everything I link.
No current format achieves all 3. The built-in web browser save-as-HTML format achieves single and efficient, but not static; save-as-HTML-with-directory achieves static partially and efficient, but not single; MHTML, MAFF, SingleFile, & SingleFileZ (a ZIP-compressed variant) achieve static, single, but not efficiency; WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display).
An ordinary ‘save as page HTML’ browser command doesn’t work because “Web Page, HTML Only” leaves out most of a web page; even “Web Page, Complete” is inadequate because a lot of assets are dynamic and only appear when you interact with the page—especially images. If you want a static HTML archive, one which has no dependency on the original web page or domain, you have to use a tool specifically designed for this. I usually use SingleFile. SingleFile produces a static snapshot of the live web page, while making sure that lazy-loaded images are first loaded, so they are included in the snapshot.
SingleFile often produces a useful static snapshot. It also achieves another nice property: the snapshot is a single file, just a simple single .html file, which makes life so much easier in terms of organizing and hosting. Want to mirror a web page? SingleFile it, and upload the resulting single file to a convenient directory somewhere, boom—done forever. Being a single file is important on Gwern.net, where I must host so many files, and I run so many lints and checks and automated tools and track metadata etc. and where other people may rehost my archives.
However, a user of SingleFile quickly runs into a nasty drawback: snapshots can be surprisingly large. In fact, some snapshots on Gwern.net are over half a gigabyte! For example, the homepage for the research project “PaintsUndo: A Base Model of Drawing Behaviors in Digital Paintings” is 485MB after size optimization, while the raw HTML is 0.6MB. It is common for an ordinary somewhat-fancy Web 2.0 blog post like a Medium.com post to be >20MB once fully archived. This is because such web pages wind up importing a lot of fonts, JS, widgets and icons etc., all of which assets must be saved to ensure it is fully static; and then there is additional wasted space overhead due to converting assets from their original binary encoding into Base64 text which can be interleaved with the original HTML.
This is especially bad because, unlike the original web page, anyone viewing a snapshot must download the entire thing. That 500MB web page is possibly OK because a reader only downloads the images that they are looking at; but the archived version must download everything. A web browser has to download the entire page, after all, to display it properly; and there is no lazy-loading or ability to optionally load ‘other’ files—there are no other files ‘elsewhere’, that was the whole point of using SingleFile!
... continue reading