24 Apr, 2026
TLDR: instead of extracting a .tar.gz archive, we can generate a small index file which lists the size and offset of each file in the tar, and use this metadata to mount the tar blob directly via Emscripten’s WORKERFS without any copying.
For details see: https://github.com/jeroen/tar-vfs-index
The struggle with tarballs
Lots of data on the internet lives in tarballs, often distributed as gzipped .tar.gz files. To get to this data, we have to download the entire .tar.gz file, decompress it, and then iterate through the blob from beginning to end to make copies of the files we need. This is expensive and painful in memory-constrained environments.
A while ago we came up with a cool optimization for webR (the WebAssembly port of R) that lets us mount the contents of a .tar.gz archive without copying, by using a metadata file which indexes the size and offset of each file within the tar blob. This works very well and has been a big usability improvement: all R packages for webR are now distributed this way and load much faster, while still being hosted as plain old .tar.gz files on static servers.
The idea of (memory) mapping tarballs is not new, but using a format that we can plug straight into Emscripten’s virtual filesystem makes this practical for use in WebAssembly. The metadata files are simple JSON, which you could either store as static files on your server or generate on demand for any tarball.
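To make the idea concrete, here is a minimal Python sketch of such an index, built with the standard tarfile module. The field names (filename, start, end) and the helper names are assumptions chosen for illustration; the schema that webR actually uses is described in the tar-vfs-index repository linked above. Note that the offsets refer to the decompressed tar bytes, so a .tar.gz has to be gunzipped first.

```python
import io
import tarfile


def build_tar_index(tar_bytes: bytes) -> dict:
    """Collect the byte range of every regular file in an uncompressed tar blob.

    The keys "filename", "start" and "end" are illustrative assumptions,
    not necessarily the schema used by tar-vfs-index.
    """
    files = []
    # mode="r:" reads a plain (already decompressed) tar stream
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, etc.
            start = member.offset_data  # where the file's data begins in the blob
            files.append({
                "filename": "/" + member.name,
                "start": start,
                "end": start + member.size,  # exclusive end offset
            })
    return {"files": files}


def read_indexed_file(tar_bytes: bytes, index: dict, filename: str) -> memoryview:
    """Return a zero-copy view of one file's bytes inside the tar blob."""
    for entry in index["files"]:
        if entry["filename"] == filename:
            return memoryview(tar_bytes)[entry["start"]:entry["end"]]
    raise KeyError(filename)
```

Given the decompressed bytes of an archive (for example from gzip.decompress), build_tar_index(blob) returns a dictionary that can be written out with json.dumps, and read_indexed_file then slices a file out of the blob without unpacking anything. The memoryview slice mirrors the point of the technique: the file contents are never copied, only addressed by offset.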
In our case we eventually decided it makes sense to append the metadata file to the original tarball (tar allows this) and distribute everything as a single file (see below for more details).
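One way to picture appending metadata to a tarball is to add the JSON as an extra archive member at the end. The sketch below does this with Python's tarfile module in append mode. It is only an illustration of why appending is safe for the offsets, not necessarily how tar-vfs-index lays out the file, and the member name .vfs-index.json is a hypothetical choice.

```python
import io
import json
import tarfile


def append_index_member(tar_bytes: bytes, index: dict,
                        member_name: str = ".vfs-index.json") -> bytes:
    """Append the index as one more member of an uncompressed tar blob.

    The member name is a hypothetical choice for this sketch. Append mode
    only writes after the last existing member, so the byte offsets of the
    original files stay valid.
    """
    payload = json.dumps(index).encode("utf-8")
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)

    buf = io.BytesIO(tar_bytes)
    # mode="a" seeks to the end-of-archive marker and continues from there
    with tarfile.open(fileobj=buf, mode="a") as tar:
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Because the new member lands after all existing data, an index computed before the append still points at the right bytes afterwards, and the result remains an ordinary tar archive that standard tools can read. Append mode works on uncompressed tar data only, so the gzip step would come after this one.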
Emscripten’s virtual filesystem
Emscripten provides a virtual POSIX filesystem (VFS) so that file I/O from C/C++ code works in WebAssembly without modification. This is important for webR because R interacts a lot with files on disk, in particular for loading R packages.