Please note that this article focuses solely on read-side format compatibility. In other words, it establishes how tar files should be written to maximize the probability that they will be read correctly afterwards. It does not investigate which formats the listed tools can write, or whether they correctly create archives that use specific features.
This naturally raised more questions about how portable the various tar formats actually are. To find out, I decided to analyze the standards for possible incompatibility dangers and to build a suite of test inputs that could be used to check how various implementations cope with them. This article describes those points and provides test results for a number of implementations.
This article is directly inspired by my proof-of-concept work on a new binary package format for Gentoo. My original proposal used a volume label to provide a user- and file(1)-friendly way of distinguishing our binary packages. While it is a GNU tar extension, it falls within the POSIX ustar implementation-defined file format, and you would expect implementations that do not support it to extract it as a regular file. What I did not anticipate is that some implementations reject the whole archive instead.
The tar format is one of the oldest archive formats in use. It comes as no surprise that it is ugly: it was built as layers of hacks on top of older format versions to overcome their limitations. However, given the POSIX standardization in the late 1980s and the popularity of GNU tar, you would expect the interoperability problems to be mostly resolved by now.
For the purpose of the experiment, the following implementations were tested:
The large file test tarballs are compressed twice with gzip. The inner compression is gzip -1, used to reduce the file sizes from 8 GiB to 36 MiB while maintaining reasonable performance (warning: this is effectively a zip bomb!). The outer compression is gzip -9, used to reduce the file size further for the git checkout.
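Because the inner stream expands to several gigabytes, it is safer to process such a file as a stream than to unpack it to disk or into memory. The following is a minimal sketch in Python, assuming the input is available as a binary file object. The function name `stream_double_gzip` and its `max_bytes` safety limit are made up for this illustration and are not part of the test suite.

```python
import gzip
import io


def stream_double_gzip(fileobj, chunk_size=1 << 20, max_bytes=None):
    """Yield decompressed chunks from a stream that was gzip-compressed twice.

    fileobj    -- binary file object positioned at the start of the outer gzip stream
    chunk_size -- how many decompressed bytes to request per read
    max_bytes  -- optional guard: raise ValueError once the output exceeds this size
    """
    # The outer layer undoes `gzip -9`, the inner layer undoes `gzip -1`.
    with gzip.GzipFile(fileobj=fileobj, mode="rb") as outer:
        with gzip.GzipFile(fileobj=outer, mode="rb") as inner:
            total = 0
            while True:
                chunk = inner.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
                if max_bytes is not None and total > max_bytes:
                    raise ValueError("decompressed data exceeds max_bytes")
                yield chunk


if __name__ == "__main__":
    # Small stand-in for a real test input: highly compressible zero bytes.
    raw = b"\0" * 1_000_000
    blob = gzip.compress(gzip.compress(raw, compresslevel=1), compresslevel=9)
    size = sum(len(c) for c in stream_double_gzip(io.BytesIO(blob)))
    print(len(blob), "compressed bytes expand to", size, "bytes")
```

Only one chunk is held in memory at a time, so the same loop could feed a tar reader or a checksum without ever storing the full 8 GiB.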
All the test inputs are uploaded to the tar-test-inputs repository. They are mostly tarballs produced by either GNU tar or libarchive bsdtar, with a few manually hacked to achieve the desired results.
The sun tar format is the format historically used by tar on SunOS. It seems roughly equivalent to pax, except that the uppercase X type flag is used in place of the lowercase x, and that an additional member type is provided for ACLs.
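To see which member types an archive actually uses, it is enough to walk the 512-byte headers and read the one-byte type flag stored at offset 156. The sketch below is a simplified illustration rather than a full parser: the helper name `list_typeflags` is made up, and it assumes that member sizes are stored as plain octal text (no binary size encoding for very large files) and that the archive is uncompressed.

```python
import io
import tarfile

BLOCK = 512  # tar headers and member data are aligned to 512-byte blocks


def list_typeflags(data: bytes) -> list[str]:
    """Return the type flag of every header block found in a raw tar archive."""
    flags = []
    offset = 0
    while offset + BLOCK <= len(data):
        header = data[offset:offset + BLOCK]
        if header == b"\0" * BLOCK:
            break  # an all-zero block marks the end of the archive
        flags.append(header[156:157].decode("latin-1"))
        # The size field (offset 124, 12 bytes) holds NUL/space-terminated octal text.
        # Assumption: no base-256 size encoding is used.
        size_field = header[124:136].strip(b"\0 ")
        size = int(size_field.decode("ascii"), 8) if size_field else 0
        # Skip the header block plus the member data, rounded up to whole blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return flags


if __name__ == "__main__":
    # Build a tiny pax archive in memory; the extra pax record forces an
    # extended header to be written before the regular file member.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        info = tarfile.TarInfo("hello.txt")
        payload = b"hi"
        info.size = len(payload)
        info.pax_headers = {"comment": "demo"}
        tf.addfile(info, io.BytesIO(payload))
    print(list_typeflags(buf.getvalue()))
```

In a pax archive the extended header shows up as a lowercase `x` entry ahead of the member it describes. By the description above, a sun tar archive would show an uppercase `X` in the same position.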
The star format is the format historically used by the star implementation. It is derived from v7 tar in a way that is incompatible with both ustar and GNU tar. This format does not carry the ustar magic, so implementations that do not support it normally recognize it as v7 tar. It was later superseded by the ustar-compatible xstar and xustar formats.
The GNU tar format is derived from the v7 format separately from the POSIX formats. It uses the same magic and version as the pre-POSIX ustar format, and is partially compatible with it. However, whereas ustar provides a field for extending the pathname length, GNU tar includes fields for additional timestamps and some other metadata. It also uses a few additional member types to provide long pathnames and support for multi-volume archives.
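The magic and version fields are what a reader typically inspects first to decide how to interpret a header. As a rough sketch, the Python function below classifies a single 512-byte header block by the eight bytes at offset 257. The function name `classify_header` and the returned labels are chosen for this illustration only, and the check deliberately ignores everything else in the header (checksum, type flag, extended headers).

```python
import io
import tarfile

POSIX_MAGIC = b"ustar\x00" + b"00"  # magic "ustar\0" followed by version "00"
GNU_MAGIC = b"ustar  \x00"          # magic "ustar " followed by version " \0"


def classify_header(header: bytes) -> str:
    """Classify one tar header block by its magic and version fields."""
    if len(header) < 512:
        raise ValueError("a tar header block is 512 bytes long")
    magic_and_version = header[257:265]
    if magic_and_version == POSIX_MAGIC:
        return "posix"  # POSIX ustar, or pax (which builds on ustar headers)
    if magic_and_version == GNU_MAGIC:
        return "gnu"  # GNU tar (the same bytes as the pre-POSIX ustar format)
    return "no-ustar-magic"  # v7 tar, old star, or not a tar header at all


def first_header(fmt: int) -> bytes:
    """Write a one-member archive with Python's tarfile and return its first block."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tf:
        tf.addfile(tarfile.TarInfo("empty.txt"))  # zero-length regular file
    return buf.getvalue()[:512]


if __name__ == "__main__":
    for name, fmt in [("ustar", tarfile.USTAR_FORMAT), ("gnu", tarfile.GNU_FORMAT)]:
        print(name, "->", classify_header(first_header(fmt)))
```

A block without the ustar magic cannot be classified any further by this check alone, which matches the fallback behaviour described for the star format: such headers are usually treated as v7 tar.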