Hi, I’m Julien, co-founder of Plakar.
Before we built Plakar, I spent years as an engineer and later as a manager of infrastructure teams. We handled backups, compliance, and recovery.
Everywhere I went (startups, big companies, regulated industries), I saw the same routine:
tar -czf archive.tgz /some/folder
We all love that command. But in 2025, it can cause trouble.
What’s changed since .tgz was invented
Back when tar appeared in 1979, or even when gzip followed in 1992, things were simple:
Data was small, just a few megabytes.
Storage was local and trusted.
Versioning was not a big deal.
Archives were a single sequential stream, so you had to decompress everything just to get at one file.
Today, none of those assumptions hold:
Data has grown huge: terabytes of logs or model checkpoints.
We rely on multi-core machines to turn what used to take weeks of processing into minutes.
We have to assume zero trust, so we need proof that nothing was changed.
Data sits in S3 and other object stores, not on a local disk.
We need to track versions and snapshots.
We often want a single file instantly, without waiting for a full decompress.
Plain old .tgz was never made for this.
Why .tgz does not work with S3
On a traditional POSIX filesystem, many teams run periodic .tgz snapshots of local disks or NFS shares. By contrast, S3 buckets are rarely backed up (a rather short-sighted approach for mission-critical cloud data), and even one-off archives are rarely done.
If you want to archive an S3 bucket with tar and gzip, the routine looks like this (a concrete sketch follows):
Download everything to your machine (incurring transfer and local storage costs).
Run tar.
Maybe encrypt separately.
Calculate checksums by hand.
Upload the archive somewhere else.
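In practice that means chaining several tools by hand. Here is a rough sketch using the AWS CLI, gpg, and sha256sum; the bucket names and filenames are made up for illustration, and macOS users would swap sha256sum for shasum -a 256:
# 1. Pull the whole bucket down locally (egress cost plus local disk)
$ aws s3 sync s3://my-bucket ./my-bucket-copy
# 2. Archive and compress it in one sequential pass
$ tar -czf my-bucket.tgz ./my-bucket-copy
# 3. Encrypt as a separate step
$ gpg --symmetric --cipher-algo AES256 my-bucket.tgz
# 4. Record a checksum by hand
$ sha256sum my-bucket.tgz.gpg > my-bucket.tgz.gpg.sha256
# 5. Upload the result to another bucket
$ aws s3 cp my-bucket.tgz.gpg s3://my-archive-bucket/
$ aws s3 cp my-bucket.tgz.gpg.sha256 s3://my-archive-bucket/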
Then, if you need to prove integrity or restore just one file, you’re stuck. .tgz can’t help. This process is slow, error-prone, and costly. It does not scale to large datasets or S3 buckets.
What we needed instead
We realized we needed an archive that could:
remove duplicate data automatically to limit storage and transfer costs
encrypt by default to protect sensitive data
store snapshots and history
check integrity with cryptography
talk to S3 and other object stores directly
let you restore parts of an archive on demand
That led us to create Plakar for backup, its storage engine Kloset, and now .ptar, the flat-file version of Kloset.
How .ptar works
Instead of a simple byte stream, a .ptar archive is a self‑contained, content‑addressed container.
Here is what it gives you:
deduplication: identical chunks stored once, even across snapshots
built‑in encryption: no extra step
tamper evidence: any change breaks the archive
versioning: keep many snapshots easily
S3 native: one command to archive a bucket
partial restores and browsing: pick a file without unpacking it all
fast targeted restores: grab one file in seconds
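To make "content-addressed" concrete, here is a toy shell sketch of the underlying idea, assuming a hypothetical data directory to archive. It is not Plakar's actual format, just an illustration of why an identical chunk is only ever stored once:
# Toy content-addressed store: cut each file into chunks, name every chunk by its
# SHA-256 hash, and keep a chunk only when that hash has not been seen before.
mkdir -p store
find data -type f | while read -r f; do
    split -b 1048576 "$f" chunk_      # fixed 1 MiB chunks; real engines use content-defined chunking
    for c in chunk_*; do
        [ -e "$c" ] || continue       # skip if the file produced no chunks
        h=$(sha256sum "$c" | awk '{print $1}')
        if [ -e "store/$h" ]; then
            rm "$c"                   # duplicate chunk: a copy is already in the store
        else
            mv "$c" "store/$h"        # new chunk: store it once, keyed by its hash
        fi
    done
done
Run this over two copies of the same folder and each unique chunk still lands in the store exactly once; that is essentially why the duplicated Documents folder in the example below costs almost nothing extra.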
A simple example
Suppose I have 11 GB in my Documents folder and I archive two copies of it:
$ du -sh ~/Documents
 11G    /Users/julien/Documents
$ tar -czf test.tgz ~/Documents ~/Documents
Result: about 22 GB compressed.
With .ptar:
$ plakar ptar -plaintext -o test.ptar ~/Documents ~/Documents
Result: about 8 GB. Why? .ptar stores the duplicated folder only once.
In many real-world datasets, a large share of the data is redundant: multiple copies, backups, archives, or repeated files across folders. Traditional tools like tar compress everything, duplicates included, which inflates the archive for no benefit. .ptar works differently: it detects duplicates automatically and stores each unique chunk only once, no matter how many times it appears. That is why, in the example above, .ptar produces a much smaller archive than .tgz, and at scale the savings become significant.
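If you want a rough sense of how much whole-file duplication a directory holds before archiving it, one crude check is to hash every file and count repeated digests (GNU coreutils shown; on macOS, substitute shasum -a 256):
$ find ~/Documents -type f -exec sha256sum {} + | awk '{print $1}' | sort | uniq -c | sort -rn | head
Any count above 1 is a file stored more than once; chunk-level deduplication like .ptar's typically finds even more overlap than this file-level check does.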
When .tgz still makes sense
I admit, .tgz is everywhere:
It runs almost anywhere, no dependencies.
It is great for small, throwaway archives.
But when you need trust, speed, and scale, .ptar is built for 2025.
Try .ptar
Get the dev build:
$ go install github.com/PlakarKorp/[email protected]
Then:
archive a folder: $ plakar ptar -o backup.ptar ~/Documents
archive an S3 bucket: $ plakar ptar -o backup.ptar s3://my-bucket
list contents: $ plakar at backup.ptar ls
restore files: $ plakar at backup.ptar restore -to ./restore /Documents/config.yaml
inspect one file: $ plakar at backup.ptar cat snapshotid:/path/to/file
open the UI: $ plakar at backup.ptar ui
About .ptar and Plakar
.ptar is part of Plakar, our open-source backup engine for immutable, deduplicated, and encrypted data. It ships in the Plakar CLI today, and will soon be available as a standalone binary for those who only need archiving.
The code is open source, so feel free to contribute or give feedback.
.ptar and Plakar make the biggest difference on datasets with lots of redundancy, such as:
Backups with multiple versions of the same files or folders
Email, photo, or document archives containing duplicates
S3 buckets with snapshots, backups, or files shared across projects
Scientific datasets or logs where many files are identical or very similar
Training datasets for machine learning, where many files are duplicated or very similar across different versions or experiments.
Conclusion
Archiving has changed. Data is bigger, trust is lower, and we want fast access. If you still use .tgz for all of that, you are taking on risk and wasting time and money.