
The 1979 Design Choice Breaking AI Workloads


Rethinking Container Image Distribution to Eliminate Cold Starts

At Cerebrium, we frequently work with teams building latency-sensitive AI systems: voice agents, real-time video avatars, and other interactive AI applications. Many of them arrive after running into the same issue in their own infrastructure: containers that take far too long to start.

The pattern is familiar. A new model version ships, traffic spikes, and the autoscaler spins up new GPU nodes. The cluster has capacity and the workload schedules correctly. But the container still takes seconds, sometimes minutes, to become ready.

The bottleneck is almost always the container image pull time. Modern ML containers routinely exceed 10GB once you include model weights, CUDA libraries, Python dependencies, and serving code. With today’s container image format, every byte must be downloaded and unpacked before the process can begin.

For applications like voice systems, that delay is unacceptable. If a model cannot start quickly enough, the user experiences silence or lag and the interaction often fails.

This is one of the most common cold start problems we see in ML infrastructure. There are many strategies for achieving low-latency ML inference (model optimization, batching strategies, hardware selection, orchestration), but before any of that matters, your container has to actually be running, and right now the image pull is often the biggest bottleneck in that chain.

So let's start with how we got here: how a tool designed for magnetic tape in 1979 ended up at the center of modern ML infrastructure, and why it's choking on your 11GB image.

A format designed for magnetic tape

The tar utility (short for "tape archive") was written in 1979 for Unix V7 at Bell Labs. Its job was straightforward: concatenate files into a single sequential stream that could be written to magnetic tape. No index, no random access, no seek support. You wrote files to the tape in order, and you read them back in order. That was the whole point: tape heads move in one direction.
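
That sequential contract is still visible in today's tooling. A minimal sketch using Python's standard `tarfile` module (file names and contents here are made up for illustration): in pure streaming mode (`"r|"`), entries can only be visited in the order they were written, exactly as a tape head would encounter them.

```python
import io
import tarfile

# Build a small tar archive in memory with three files.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [("a.txt", b"alpha"), ("b.txt", b"beta"), ("c.txt", b"gamma")]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Read it back in streaming mode ("r|"): no index exists, so members
# arrive strictly in write order, one after the other.
buf.seek(0)
names = []
with tarfile.open(fileobj=buf, mode="r|") as tar:
    for member in tar:
        names.append(member.name)

print(names)  # ['a.txt', 'b.txt', 'c.txt']
```

To extract only `c.txt`, a streaming reader still has to walk past `a.txt` and `b.txt` first; there is no table of contents to jump from.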

In 1992, the GNU project released gzip as a free replacement for Unix compress (which relied on a patented LZW algorithm). gzip wraps the DEFLATE compression algorithm into a streaming format. Like tar, it's sequential: you compress from the beginning of the file and decompress from the beginning of the file. There's no way to jump into the middle of a gzip archive and start decompressing from an arbitrary offset.
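
The practical consequence is easy to demonstrate with the standard library. In this sketch (the helper `read_at` and its chunk size are illustrative, not a real API), reaching any offset in a gzip stream means decompressing every byte before it, because DEFLATE back-references and the evolving decoder state anchor everything to the start of the stream.

```python
import gzip
import zlib

# Compress 1 MiB of data as a single gzip stream.
data = bytes(range(256)) * 4096
blob = gzip.compress(data)

def read_at(blob: bytes, offset: int, length: int) -> bytes:
    """Return `length` uncompressed bytes at `offset` of a gzip blob.

    There is no shortcut: we must feed the compressed stream from
    byte 0 and decompress until we have produced `offset + length`
    bytes of output.
    """
    d = zlib.decompressobj(wbits=31)  # wbits=31 -> expect gzip framing
    out = b""
    for i in range(0, len(blob), 4096):
        out += d.decompress(blob[i:i + 4096])
        if len(out) >= offset + length:
            break
    return out[offset:offset + length]

# Reading the *last* 16 bytes still decompresses the whole megabyte.
assert read_at(blob, len(data) - 16, 16) == data[-16:]
```

For a `.tar.gz` container layer, this compounds with tar's own sequential layout: even if you knew where a file sat inside the tar stream, gzip gives you no way to start decompressing there.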
