If you run AI models in production, you have a relationship with cold starts whether you want one or not.
A three-minute startup time changes how you scale. You keep GPUs warm that could have been released. You over-provision to avoid making users wait. You stretch cooldown periods because scaling down too quickly creates pain on the next spike. The application starts accumulating complexity around one problem: getting a model ready to serve traffic fast enough.
At Cerebrium, we have been obsessed with the cold start problem since day one. That obsession has pushed us to rethink almost every layer of our infrastructure.
As more companies move large custom AI models into production, they hit the same wall. Our customers run large language models, real-time avatars, transcription models, diffusion models, and other GPU-heavy workloads where startup time can vary from a few seconds to more than five minutes.
Most of that time is spent on work that gets a container ready to serve requests: importing libraries, loading model weights, initializing CUDA, compiling kernels, and warming up the runtime. That is the core problem checkpointing solves. Instead of rebuilding the same runtime from scratch each time a new container starts, we snapshot the fully initialized container - including CPU memory, GPU memory, process state, model weights, and compiled kernels - and restore it directly into a new container in a fraction of the time.
For some workloads, this reduces cold start time by more than 80%!
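To make that figure concrete, here is a back-of-the-envelope sketch in Python. The three-minute baseline comes from the scenario described earlier; the traffic ramp, the per-replica throughput, and the function names are illustrative assumptions, not Cerebrium measurements. It models only one effect: while a new replica is starting, traffic keeps growing, so the warm headroom you hold has to cover that window.

```python
import math


def restored_start_s(cold_start_s: float, reduction: float) -> float:
    """Startup time after cutting cold start by `reduction` (0.8 means 80%)."""
    return cold_start_s * (1.0 - reduction)


def warm_headroom(ramp_rps_per_s: float, start_s: float, rps_per_replica: float) -> int:
    """Spare replicas needed to absorb traffic growth while a new replica starts.

    Hypothetical linear-ramp model: traffic grows by `ramp_rps_per_s` every second.
    """
    extra_rps = ramp_rps_per_s * start_s
    return math.ceil(extra_rps / rps_per_replica)


baseline_s = 180.0  # the three-minute startup used as the running example
restored_s = restored_start_s(baseline_s, 0.80)

# Assumed workload: traffic ramps by 0.5 req/s each second, one replica serves 10 req/s.
before = warm_headroom(0.5, baseline_s, 10.0)
after = warm_headroom(0.5, restored_s, 10.0)
print(f"startup {baseline_s:.0f}s -> {restored_s:.0f}s, warm headroom {before} -> {after} replicas")
```

Under these assumed numbers, the warm headroom drops from 9 replicas to 2. The exact values matter less than the shape of the relationship: in this simple model, the capacity you keep idle scales roughly linearly with startup time.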
This post explains how we built CPU and GPU memory checkpointing at Cerebrium, how it works inside our highly customized gVisor-based runtime, and what it took to make real CUDA workloads like vLLM restore reliably and quickly.
Where do the minutes actually go?
It's tempting to think of cold starts as simply “pulling the image”: downloading the application image onto the machine that will run the container. But for AI workloads, that is only the first part of getting a model ready to serve traffic, and it is no longer the bottleneck. We have already solved the container download problem. The real cost in a CPU or GPU container is everything that happens after the image is on the machine and the application starts initializing.
That initialization path includes importing Python modules, loading PyTorch, assembling model weights, copying them onto the GPU, and running the framework’s warmup path - torch.compile, CUDA graph capture, KV cache initialization, and whatever else the serving stack needs before it can take traffic.
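One way to see where the time goes in your own service is to time each phase of the initialization path separately. The sketch below is a minimal harness using only the Python standard library. The phase names loosely mirror the steps listed above, but the phase bodies are placeholder sleeps standing in for real work, and `PhaseTimer` is a hypothetical helper, not part of any serving framework or of Cerebrium's runtime.

```python
import time
from contextlib import contextmanager


class PhaseTimer:
    """Records the wall-clock duration of named startup phases, in order."""

    def __init__(self) -> None:
        self.durations: dict[str, float] = {}

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the phase raises.
            self.durations[name] = time.perf_counter() - start

    def report(self) -> str:
        total = sum(self.durations.values())
        lines = [
            f"{name:<16}{secs:8.3f}s  {secs / total:6.1%}"
            for name, secs in self.durations.items()
        ]
        lines.append(f"{'total':<16}{total:8.3f}s")
        return "\n".join(lines)


timer = PhaseTimer()

# Placeholder sleeps: replace each body with the real step in your own service.
with timer.phase("import modules"):
    time.sleep(0.01)  # stands in for importing the framework and handlers
with timer.phase("load weights"):
    time.sleep(0.03)  # stands in for reading weights into CPU memory
with timer.phase("copy to GPU"):
    time.sleep(0.02)  # stands in for moving tensors onto the device
with timer.phase("warmup"):
    time.sleep(0.02)  # stands in for compilation, graph capture, cache setup

print(timer.report())
```

A per-phase breakdown like this shows which steps dominate startup for a given workload, and therefore how much of the work a snapshot of the fully initialized container would let a new container skip.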