We are in the age of inference. Billion- to trillion-parameter neural networks are run on specialized accelerators at quadrillions of operations per second to generate media, author software, and fold proteins at massive scale.
Inference workloads are more variable and less predictable than the training workloads that previously dominated. That makes them a natural fit for serverless computing, where applications are defined at a level above the (virtual) machine so that they can be more readily scaled up and down to handle variable load.
But serverless computing only works if new replicas can be spun up quickly — as fast as demand changes, which can be at the scale of seconds. Naïvely spinning up a new instance of, say, SGLang serving a billion-parameter LLM on a B200 can take tens of minutes or stall for hours on GPU availability.
At Modal, we’ve done deep engineering work over the last five years to solve this problem. In this blog post, we walk through what we did.
There are four key ingredients:
Cloud buffers: maintain a small buffer of healthy, idle GPUs to take on new load
Custom filesystem: serve container images lazily out of a content-addressed, multi-tier cloud-native cache
Checkpoint/restore: fast-forward through CPU-side initialization by directly restoring processes into memory
CUDA checkpoint/restore: fast-forward through GPU-side initialization by directly restoring CUDA contexts into memory
Together, they take AI inference server replica scaling from multiple kiloseconds to just tens of seconds.
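To make the custom filesystem ingredient a little more concrete, here is a minimal Python sketch of a lazily loaded, content-addressed, multi-tier cache. It is an illustration under our own assumptions, not Modal's implementation: the names `content_address`, `TieredBlockCache`, and `LazyImage` are hypothetical, plain dictionaries stand in for memory, disk, and network tiers, and each file is a list of block hashes.

```python
import hashlib
from typing import Callable, Dict, List


def content_address(data: bytes) -> str:
    """Name a block by the SHA-256 hash of its bytes."""
    return hashlib.sha256(data).hexdigest()


class TieredBlockCache:
    """Look up blocks by hash, fastest tier first (hypothetical sketch)."""

    def __init__(self, tiers: List[Dict[str, bytes]],
                 origin: Callable[[str], bytes]):
        self.tiers = tiers        # ordered fastest to slowest; dicts stand in for real tiers
        self.origin = origin      # slowest source of truth, e.g. object storage
        self.origin_fetches = 0   # counts how often we fall through every tier

    def get(self, digest: str) -> bytes:
        for level, tier in enumerate(self.tiers):
            if digest in tier:
                data = tier[digest]
                break
        else:
            # Missed every tier: fetch from the origin and verify the bytes.
            level = len(self.tiers)
            data = self.origin(digest)
            self.origin_fetches += 1
            if content_address(data) != digest:
                raise ValueError("block does not match its content address")
        # Copy the block into every tier faster than the one that served it.
        for tier in self.tiers[:level]:
            tier[digest] = data
        return data


class LazyImage:
    """A manifest maps file paths to block hashes; bytes load on first read."""

    def __init__(self, manifest: Dict[str, List[str]], cache: TieredBlockCache):
        self.manifest = manifest
        self.cache = cache

    def read(self, path: str) -> bytes:
        return b"".join(self.cache.get(d) for d in self.manifest[path])
```

Constructing a `LazyImage` moves no data, so a container in this sketch could start before its image has been downloaded, paying only for the blocks it actually reads. Because blocks are named by their hash, two images that share a block share a single cache entry.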