
Corrosion


Image by Annie Ruygt

Fly.io transmogrifies Docker containers into Fly Machines: micro-VMs running on our own hardware all over the world. The hardest part of running this platform isn’t managing the servers, and it isn’t operating the network; it’s gluing those two things together.

Several times a second, as customer CI/CD pipelines bring up or tear down Fly Machines, our state synchronization system blasts updates across our internal mesh, so that edge proxies from Tokyo to Amsterdam can maintain the accurate routing tables that let them route requests for an application to the nearest customer instances.

On September 1, 2024, at 3:30PM EST, a new Fly Machine came up with a new “virtual service” configuration option a developer had just shipped. Within a few seconds every proxy in our fleet had locked up hard. It was the worst outage we’ve experienced: a period during which no end-user requests could reach our customer apps at all.

Distributed systems are blast amplifiers. By propagating data across a network, they also propagate bugs in the systems that depend on that data. In the case of Corrosion, our state distribution system, those bugs propagate quickly. The proxy code that handled that Corrosion update had succumbed to a notorious Rust concurrency footgun: an if let expression over an RwLock assumed (reasonably, but incorrectly) in its else branch that the lock had been released. The result: an instant and virulently contagious deadlock.
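A minimal sketch of that footgun (the names and data types here are illustrative, not Fly.io's actual proxy code): before the Rust 2024 edition, the temporary read guard created in an if let scrutinee lives until the end of the entire expression, including the else branch, so taking the write lock there deadlocks against your own read lock.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Buggy shape (do not run): under pre-2024 editions, the temporary read
// guard from `map.read()` is still alive in the `else` branch, so the
// `map.write()` call below waits on a lock this same thread holds.
//
// fn get_or_insert_buggy(map: &RwLock<HashMap<String, u32>>, key: &str) -> u32 {
//     if let Some(v) = map.read().unwrap().get(key) {
//         *v
//     } else {
//         map.write().unwrap().insert(key.to_string(), 0); // deadlock!
//         0
//     }
// }

// Fix: copy the value out in its own statement, so the read guard is
// dropped at the semicolon, before we ever reach the `else` arm.
fn get_or_insert(map: &RwLock<HashMap<String, u32>>, key: &str) -> u32 {
    let found = map.read().unwrap().get(key).copied(); // guard dropped here
    if let Some(v) = found {
        v
    } else {
        map.write().unwrap().insert(key.to_string(), 0);
        0
    }
}
```

The fix is boring on purpose: binding the lookup result to a local ends the guard's lifetime at a visible point, instead of leaving it to the edition-dependent temporary-scope rules of if let.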

A lesson we’ve learned the hard way: never trust a distributed system without an interesting failure story. If a distributed system hasn’t ruined a weekend or kept you up overnight, you don’t understand it yet. That’s why we’re introducing Corrosion this way: it’s an unconventional service discovery system we built for our platform and open sourced.

Our Face-Seeking Rake

State synchronization is the hardest problem in running a platform like ours. So why build a risky new distributed system for it? Because no matter what we try, that rake is waiting for our foot. The reason is our orchestration model.

Virtually every mainstream orchestration system (including Kubernetes) relies on a centralized database to make decisions about where to place new workloads. Individual servers keep track of what they’re running, but that central database is the source of truth. At Fly.io, in order to scale across dozens of regions globally, we flip that notion on its head: individual servers are the source of truth for their workloads.

In our platform, our central API bids out work to what is in effect a global market of competing “worker” physical servers. By moving the authoritative source of information from a central scheduler to individual servers, we scale out without bottlenecking on a database that demands both responsiveness and consistency between São Paulo, Virginia, and Sydney.
