
Hypergrowth isn't always easy


A recent Reddit thread noted that Tailscale's uptime has been, uh, shakier than usual in the last month or so, which included the holiday season. I can't deny it. We believe in transparency, so we have our uptime history available on our status page that will confirm it for you.

We're committed to that kind of visibility, which is why we maintain a public uptime history in the first place. But one challenge of visibility is that it can leave our status updates open to a wide range of interpretations and assumptions. When we say "coordination server performance issues," is that an outage, or is it just slow? Does it affect everyone or just some people? If Tailscale's coordination service is down, does that mean my connections are broken? And when you say "coordination server" ... wait ... surely you run more than one server?

Great questions, and the answers are all kind of tied together. Let's go through them. We don't get enough chances to talk about our system architecture, anyway.

First of all, the history section of the status page actually has more detail than it seems at a glance. Despite the lack of visual affordances, you can click on each incident to get more details. For example, this incident from Jan 5:

Looks like whatever happened took 24 minutes, and affected a small number of tailnets, but it still caused increased latency and prevented some people from carrying out actions. That’s disruptive, and we’re sorry. If you’re wondering why there wasn’t an advance notification, here’s the context. We detected an internal issue early, before it caused user-visible impact, and intervened to repair it. Part of that repair required briefly taking a shard offline, which created a short period of customer impact.

Part of engineering is measuring, writing down what went wrong, and making a list of improvements so it doesn’t go wrong next time. Continuous improvement, basically.

To be clear: this was an outage, and we’re not trying to downplay it. The difference here is in the shape of the failure. Thanks to many person-years of work, it was planned rather than accidental, limited to a small number of tailnets, and for most other tailnets it showed up primarily as increased latency rather than broader unavailability. We also resolved it faster than similar incidents in the past. Continuous improvement means measuring blast radius, severity, and time to recovery, and steadily improving them, even as we continue to scale.

We probably should stop referring to a "coordination server" and start calling it a "coordination service." Once upon a time, it was indeed just one big server in the sky. True story: that one big server in the sky hit over a million simultaneously connected nodes before we finally succeeded in sharding it, or spreading the load across multiple servers. As computer science students quickly learn, there are only three numbers: 0, 1, and more than 1. No servers running, one big server, or lots of servers. So now we have lots of servers.

But, unlike many products where each stateless server instance can serve any customer, on Tailscale, every tailnet still sits on exactly one coordination server at any given moment (but can live migrate from one to another). That's because, as we realized maybe five years into the game, a coordination server is not really a server in the classic sense. It's a message bus. And the thing about message buses is that they are annoyingly hard to scale without making them orders of magnitude slower.
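To make that picture concrete, here's a minimal sketch in Go (with entirely made-up names, not Tailscale's actual code) of the invariant described above: every tailnet maps to exactly one coordination shard at any moment, and a live migration is just an atomic change to that mapping, with all the hard drain-and-handoff work waved away.

```go
// Hypothetical sketch, not Tailscale's real implementation: each tailnet
// is owned by exactly one coordination shard at a time, and "live
// migration" amounts to atomically changing that ownership.
package main

import (
	"fmt"
	"sync"
)

// ShardID identifies one coordination server instance.
type ShardID int

// ShardMap records which shard currently owns each tailnet.
type ShardMap struct {
	mu     sync.Mutex
	owners map[string]ShardID // tailnet name -> owning shard
}

func NewShardMap() *ShardMap {
	return &ShardMap{owners: make(map[string]ShardID)}
}

// Owner reports which shard a tailnet's control traffic should go to.
func (m *ShardMap) Owner(tailnet string) (ShardID, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	id, ok := m.owners[tailnet]
	return id, ok
}

// Migrate moves a tailnet to a new shard. A real system would coordinate
// a drain and handoff here; this sketch just swaps the mapping.
func (m *ShardMap) Migrate(tailnet string, to ShardID) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.owners[tailnet] = to
}

func main() {
	m := NewShardMap()
	m.Migrate("example-corp", 3) // initial placement
	m.Migrate("example-corp", 7) // live migration to another shard
	if id, ok := m.Owner("example-corp"); ok {
		fmt.Printf("example-corp is now served by shard %d\n", id)
	}
}
```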

That thing in Tailscale where you change your ACLs, and they're reflected everywhere on your tailnet, no matter how many nodes you have, usually in less than a second? That's a message bus that was designed for speed. Compared to classic firewalls that need several minutes and a reboot to (hopefully) change settings, it's pretty freakin' awesome. But that high-speed centralized (per tailnet, anyway) message bus design has consequences. One of those consequences is that when the bus eventually has any amount of downtime, no control plane messages get passed for the nodes connected to that instance.
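To illustrate what "message bus" means here, a rough sketch (again in Go, with hypothetical names, nothing to do with Tailscale's real code): each node holds a long-lived subscription to its tailnet's bus, and one published change, like an ACL edit, fans out to every subscriber right away. And while the bus is down, nothing gets broadcast at all.

```go
// Hypothetical sketch of a per-tailnet message bus: one published control
// plane change (say, an ACL edit) fans out to every connected node's
// channel. Names and structure are illustrative only.
package main

import (
	"fmt"
	"sync"
)

// Bus is a toy per-tailnet message bus.
type Bus struct {
	mu          sync.Mutex
	subscribers map[string]chan string // node name -> update channel
}

func NewBus() *Bus {
	return &Bus{subscribers: make(map[string]chan string)}
}

// Subscribe registers a node for control-plane updates.
func (b *Bus) Subscribe(node string) <-chan string {
	ch := make(chan string, 8)
	b.mu.Lock()
	b.subscribers[node] = ch
	b.mu.Unlock()
	return ch
}

// Publish fans a change out to every connected node. If the bus is down,
// this simply never runs, and no control-plane messages are delivered
// until it comes back.
func (b *Bus) Publish(update string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subscribers {
		ch <- update
	}
}

func main() {
	bus := NewBus()
	a := bus.Subscribe("node-a")
	c := bus.Subscribe("node-c")

	bus.Publish("ACL updated: allow dev -> staging")

	fmt.Println("node-a got:", <-a)
	fmt.Println("node-c got:", <-c)
}
```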
