column "It's always DNS" is a long-standing sysadmin saw, and with good reason: a disproportionate number of outages are at their heart DNS issues. And so today, as AWS is still repairing its downed cloud as this article goes to press, it becomes clear that the culprit is once again DNS. But if you or I know this, AWS certainly does.
And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who've been to this dance before gone? And the answer increasingly is that they've left the building — taking decades of hard-won institutional knowledge about how AWS's systems work at scale right along with them.
What happened?
AWS reports that on October 20, at 12:11 AM PDT, it began investigating “increased error rates and latencies for multiple AWS services in the US-EAST-1 Region.” About an hour later, at 1:26 AM, the company confirmed “significant error rates for requests made to the DynamoDB endpoint” in that region. By 2:01 AM, engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause, which led to cascading failures for most other things in that region. DynamoDB is a "foundational service" upon which a whole mess of other AWS services rely, so the blast radius for an outage touching this thing can be huge.
As a result, much of the internet stopped working: banking, gaming, social media, government services, buying things I don't need on Amazon.com itself, etc.
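To make the failure mode concrete, here's a minimal sketch (my illustration, not AWS's tooling or their actual diagnostics) of what "DNS resolution of the DynamoDB API endpoint" means from a client's point of view: when the name below stops resolving, every call aimed at that endpoint fails before a single request reaches DynamoDB itself, which is why one record can have such a large blast radius.

```python
# Minimal sketch, assuming only the public endpoint name: does the DynamoDB
# regional endpoint resolve at all? If it doesn't, every client and every
# dependent AWS service pointed at it fails before reaching the service.
import socket

ENDPOINT = "dynamodb.us-east-1.amazonaws.com"  # the affected regional endpoint

try:
    # Roughly equivalent to `dig +short dynamodb.us-east-1.amazonaws.com`.
    addresses = sorted({info[4][0] for info in socket.getaddrinfo(ENDPOINT, 443)})
    print(f"{ENDPOINT} resolves to: {addresses}")
except socket.gaierror as exc:
    # During the impact window, callers were effectively stuck here.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```

Nothing exotic is happening there; the point is simply that name resolution sits in front of every API call, so when it breaks, everything built on that endpoint breaks with it.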
As is their tradition when outages strike, AWS has provided increasing levels of detail as new information has come to light. Reading through it, one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow. To be clear: I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
Note that for those 75 minutes, visitors to the AWS status page (reasonably wondering why their websites and other workloads had just burned down and crashed into the sea) were met with an "all is well!" default response. Ah well, it's not as if AWS had previously called out slow outage notification times as an area for improvement. Multiple times even. We can keep doing this if you'd like.
The prophecy
AWS is very, very good at infrastructure. You can tell this is a true statement by the fact that a single one of their 38 regions going down (albeit a very important region!) draws this kind of attention, as opposed to it being "just another Monday outage." At AWS's scale, all of their issues are complex; this isn't going to be a simple issue that someone should have caught, precisely because they hit the simple issues years ago and ironed out those kinks in their resilience story.
Once you reach a certain point of scale, there are no simple problems left. What's more concerning to me is the way it seems AWS has been flailing all day trying to run this one to ground. Suddenly, I'm reminded of something I had tried very hard to forget.