


More Than DNS: The 14-hour AWS us-east-1 outage


[Photo: I’m on the right, biting a nail nervously. We’re in an Italian hotel because this happened on day 1 of our offsite.]

On Monday the AWS us-east-1 region had its worst outage in over 10 years. The whole thing lasted over 16 hours and affected 140 AWS services, including, critically, EC2. SLAs were blown, and an eight-figure revenue reduction will follow. Before Monday, I’d spent around 7 years in industry and never personally had production nuked by a public cloud outage. I generally regarded AWS’s reliability as excellent and industry-leading.

What the hell happened?

A number of smart engineers have come to this major bust-up and covered it with the blanket of a simple explanation: brain drain; race condition; it’s always DNS; the cloud is unreliable, go on-prem. You’re not going to understand software reliability if you summarize an outage of this scale in an internet comment. Frankly, I’m not even going to understand it after reading AWS’s 4,000-word summary and thinking about it for hours. But I’m going to hold the hot takes and try.

I wrote Modal’s internal us-east-1 incident postmortem before AWS published their “service disruption summary”: https://aws.amazon.com/message/101925. Because our control plane is in us-east-1, we got hit hard. Along with hundreds of other affected companies, we’re interested in a peek under the hood of the IaaS we depend on.

Arriving a few days after the outage, this public summary is a small window into the inner workings of the most experienced hyperscaler engineering operation in the world. I’ll analyze each of the three outage phases, call out key features, and then try, with limited information, to derive a lesson or two from this giant outage. Before proceeding, I recommend reading the summary carefully.

Out of one service outage, one hundred and forty service outages are born

How did a DynamoDB service failure at 6:48 AM UTC on October 20th become a 140-service failure epidemic?
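Before walking through AWS’s timeline, it helps to have a mental model of the fan-out itself. The sketch below is purely illustrative and is not AWS’s real service graph: the service names and edges are hypothetical, and the point is only that when a foundational dependency fails, everything that transitively depends on it fails with it.

from collections import defaultdict

# Toy dependency graph (service -> services it depends on).
# Names and edges are hypothetical illustrations, not AWS's internal topology.
DEPENDS_ON = {
    "dynamodb": [],
    "ec2": ["dynamodb"],
    "lambda": ["dynamodb", "ec2"],
    "ecs": ["ec2"],
    "analytics-app": ["lambda", "ecs"],
}

def blast_radius(failed, depends_on):
    """Return every service that transitively depends on the failed one."""
    # Invert the graph: dependency -> direct dependents.
    dependents = defaultdict(set)
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(svc)

    # Breadth-first walk outward from the failed service.
    hit, frontier = {failed}, [failed]
    while frontier:
        nxt = []
        for svc in frontier:
            for dependent in dependents[svc] - hit:
                hit.add(dependent)
                nxt.append(dependent)
        frontier = nxt
    return hit - {failed}

print(sorted(blast_radius("dynamodb", DEPENDS_ON)))
# -> ['analytics-app', 'ec2', 'ecs', 'lambda']

In the real incident the edges are far messier (DNS, control planes, internal tooling), which is exactly why the phase-by-phase summary is worth reading closely.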
