
How when AWS was down, we were not


For help understanding this article, or how you can implement auth and similar security architectures in your services, feel free to reach out to me via the community server.

One of the most massive AWS incidents transpired on October 20th. Long story short: the DNS for DynamoDB was impacted in us-east-1, which created a health event for the entire region. It's the worst incident we've seen in a decade. Disney+, Lyft, McDonald's, the New York Times, Reddit, and plenty of others were lining up to claim their share of the spotlight. And we were watching closely, because our product is part of our customers' critical infrastructure. This one graph of the event says it all:

The AWS post-incident report indicates that at 7:48 PM UTC DynamoDB had "increased error rates". But this article isn't about AWS; instead, I want to share exactly how we were still up when AWS was down.

Now you might be thinking: why are you running infra in us-east-1?

And it's true: almost no one should be using us-east-1, unless, well, of course, you are us. That's because we end up running our infrastructure wherever our customers are. In theory, practice and theory are the same, but in practice they differ. And if our (or your) customers chose us-east-1 in AWS, then realistically, that means we (or you) are also choosing us-east-1 😅.

During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there. And even without a direct dependency on us-east-1, there are critical services in AWS, including CloudFront, Certificate Manager, Lambda@Edge, and IAM, that all have their control planes in that region. Attempting to create distributions or roles at that time was also met with significant issues.

Since there are plenty of articles in the wild talking about what actually happened, why it happened, and why it will continue to happen, I don't need to go into that here. Instead, I'm going to dive into exactly what we've built to avoid these issues, and what you can do for your applications and platforms as well. In this article, I'll review how we maintain a high SLI to match our SLA reliability commitment even when the infrastructure and services we use don't.

Before I get to the part where I share how we built one of the most reliable auth solutions available, I want to define reliability. For us, that's an SLA of five nines. I think that's so extraordinary that the question I want you to keep in mind throughout this article is: is that actually possible? Is it really achievable to have a service with a five-nines SLA? When I say five nines, I mean that 99.999% of the time, our service is up and running as expected by our customers. To put this into perspective, the red, in the sea of blue, represents just how much time we can be down.

And if you can't see it, it's hiding inside this black dot. It amounts to just five minutes and 15 seconds per year. This pretty much means we have to be up all the time, providing responses and functionality exactly as our customers expect.
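If you want to check that math yourself, here is a minimal sketch (mine, not anything from Authress's tooling) that converts an availability target into an annual downtime budget; the function and constant names are purely illustrative:

```typescript
// Illustrative sketch: convert an availability target (e.g. five nines)
// into the amount of downtime it permits per year.
// These names are hypothetical and not part of any Authress API.

const MINUTES_PER_YEAR = 365.25 * 24 * 60; // ~525,960 minutes

function downtimeBudgetMinutes(availability: number): number {
  return (1 - availability) * MINUTES_PER_YEAR;
}

console.log(downtimeBudgetMinutes(0.999).toFixed(1));   // three nines => ~526.0 minutes/year
console.log(downtimeBudgetMinutes(0.9999).toFixed(1));  // four nines  => ~52.6 minutes/year
console.log(downtimeBudgetMinutes(0.99999).toFixed(2)); // five nines  => ~5.26 minutes/year
```

That 5.26 minutes is roughly the five minutes and 15 seconds mentioned above, spread across an entire year.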

To put that into context, it's worth sharing for a moment the specific challenges that we face, why we built what we built, and of course why that's relevant. To do that, I need to include some details about what we're building, namely what Authress actually does. Authress provides login and access control for the software applications that you write: it generates JWTs for your applications. This means:

... continue reading