
Bluesky April 2026 Outage Post-Mortem

Why This Matters

The Bluesky outage shows how much service reliability depends on observability. Overlooked issues such as port exhaustion, combined with gaps in monitoring, can cause significant user-facing downtime. Finding and fixing these weaknesses is essential to keeping a social media platform resilient and its user experience consistent.

Key Takeaways

Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week that caused Bluesky to go down intermittently for ~1/2 our users for about 8 hours.

First, I'd like to apologize to our users for the interruption in service. This is easily the worst outage we've seen in my time here. It's just not acceptable.

Second, if you find this work interesting, we're hiring!

The Problem

The issue actually started earlier that weekend. Here's the Bluesky AppView's requests chart for the days leading up to the really bad day (Monday):

The yellow/green isn't important, but those dips are super nasty! They represent real user-facing downtime. Ouch!

We got a page on Saturday April 4. I took a look, thinking it was likely a transit issue. We have pretty extensive network monitoring, and it all looked clear.

I did, however, notice a spike in log lines like this in our AppView data backend (called the "data plane"):

{"time":"2026-04-03T22:16:07.944910324Z","level":"ERROR","msg":"failed to set post cache item","uri":"at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m","error":"dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"}

The timing of these log spikes lined up with the drops in user-facing traffic, which makes sense. Our data plane relies heavily on memcached to keep load off our main Scylla database, so if we're running out of local ports for new memcached connections, that's a huge problem.
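To make the "log spikes line up with traffic drops" check concrete, here is a minimal sketch of how one might bucket these errors by minute so they can be compared against a requests chart. It is not Bluesky's tooling. It assumes the logs are one JSON object per line with "time" and "error" fields, as in the sample above, and the function name count_bind_errors_per_minute is made up for illustration.

```python
import json
from collections import Counter

# Substring copied from the "error" field of the sample log line.
PORT_EXHAUSTION_MARKER = "bind: address already in use"


def count_bind_errors_per_minute(lines):
    """Count entries whose "error" field contains the marker, keyed by minute.

    Hypothetical helper: assumes one JSON object per line with "time" and
    "error" string fields, matching the sample log line in the post.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if PORT_EXHAUSTION_MARKER in str(entry.get("error", "")):
            # "2026-04-03T22:16:07.944910324Z" -> "2026-04-03T22:16"
            minute = str(entry.get("time", ""))[:16]
            counts[minute] += 1
    return counts


if __name__ == "__main__":
    # The first entry is the log line quoted in the post; the second is a
    # deliberately malformed line to show that non-JSON input is skipped.
    sample = [
        '{"time":"2026-04-03T22:16:07.944910324Z","level":"ERROR",'
        '"msg":"failed to set post cache item",'
        '"uri":"at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m",'
        '"error":"dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"}',
        "not a json line",
    ]
    for minute, n in sorted(count_bind_errors_per_minute(sample).items()):
        print(minute, n)
```

Per-minute buckets are an arbitrary choice here. Any resolution that matches the traffic chart would do, and in practice a log aggregation system would likely run this kind of query rather than a script.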
