Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week, when Bluesky went down intermittently for roughly half of our users for about 8 hours.
First, I'd like to apologize to our users for the interruption in service. This is easily the worst outage we've seen in my time here. It's just not acceptable.
Second, if you find this work interesting, we're hiring!
The Problem
The issue actually started earlier that weekend. Here's the Bluesky AppView's requests chart for the days leading up to the really bad day (Monday):
The yellow/green isn't important, but those dips are super nasty! They represent real user-facing downtime. Ouch!
We got a page on Saturday, April 4. I took a look, thinking it was likely a transit issue. We have pretty extensive network monitoring, but it all looked clear.
I did, however, notice a spike in log lines like this in our AppView data backend (called the "data plane"):
{"time": "2026-04-03T22:16:07.944910324Z", "level": "ERROR", "msg": "failed to set post cache item", "uri": "at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m", "error": "dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"}
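The timing of these log spikes lined up with the drops in user-facing traffic, which makes sense. A "bind: address already in use" error on an outbound dial means the client couldn't grab a free local port for the connection. Our data plane leans heavily on memcached to keep load off our main Scylla database, so if we're exhausting ports, that's a huge problem.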
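To make that correlation concrete, here is a minimal sketch of counting these errors per minute so the spikes can be lined up against a traffic chart. It is an illustration only, not the tooling used during the incident. It assumes the logs are available as one JSON object per line with the `time` and `error` fields shown above; the function name `count_bind_errors` and the `SAMPLE_LINE` constant are hypothetical.

```python
import json
from collections import Counter

# The log line quoted above, used here as sample input.
SAMPLE_LINE = (
    '{"time": "2026-04-03T22:16:07.944910324Z", "level": "ERROR", '
    '"msg": "failed to set post cache item", '
    '"uri": "at://did:plc:mhvcx2z27zq2jtb3i7f5beb7/app.bsky.feed.post/3mim4uloar22m", '
    '"error": "dial tcp 127.32.0.1:0->127.0.0.1:11211: bind: address already in use"}'
)


def count_bind_errors(lines):
    """Count 'address already in use' errors, bucketed by UTC minute.

    Hypothetical helper: assumes each line is a JSON object with
    'time' (RFC 3339 string) and 'error' (string) fields.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        if "address already in use" not in str(record.get("error", "")):
            continue
        # "2026-04-03T22:16:07.9..." -> "2026-04-03T22:16"
        minute = str(record.get("time", ""))[:16]
        counts[minute] += 1
    return counts


if __name__ == "__main__":
    for minute, n in sorted(count_bind_errors([SAMPLE_LINE]).items()):
        print(minute, n)
```

Bucketing by minute is an arbitrary choice; a coarser or finer bucket may match the resolution of whatever request chart the counts are compared against.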