One of Mux Video’s most distinguishing features is the ability to Just-In-Time transcode video segments (and thumbnails, storyboards, etc.) during playback. It’s key to our goal of making any uploaded content viewable as quickly as possible, and our customers rely on it to create snappy experiences for their users.
Building a video platform that can do this requires a lot of moving parts: workers handling the actual encoding, storage and replication, low-latency transmission and streaming of segments as we generate them, and CDN caching and distribution, to name a few. Of course, doing this at scale means doing all of the above and more in a highly distributed system, which inevitably invites our friend Murphy and his ever-present law to the party.
Let’s talk about why we’re here. Between January 8th and February 4th, roughly 0.33% of audio and video segments played back across all VOD assets were served in a corrupted state. The ensuing behavior likely varied between players and depended on the degree to which the segments were incomplete, but in general some viewers experienced brief audio dropouts or visual stuttering during playback. No source video data was lost, and all affected assets have been fully remediated.
Nobody likes incidents, and unfortunately nobody is immune to them. We take every incident seriously but this one in particular had a combination of wide-ranging impact and duration that fell short of our standards. We've fixed the immediate causes and remediated every affected asset, but we're still investigating exactly why our systems behaved the way they did under load. We're sharing what we know now because we believe transparency matters more than having all the answers.
You should never have to worry about Mux internals if you're building on our platform, but the challenges here are interesting and provide us an opportunity to be honest about what we're doing to improve.
To set the stage for what went wrong, it helps to know a bit about how our storage and transcoding systems interact.
When we encode renditions for streaming, we read the source frames from a higher-quality source file we store internally and refer to as the “mezzanine.” Segments typically get generated in parallel, so our encoders are often concurrently reading overlapping portions of the same mezzanine file.
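To make the concurrency concrete, here is a minimal sketch of segments being rendered in parallel, each worker reading its own (possibly overlapping) byte range of the same mezzanine source. Everything here is illustrative: the names, the overlap behavior, and the byte arithmetic are assumptions for the example, not Mux's actual encoder code.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the stored mezzanine source file.
MEZZANINE = bytes(range(256)) * 4

def read_range(start, length):
    # Each encoder reads independently; ranges may overlap.
    return MEZZANINE[start:start + length]

def encode_segment(seg_index, seg_bytes=200, overlap=16):
    # Overlapping reads: a segment may also pull a little of its
    # neighbor's data, e.g. for frames that straddle the boundary.
    start = max(0, seg_index * seg_bytes - overlap)
    src = read_range(start, seg_bytes + overlap)
    return seg_index, len(src)

# Segments are generated concurrently, all against one source.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = sorted(pool.map(encode_segment, range(5)))
```

The point of the sketch is simply that many readers touch the same bytes at once, which is why the storage layer underneath has to tolerate concurrent overlapping reads.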
At playback time, our HLS delivery services request a particular segment from our storage system. If the segment doesn’t exist yet, a request is made to our JIT services to generate it, while the delivery service waits to start receiving the segment data from our storage system. Here is a very high-level visual of the request flow:
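Beyond the diagram, the same read-through miss path can be sketched in a few lines. The class and function names below are hypothetical stand-ins, not Mux's actual services or APIs:

```python
class SegmentStore:
    """In-memory stand-in for the storage system."""
    def __init__(self):
        self._segments = {}

    def get(self, key):
        return self._segments.get(key)

    def put(self, key, data):
        self._segments[key] = data

class JitEncoder:
    """Stand-in for the JIT service: renders a segment on demand."""
    def encode(self, key):
        return f"encoded:{key}".encode()

def fetch_segment(store, jit, key):
    """Serve from storage if present; otherwise have JIT generate the
    segment, then read the freshly written bytes back from storage."""
    data = store.get(key)
    if data is None:
        store.put(key, jit.encode(key))  # JIT writes into storage
        data = store.get(key)            # delivery reads it back
    return data

store, jit = SegmentStore(), JitEncoder()
first = fetch_segment(store, jit, "asset1/video/seg_00042.ts")   # miss: JIT path
second = fetch_segment(store, jit, "asset1/video/seg_00042.ts")  # hit: storage
```

The first request pays the transcode cost; subsequent requests for the same segment are served straight from storage (and, in practice, the CDN in front of it).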
While we’re here, let’s take a sneak peek at the new Mux Video Storage system. We’re in the final stretch of rolling it out to all customers and production traffic (we owe at least one blog post on it soon), and it consists of the following components:
- storage-worker, which acts as a read/write cache in front of object storage.