
OOMProf: Profiling on the Brink


It was just a little while past the Sunset Strip
They found the girl's body in an open pit
Her mouth was sewn shut, but her eyes were still wide
Gazing through the fog to the other side

"Black River Killer" by Blitzen Trapper

Introduction

This one's personal! In 15 years of working on DBMS systems, the OOM killer has led me down more than its fair share of debugging rabbit holes. Anyone who's been around the block in Linux systems programming has probably crossed paths with the Linux OOM killer: the part of the kernel that tries to maintain forward progress when faced with the impossible situation of applications wanting more memory than the system has. The OOM killer balances a lot of competing interests, but in the end it just picks a victim process and kills it.
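That victim selection isn't magic, by the way: the kernel exposes a per-process "badness" score and a user-tunable bias through /proc. Here's a tiny Go sketch (just an illustration, not part of OOMProf) that prints both for the current process.

```go
// Minimal sketch: read the OOM killer's per-process inputs from /proc.
// /proc/<pid>/oom_score is the badness score the kernel computes;
// /proc/<pid>/oom_score_adj is the user-settable bias added to it.
package main

import (
	"fmt"
	"os"
	"strings"
)

func readProcFile(pid int, name string) string {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/%s", pid, name))
	if err != nil {
		return "unavailable"
	}
	return strings.TrimSpace(string(data))
}

func main() {
	pid := os.Getpid()
	fmt.Printf("pid %d oom_score=%s oom_score_adj=%s\n",
		pid,
		readProcFile(pid, "oom_score"),
		readProcFile(pid, "oom_score_adj"))
}
```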

You're a busy professional, so the TL;DR is: we built an OOM monitoring system called OOMProf in eBPF that profiles Go programs at the point they are OOM killed, capturing allocations up to the bitter end to give developers a better idea of exactly what went wrong. If you're lazy, or unprofessional, or just want to know more about how the sausage was made, read on! And as an extra bonus, we've littered this blog post with stanzas from a great modern folk murder ballad for you to enjoy!
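To give a flavor of what hooking the OOM killer from user space can look like, here's a hedged Go sketch using the cilium/ebpf library to attach a program to the kernel's oom/mark_victim tracepoint, which fires when the OOM killer selects a victim. The object file name and program name here are hypothetical; this is the general shape of such a hook, not OOMProf's actual code.

```go
// Sketch: attach a BPF program to the oom/mark_victim tracepoint so it
// runs right as the kernel commits to killing a process.
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
)

func main() {
	// Hypothetical compiled BPF object containing a tracepoint program
	// named "handle_mark_victim".
	coll, err := ebpf.LoadCollection("oom_hook.bpf.o")
	if err != nil {
		log.Fatalf("loading BPF collection: %v", err)
	}
	defer coll.Close()

	prog := coll.Programs["handle_mark_victim"]
	if prog == nil {
		log.Fatal("program handle_mark_victim not found in object")
	}

	// Attach to the oom:mark_victim tracepoint.
	tp, err := link.Tracepoint("oom", "mark_victim", prog, nil)
	if err != nil {
		log.Fatalf("attaching tracepoint: %v", err)
	}
	defer tp.Close()

	// A real tool would now read victim pids off a ring buffer and go
	// inspect the doomed process's heap before it disappears.
	select {}
}
```

Running something like this needs root (or CAP_BPF and friends) plus a compiled BPF object; the interesting part for a profiler is what happens after the tracepoint fires, which is what the rest of this post is about.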

They booked me on a whim and threw me deep in jail
With no bail, sitting silent on a rusty pail
Just gazing at the marks on the opposite wall
Remembering the music of my lover's call

The Problem With OOM Kills

The problem with OOM killed programs is that the developers who have to explain what happened have very little to go on. First of all, the application that causes the OOM killer to fire may not be the process that gets killed. And in the more common case, where the application getting killed really did do something to deserve it, the straw that breaks the camel's back may be some innocuous common allocation, or even a memory-allocator-induced page fault, entirely unrelated to the actual pile-up.

These are often mysterious crashes that linger in the bug database and lead to a lot of developer shoulder shrugs. Long, slow leak-style bugs can usually be drawn out and spotted with sampling memory profilers (happily available in production environments in most ecosystems these days), but when things go off the rails quickly it can be surprisingly difficult to get clues as to what happened. Imagine some service dies, which starts causing retry loops and work queues to pile up quickly. If you have a distributed system processing millions of things at a time, it's amazing how fast things can unravel when certain failures aren't planned for or handled properly. Of course it doesn't have to be complicated: an OOM can happen simply because a program tries to allocate some huge amount of memory after failing to validate user input. So even a generous continuous memory profiling solution won't save you when your program dies quickly.
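For context, here's roughly what that "happily available" sampling setup looks like in Go using only the standard library: an always-on heap profiler you can scrape over HTTP. It's great for slow leaks, but if the process is OOM killed seconds into a blow-up, there may be nothing left to scrape.

```go
// Sketch: always-on sampling heap profiling with Go's standard library.
// Heap profiles become available at http://localhost:6060/debug/pprof/heap.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers
	"runtime"
)

func main() {
	// Sample roughly one allocation record per 512 KiB allocated (the
	// default); lower values give finer-grained profiles at higher cost.
	runtime.MemProfileRate = 512 * 1024

	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```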

So you make no mistake, I know just what it takes
To pull a man's soul back from heaven's gates
I've been wandering in the dark about as long as sin
But they say it's never too late to start again
