We’ve been working on the Linux version of Superluminal (a CPU profiler) for a while now, and we’ve been in a private alpha with a small group of testers. That was going great, until one of our testers, Aras, ran into periodic full system freezes while capturing with Superluminal.
We always pride ourselves on Superluminal “Just Working”, and this was decidedly not that, so we of course went hunting for what turned out to be one of the toughest bugs we’ve faced in our careers.
The hunt led us deep into the internals of the Linux kernel (again), where we learned more about spinlocks in the kernel than we ever expected to know, and we ended up helping to find & fix a number of issues along the way.
Initial analysis
The problem Aras was running into was that on his Fedora 42 machine (kernel 6.17.4-200), the system would periodically freeze for short periods while a Superluminal capture was running:
It’s really difficult to remotely debug issues like this, so we first attempted to reproduce the issue in a VM. However, after several attempts with various Fedora versions and kernels, we were unable to. We finally tried installing Fedora on a physical machine, and we were able to reproduce it there.
Now that we have a repro, we can start looking into the issue in earnest.
Since the machine is periodically freezing while capturing with Superluminal, we can start by looking at what the capture looks like after opening it. Opening such a capture, we get this:
This shows the timeline for each thread in the process. Green means the thread is scheduled in and actively executing on a CPU; any other color means the thread is scheduled out and waiting for something. We can immediately spot some suspicious-looking areas in the capture (marked in blue) where every thread in the process appears to be busy for roughly the same amount of time, which doesn’t match the workload being profiled.
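To make that "every thread is busy at once" pattern concrete, here is a rough sketch of how one could flag such windows programmatically. This is not how Superluminal does it; the input format (per-thread lists of busy intervals in milliseconds), the thread names, and the `all_busy_windows` helper are all hypothetical and exist purely for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms), hypothetical format


def all_busy_windows(
    busy: Dict[str, List[Interval]], min_ms: float = 250.0
) -> List[Interval]:
    """Return windows of at least min_ms in which every thread is scheduled in.

    Assumes each thread's own intervals do not overlap each other.
    """
    if not busy:
        return []

    # Turn every interval into a +1 (scheduled in) and -1 (scheduled out) event.
    events = []
    for intervals in busy.values():
        for start, end in intervals:
            events.append((start, 1))
            events.append((end, -1))

    # At equal timestamps, process starts before ends so that back-to-back
    # intervals on the same thread do not split a window.
    events.sort(key=lambda e: (e[0], -e[1]))

    thread_count = len(busy)
    running = 0
    window_start = None
    windows = []
    for time, delta in events:
        running += delta
        if running >= thread_count and window_start is None:
            window_start = time  # all threads are now busy
        elif running < thread_count and window_start is not None:
            if time - window_start >= min_ms:
                windows.append((window_start, time))
            window_start = None
    return windows


# Made-up example data: three threads, one long stretch where all are busy.
example = {
    "main": [(0, 400), (500, 900)],
    "worker-1": [(100, 380), (450, 1000)],
    "worker-2": [(90, 420), (600, 700)],
}
suspicious = all_busy_windows(example)
```

With this made-up data, only the stretch from 100 ms to 380 ms qualifies; the later overlap between 600 ms and 700 ms is too short to count. On a real capture, a long window like this across all threads is suspicious precisely when the workload itself has no reason to keep every thread running at once.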
Each of these areas is 250+ milliseconds where the CPU appears to be fully busy. Zooming in on one of these sections and expanding a thread, we see this: