%CPU utilization is a lie


I deal with a lot of servers at work, and one thing everyone wants to know about their servers is how close they are to being at max utilization. It should be easy, right? Just pull up top or another system monitor tool, look at network, memory and CPU utilization, and whichever one is the highest tells you how close you are to the limits.
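For reference, top and similar tools derive that %CPU figure from the kernel's time-accounting counters, not from any measure of work completed. A rough sketch of the calculation on Linux, reading /proc/stat directly (an approximation of what top does, not its exact logic):

```python
import time

def read_cpu_times():
    # First line of /proc/stat: aggregate jiffies since boot, in the order
    # user nice system idle iowait irq softirq steal (plus guest fields).
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    idle = fields[3] + fields[4]      # idle + iowait count as "not busy"
    return idle, sum(fields[:8])      # skip guest fields (already folded into user/nice)

# Sample twice; the busy fraction over the interval is roughly what
# top reports as overall CPU utilization.
idle_a, total_a = read_cpu_times()
time.sleep(1)
idle_b, total_b = read_cpu_times()
print(f"CPU utilization: {1 - (idle_b - idle_a) / (total_b - total_a):.1%}")
```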

And yet, whenever people actually try to project these numbers, they find that CPU utilization doesn't quite increase linearly. But how bad could it possibly be?

To answer this question, I ran a bunch of stress tests and monitored both how much work they did and what the system-reported CPU utilization was, then graphed the results.

For my test machine, I used a desktop computer running Ubuntu with a Ryzen 9 5900X (12 core / 24 thread) processor. I also enabled Precision Boost Overdrive (i.e. Turbo).

I vibe-coded a script that runs stress-ng in a loop, first using 24 workers and attempting to run each of them at utilizations from 1% to 100%, then using 1 to 24 workers all at 100% utilization. It used different stress-testing methods and measured the number of operations that could be completed ("Bogo ops").

The reason I used two different approaches is that operating systems are smart about how they schedule work: a small number of workers at 100% utilization can be scheduled optimally (spoilers), but with 24 workers all at 50% utilization it's hard for the OS to do anything other than spread the work evenly.
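A minimal sketch of what such a harness could look like, assuming stress-ng is installed (the flags are real stress-ng options, but the output parsing is best-effort and the author's actual script surely differs):

```python
import csv
import subprocess

def run_stress(workers, load, method="all", seconds=30):
    """Run stress-ng and return the "bogo ops" count it reports.

    --cpu-load caps each worker's duty cycle, --cpu-method selects the
    stress test, and --metrics-brief prints a summary including bogo ops.
    """
    result = subprocess.run(
        ["stress-ng", "--cpu", str(workers), "--cpu-load", str(load),
         "--cpu-method", method, "--timeout", f"{seconds}s",
         "--metrics-brief"],
        capture_output=True, text=True,
    )
    # Best-effort parse of the metrics summary; the exact layout varies
    # between stress-ng versions, so scan both output streams for the
    # "cpu" stressor row and take the number that follows it.
    for line in (result.stdout + result.stderr).splitlines():
        parts = line.split()
        if "cpu" in parts:
            i = parts.index("cpu")
            if i + 1 < len(parts) and parts[i + 1].replace(".", "", 1).isdigit():
                return float(parts[i + 1])
    return 0.0

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["workers", "load", "bogo_ops"])
    # Sweep 1: 24 workers, per-worker target load from 1% to 100%
    for load in range(1, 101):
        writer.writerow([24, load, run_stress(24, load)])
    # Sweep 2: 1 to 24 workers, each at 100% load
    for workers in range(1, 25):
        writer.writerow([workers, 100, run_stress(workers, 100)])
```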

You can see the raw CSV results here.

The most basic test just runs all of stress-ng's CPU stress tests in a loop.

You can see that when the system is reporting 50% CPU utilization, it's actually doing 60-65% of the maximum work it can do.
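That comparison is just each run's measured throughput divided by the throughput of the best run; continuing with the hypothetical results.csv from the sketch above, the normalization looks roughly like:

```python
import csv

# Express each run's bogo ops as a fraction of the best run, so it can be
# compared against the CPU utilization the system reported at the time.
with open("results.csv") as f:
    rows = list(csv.DictReader(f))

max_ops = max(float(row["bogo_ops"]) for row in rows)
for row in rows:
    fraction = float(row["bogo_ops"]) / max_ops
    print(f"{row['workers']} workers @ {row['load']}% load: {fraction:.0%} of max work")
```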

But maybe that one was just a fluke. What if we just run some random math on 64-bit integers?
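With the hypothetical harness above, that would amount to restricting the run to stress-ng's int64 cpu method, e.g.:

```python
# Same sweep as before, but limited to 64-bit integer operations.
ops = run_stress(workers=24, load=50, method="int64")
```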
