October 20, 2025
The Death of Thread Per Core
Programming language async runtimes are focused on handling asynchronous, possibly long-running tasks that might yield for a variety of reasons and that might themselves spawn future work.
In an async runtime like async Rust, the model is that a task can yield, which, conceptually, creates a new piece of work that gets shoved onto the work queues ("resume that task"). You might not think of it as "this task is suspended and will be resumed later" so much as "this piece of work is done and has spawned a new piece of work." This new piece of work gets pushed onto a local queue for later processing by the same thread. The primary distinction between thread-per-core approaches and work-stealing approaches is that in work-stealing models, if one thread doesn't have enough work to do, it can "steal" a task from another thread's queue and move it over to its own.
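To make those queue mechanics concrete, here is a minimal sketch of a per-thread queue plus a stealer handle, using the crossbeam_deque crate. This is only an illustration of the stealing idea under that assumption, not how any particular runtime implements its scheduler.

```rust
use crossbeam_deque::{Steal, Worker};

fn main() {
    // Each worker thread owns a local queue; other threads hold a Stealer handle to it.
    let local: Worker<u32> = Worker::new_fifo();
    let stealer = local.stealer();

    // The owning thread pushes newly spawned pieces of work onto its own queue...
    for task in 0..4 {
        local.push(task);
    }

    // ...and an idle thread can pull tasks out of that queue through the stealer.
    let thief = std::thread::spawn(move || {
        let mut stolen = Vec::new();
        loop {
            match stealer.steal() {
                Steal::Success(task) => stolen.push(task),
                Steal::Empty => break,
                Steal::Retry => continue, // lost a race; try again
            }
        }
        stolen
    });

    println!("stolen: {:?}", thief.join().unwrap());
}
```

Remove the stealer (each thread only ever touches its own Worker) and you have the thread-per-core discipline instead.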
This model has several immediate consequences:
It has to be okay to move those pieces of work across thread boundaries. This is the cause of people's frustration with their futures having to be Send in Rust (see the sketch after this list).
Work can be more evenly balanced. If stealing isn't allowed, then there might be a thread, or a handful of threads, with a long work queue, while all the others (and their associated CPU cores) sit idle. Stealing is an elegant solution to that problem.
If any task can be stolen from any other thread, you lose certain locality guarantees: if you know stealing isn't allowed, and two tasks both operate on similar data, you might hope that they can benefit from sharing cache lines.
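As a concrete illustration of the Send point in the first consequence above, here is a small sketch assuming the Tokio runtime: a future that holds an Rc across an .await is not Send, so it cannot be handed to the work-stealing spawner, but it can be pinned to a single thread with a LocalSet.

```rust
use std::rc::Rc;
use tokio::task::{self, LocalSet};

// A future that holds a non-Send value (an Rc) across an .await point,
// which makes the whole future non-Send.
async fn uses_rc() -> u32 {
    let shared = Rc::new(1u32);
    task::yield_now().await; // `shared` is still alive across this yield
    *shared
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    // tokio::spawn(uses_rc()); // rejected at compile time: the work-stealing
    //                          // spawner requires the future to be Send

    // Pinning the task to the current thread lifts the Send requirement.
    let local = LocalSet::new();
    let out = local
        .run_until(async { task::spawn_local(uses_rc()).await.unwrap() })
        .await;
    println!("{out}");
}
```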
In the data processing world, for a couple of years there it seemed like the needle had firmly swung in the direction of thread-per-core. Yes, of course you should partition your data across threads: cross-core data movement is the only enemy! Of course skewed data is a problem to be solved at a higher level; the data processing layer is optimized to scream through all the data you give it, so needing to be friendly in how you dish that work out is a small price to pay.
If your keys are basically random, this is great: the benefits are real, data tends to stay in cache, you don’t need slow MESI messages creating contention, and implementation is often dramatically simplified by restricting parallelism to very specific points in the code.
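To make that partitioning concrete, here is a minimal sketch of hash-based key sharding, the routing step a thread-per-core pipeline leans on. The shard_for helper and the shard count are hypothetical names for illustration, not taken from any particular system.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical routing step: each key is hashed to a fixed shard, so all work
// for that key stays on one core's queue and its state never has to cross a
// cache-coherence boundary.
fn shard_for<K: Hash>(key: &K, num_shards: usize) -> usize {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    (hasher.finish() as usize) % num_shards
}

fn main() {
    let num_shards = 4;
    let mut queues: Vec<Vec<&str>> = vec![Vec::new(); num_shards];

    // With roughly uniform keys the shards stay balanced; a skewed key
    // distribution would pile work onto one queue while the others sit idle.
    for key in ["user:1", "user:2", "user:3", "user:4", "user:5"] {
        queues[shard_for(&key, num_shards)].push(key);
    }

    for (i, q) in queues.iter().enumerate() {
        println!("shard {i}: {q:?}");
    }
}
```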