
Restartable Sequences

Why This Matters

Restartable sequences (rseq) are an emerging system programming technique that enables lock-free, thread-safe data structures, significantly improving performance on multi-core processors. While currently limited to Linux with handwritten assembly, future support across operating systems and languages promises widespread adoption, leading to substantial speedups in critical applications like memory allocation and matrix computations.


May 31st, 2026 @ justine's web page


The best kept secret at the frontier of system programming right now is the Linux 4.18+ (c. 2018) concept of restartable sequences, or rseq for short. They allow you to create thread-safe data structures, without locks or atomics, that scale to microprocessors with many cores.

It's currently only possible to use rseq on Linux, using handwritten assembly code. However, I believe that in the future all operating systems will be updated to support rseq(), all system programming languages will be redesigned to be able to express restartable sequences, and all data structure libraries will be rewritten to use them.

So far the only software I've seen using rseq is tcmalloc, jemalloc, glibc, and cosmopolitan. That's destined to change now that microprocessors with 128 or even 192 cores are becoming inexpensive. For example,

On my $160 Raspberry Pi 5 (which has 4 cores), rseq makes my malloc() implementation 3x faster versus having a dlmalloc mspace assigned to each thread. For most developers, that's a take it or leave it kind of improvement. However,

On my $4,834 System76 Thelio Astra with Ampere's 128 core 3GHz Altra CPU, rseq makes cosmopolitan malloc() go 34x faster (compared to sharding ops over an array of mspaces using sched_getcpu()%32)

On my $17,628.55 AMD Threadripper Pro 7995WX with 96 cores, rseq makes my malloc() 43x faster (versus using that same sched_getcpu() mutex sharding technique)
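For readers unfamiliar with the baseline these numbers are measured against, here is a minimal sketch of mutex sharding keyed on sched_getcpu()%32. It is not the author's allocator code: it shards a plain counter instead of dlmalloc mspaces, and the names shard_add, shards_sum, and run_demo, along with the thread and iteration counts, are hypothetical and chosen only for illustration. The point to notice is that a thread may migrate to another CPU between the sched_getcpu() call and the lock acquisition, so each shard still needs a mutex. That lock is the cost restartable sequences are meant to remove.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* exposes sched_getcpu(); must come before any include */
#endif
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define NSHARDS  32      /* matches the %32 mentioned in the text */
#define NTHREADS 8       /* illustrative value */
#define ITERS    100000  /* illustrative value */

/* One lock plus its data per shard, aligned to 64 bytes (an assumed
   cache line size) so neighboring shards do not share a line. */
struct shard {
    _Alignas(64) pthread_mutex_t lock;
    long count;
};

static struct shard shards[NSHARDS];

static void shards_init(void) {
    for (int i = 0; i < NSHARDS; i++) {
        pthread_mutex_init(&shards[i].lock, NULL);
        shards[i].count = 0;
    }
}

/* Pick a shard from the CPU we are running on right now. The thread can
   be moved to another CPU before the lock is taken, so the CPU number is
   only a hint for spreading contention and the mutex is still required. */
static void shard_add(long n) {
    int cpu = sched_getcpu();
    if (cpu < 0)
        cpu = 0; /* fall back to shard 0 if the call fails */
    struct shard *s = &shards[cpu % NSHARDS];
    pthread_mutex_lock(&s->lock);
    s->count += n;
    pthread_mutex_unlock(&s->lock);
}

/* Add up every shard, taking each lock in turn. */
static long shards_sum(void) {
    long total = 0;
    for (int i = 0; i < NSHARDS; i++) {
        pthread_mutex_lock(&shards[i].lock);
        total += shards[i].count;
        pthread_mutex_unlock(&shards[i].lock);
    }
    return total;
}

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        shard_add(1);
    return NULL;
}

/* Start NTHREADS workers, wait for them, and return the combined count. */
long run_demo(void) {
    pthread_t th[NTHREADS];
    shards_init();
    for (int i = 0; i < NTHREADS; i++) {
        int rc = pthread_create(&th[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
    return shards_sum();
}
```

Built with -pthread and called from main(), run_demo() should return NTHREADS * ITERS no matter how the scheduler moves threads around, because every update happens under its shard's lock. On a machine with more than 32 cores, several CPUs map to the same shard and contend for one mutex, which is consistent with the baseline falling further behind as the core count rises.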

System programmers who don't have a workstation like the ones above are going to be left behind like dinosaurs, with no opportunity to pluck the low hanging fruit of 10x performance optimizations. For example, I wouldn't have been able to pull off the speedups I made to matrix multiplication last year if I hadn't splurged on a 96 core CPU. It put me in the poor house for a few months (since the cheaper Ampere workstations weren't available at the time) but was so worth it, since my work received press coverage, it made me famous in the AI community, it helped my project get adopted by 32% of organizations, and even earned me a job offer from Google to work in their Gradient Canopy improving TPU performance for Gemini.

If you do have one of these microprocessors, then restartable sequences are going to be one of the most important tricks you'll use to exploit its capabilities. This tutorial will show you how they work, and provide you with a concrete example for pushing and popping which can be immediately useful.
