Bringing restartable sequences out of the niche

The restartable sequences feature, which was added to the 4.18 kernel in 2018, exists to enable better performance in certain types of threaded applications. While there are users of restartable sequences, they tend to be relatively specialized code; this is not a tool that most application developers reach for. Over time, though, the use of restartable sequences has grown, and it looks set to grow further as the feature is tied to new capabilities provided by the kernel. As restartable sequences become less of a niche feature, though, some problems have turned up; fixing one of them may involve an ABI change visible in user space.

A restartable sequences overview

As the number of CPUs in a system grows, so does the desire to write and run highly parallel programs. In user space, as in the kernel, concurrent code using locks eventually runs into scalability problems, leading to an interest in the use of lockless algorithms instead. The kernel has a distinct advantage when it comes to lockless concurrent access, though, in that code running in kernel space can prevent interruption by interrupts, preemption, or migration to another CPU during critical sections; user space has no such guarantees. So any user-space lockless algorithm must work correctly in an environment where code execution can be interrupted at any time. Restartable sequences are one solution to this problem. To use this feature, an application must designate a critical section that does some work, culminating in a single atomic instruction that commits whatever change is being made.
One of the earliest uses of restartable sequences was a memory allocator that would, within its critical section, identify an object to allocate from a per-CPU list, then remove it from that list with an atomic compare-and-exchange operation. It would obviously be problematic if this critical section were to be interrupted between the identification of the object to allocate and the final step committing the change.

To work around this problem, the application informs the kernel of the critical section by filling in a structure (struct rseq_cs) describing the address range of the instructions within that section, along with a special abort address. If the kernel interrupts the execution of the relevant thread (to preempt it, move it to another CPU, or deliver a signal), it will, on return to user space, check for the existence of this structure. If the current instruction pointer lies within the critical section, the kernel will cause execution to jump to the abort address. The thread can then clean up after its aborted attempt and restart the operation. The hope, of course, is that most runs through the critical section will complete without interruption, accomplishing their job in about the fastest way possible. The extra overhead of dealing with aborts only enters the picture in cases where the critical section is interrupted for some reason.

Using restartable sequences requires setting up a shared memory area with the rseq() system call. The actual critical section almost certainly must be written in assembly. Until relatively recently, there was no support for restartable sequences in the GNU C Library (glibc), though that has been changing. Given the effort that is required to use this feature properly, and given that it is a Linux-only feature, it is not surprising that relatively few application developers have made use of it.
Not so niche anymore

The shared-memory area set up by rseq() allows a thread to tell the kernel when it is executing in a critical section, but the communication through that area goes both ways. Whenever the kernel runs a thread, it will update this area to reflect which CPU and NUMA node the thread ended up on. That information is used by glibc to accelerate functions like sched_getcpu(), which can obtain the needed information directly without calling into the kernel. Other uses of restartable sequences within glibc are under consideration; Mathieu Desnoyers (the creator of this feature) has been advocating for increased use of it within glibc to improve scalability for some time.

Recently, there has been interest in another feature intended to help user space write efficient critical sections: time-slice extension. The idea here is that a thread running in a critical section (perhaps holding a lock) could indicate that fact to the kernel as a way of politely asking not to be preempted, even if its time slice runs out. The kernel could, if it is in a good mood, respond by giving the thread a little more time to finish its work and release the lock, so that other threads can proceed. It makes obvious sense to tie a feature like this to restartable sequences, so the time-slice-extension patch set added the "please can I run a little longer" bit to the shared area.

Time-slice extension is a feature that affects the core CPU scheduler, and it is likely to see wider use, so it is not surprising that it has brought more attention to restartable sequences. Thomas Gleixner dug into the patches and, not entirely liking what he saw, put together a proposal for an improved interface for time-slice extension. His changes are likely to be the way this work proceeds in the future, and may merit an article of their own, but Gleixner also found a few things to improve in the restartable-sequences code as well.
The trouble with restartable sequences

His observations in this area led to the posting of a separate patch set aimed at addressing the problems he found with restartable sequences. The end result is significantly improved performance, which is a good thing given that the performance cost of restartable sequences has been starting to attract attention. Jens Axboe, for example, responded to the series saying: "I'd see rseq at 2-3% of overall CPU time last I tested". Kernel developers will work long and hard to eliminate a performance hit much smaller than that.

A couple of the things that Gleixner fixed are good examples of how the combination of history and oversights can slow things down. There are a number of ways in which a user-space thread might lose access to the CPU. The restartable-sequences code in current kernels maintains a field, rseq_event_mask, in each thread's task_struct structure to track those ways; there are separate bits for preemption, the delivery of a signal, and migration to a different CPU. So, for example, if the kernel ends up delivering a signal to a task, it will set the RSEQ_EVENT_SIGNAL bit in rseq_event_mask. When the time comes to return to user space after whatever has happened, rseq_need_restart() will be called to see if a critical section was interrupted; in that case, rseq_event_mask will be consulted to see whether the critical section should be continued, or whether the thread should instead be diverted to the abort address.

Gleixner noticed a couple of interesting problems with this code. One is that the kernel is constantly setting the bits in rseq_event_mask, but those bits are only cleared in rseq_need_restart(), and only when the thread happens to be executing within a critical section at the time. Otherwise those bits just pile up there.
That is likely to result in a spurious restart the next time an interrupt occurs while the thread is executing within a critical section, taking away some of the performance advantage that the restartable sequences feature was added to provide.

Perhaps more tellingly, though, those bits existed to allow a thread to specify which events should cause its critical section to be aborted. Given the nature of any given section, some sorts of events do not risk breaking the algorithm; a thread might be able to handle a signal without risking concurrent access to data needed by its critical sections, for example. But that feature was deprecated for the 6.0 release in 2022; it added complexity without bringing a lot of value. That deprecation was of a relatively severe variety; any process trying to use the feature would be immediately killed with a SIGSEGV signal. So the chances are pretty good that there are no actual users of the feature in the wild now. Gleixner replaced all of that machinery with a single "an event happened" boolean value.

This elimination also allowed the removal of some accesses to user-space memory, which speeds things up. User-space access from the kernel is never entirely cheap, and the need for Spectre mitigations has made it even slower. As a separate optimization, the series changes the remaining user-space accesses to the masked user-space-access primitives (added in 6.12) that carry a lower mitigation cost.

There are two ways that a thread running in user space can end up switching to kernel mode: an interrupt happens, or the thread makes a system call. All of the cases that restartable sequences are meant to handle will be the result of an interrupt of some type or another. Threads in critical sections are not supposed to make system calls at all.
When running on a kernel built with the DEBUG_RSEQ configuration option, making a system call while in a critical section will result in the immediate termination of the offending process — a gentle hint that the restartable sequences feature is not being used correctly. Without that option (as will be the case on production systems), though, the kernel will handle a system-call return in the same way as an interrupt return: if the thread is running in a critical section and an event has occurred, control will be redirected to the abort handler.

Gleixner's work separates the handling of the interrupt and system-call cases, since the time-slice-extension feature will only apply to the former. As an optimization, he removed the restartable-sequences handling for system-call returns; the relevant patch notes:

    This changes the current behaviour, which just blindly fixes up the
    critical section unconditionally in the syscall case. But that's a
    user space problem when it invokes a syscall from within a critical
    section and expects it to work. That code was clearly never tested
    on a debug kernel and user space can keep the pieces.

If there are users that behave in this way, and the code has worked so far, they may be in for an unpleasant surprise once this change goes in. That would likely become the cause of an interesting conversation; "user space was always broken" is not normally considered an acceptable argument for causing a working program to break. There is a good chance that this change will have to be reverted if trouble appears. Given the relatively limited use of restartable sequences so far, though, there is reason to hope that no such user exists.

It may be a little while until we find out, anyway. This work will need review, including, presumably, a look at how it interacts with the time-slice-extension work. For extra fun, Gleixner has let it be known that he has found more things to fix, so further patches will be forthcoming.
But, at their core, the current changes appear to make sense, addressing problems that have managed to sneak into the code over the years. Almost any body of code can benefit from a restart with a fresh set of eyes.