Phil Eaton's book club is starting The Art of Multiprocessor Programming, 2nd Edition, a well-regarded textbook that was updated fairly recently (2021). I've even heard of a couple of the authors. I've done a lot of concurrent programming, and I've always felt like I still have plenty to learn, so I was excited about the topic.

So far, what I've learned is that I would never recommend this book, despite its merits.

Academia certainly struggles to find the right balance between teaching foundational principles and practical information. But since this book explicitly targets fourth-year undergraduates and grad students, it should definitely cover the fundamentals, right?

So how the heck could it not cover the futex??

Isn't that just a type of mutex? Surely the futex can't be THAT important? The name sure sounds like "mutex", and that is where the name comes from: "fast, user-space mutex". But it isn't really a mutex; it's a building block for concurrency primitives, one that ushered in a modern world of concurrent performance that makes System V (sysv) feel so old and busted it seems wrong to even call it a dinosaur 🦖, as if it were once mighty.

Way back in the last millennium, locking primitives were pretty much all based on the System V IPC (inter-process communication) code, specifically its semaphores. Common concurrency primitives were over-complicated under the hood, and they just didn't scale well to large numbers of threads. Until Linux added the futex.

Going back to the original futex paper from 2002, it was immediately clear that the futex was a huge improvement in highly concurrent environments. In that original paper alone, tests with 1000 parallel tasks ran 20–120 times faster than sysv locks. 🤯

Needless to say, other common operating systems followed suit, including Windows in 2012 and macOS by 2016. These days, any good locking primitive is going to be based on a futex. You should expect system libraries like pthreads to use the futex extensively.

## Wait, a futex isn't a mutex? So what is it, then?

When there's too much contention for a lock, actively trying to acquire it as fast as possible just eats CPU that other threads could be using for real work, all to sit there and wait. It'd be nice to free up that CPU and put the thread to sleep until the lock is ready.

Unless you're essentially going to build your own thread scheduler in user space (which is basically what async implementations do at their core), you either need to poll, which is inefficient, or you need operating system support.

In sysv IPC, the core tool for getting OS-supported blocking when building higher-level concurrency primitives was the semaphore, which intertwined locking and waiting. The futex essentially separates the locking from the waiting (and waking) of tasks.

The flexibility you get from separating those two concerns is key to good lock performance. It becomes much easier to avoid unnecessary delays (like sleeps with exponential backoff) and bottlenecks, particularly system calls themselves, which are quite expensive compared to most of the code involved in locking. For instance, if you're building a mutex, your unlock operation can skip calling into the kernel entirely if you can be confident in userland that no threads are waiting. In sysv land, an unlock was always a system call.
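For contrast, here's roughly what the acquire and release paths of a sysv-semaphore-based lock look like. This is a minimal, hypothetical sketch (the function names are mine, and setup and error handling are elided); the point is simply that both directions are unconditional trips into the kernel via semop():

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

// Setup is elided: semget() creates the set, and the semaphore starts with
// a value of 1 (meaning "unlocked").

static void
sysv_lock_acquire(int semid)
{
    // Decrement the semaphore, blocking in the kernel if it's already 0.
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

    semop(semid, &op, 1);    // Unconditionally a system call.
}

static void
sysv_lock_release(int semid)
{
    // Increment the semaphore, waking a blocked waiter if there is one.
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };

    semop(semid, &op, 1);    // Also unconditionally a system call.
}
```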
Essentially, the futex wait() call allows a task to block, queuing that task inside the kernel on a list associated with a particular memory address, with an optional timeout. The wake() operation dequeues threads from that internal list so they run again. You can choose how many to wake, but in practice code wakes either one or all of them, never anything in between.

People often describe the futex as "wait on memory address". That overlooks the notification side, but it's a much more apt name, and it's why Windows' name for this API (WaitOnAddress) is superior API naming (to be fair, they did have a decade to think about it).

The memory address you're waiting on tends to be a 32-bit integer, sometimes a 64-bit one. The OS doesn't care much about the semantics of the value inside that integer. However, and very importantly, when calling the wait() operation, the caller must present what it thinks the value of the futex is. If the value has changed, the operation fails. Providing the value protects you from waiting for a wake that has already happened.

Whenever you need to wait, you're waiting for some particular state to occur. The OS doesn't care about the specifics of how your state is encoded in the futex; since it's responsible for the order of operations on the futex, it simply won't enqueue a waiter that has an incorrect view of the current state.

For example, say you're a thread that wants to wait for state Y (which might be availability of a lock), and you have checked and believe the state is X (say… locked). Between the time you checked and the time of the wait, the state could actually have become Y. And if you're using a more complicated concurrency primitive like a reader-writer lock, the state could have changed to Z. The OS doesn't need to care; it just knows when a task needs to take a long, hard look at its decisions. 👀🪞

## Futex: the ex-future (i.e., the present)

We've established that it's important, so let's supplement the book with some of the content it should have had.

Being a low-level primitive, futexes are certainly hard to get right. The OS will make sure all operations on the futex are, from its perspective, well ordered, but when you modify state, not only do you have to have confidence in your algorithm, you also have to worry about the compiler or the hardware performing relevant operations out of order or in an overlapping way.

Still, it's not too hard to build a basic mutex on top of a futex, and it's a good exercise to start showing how we can often avoid unnecessary system calls.

Let's start by building ourselves a little wrapper for the futex that gives us access to the core functionality and works across Linux and macOS. Other OSes are cool and all, yet still left as an exercise for the reader.

Our futex will be a 32-bit integer (I believe this is still the only size on Linux, so it's the most portable). But when we use the thing for state management, we will want to play life safe and ensure that any work we do on its contents is guaranteed to be sequentially consistent, where there's a linear order to the operations, and no thread could have seen anything inconsistent with that order.

> 🏃🏿‍♂️ While many algorithms can work with much weaker consistency guarantees, we will generally avoid memory-ordering optimization because it's so error prone, and it often doesn't make a significant difference for real-world programs. Just say no to premature optimization.
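To make that concrete before we write real wrappers, here's the canonical shape of a futex wait loop, in pseudo-C. The names are placeholders: futex_wait() stands in for the wrapper we're about to build, and DESIRED is whatever encoding of "the state I'm waiting for" your primitive uses:

```c
_Atomic uint32_t futex;                   // The shared state word.

uint32_t observed = atomic_load(&futex);  // What we believe the state is.

while (observed != DESIRED) {
    // futex_wait() only sleeps if the futex still holds `observed` when the
    // kernel checks; otherwise it returns right away, so we can't sleep
    // through a wake that already happened.
    futex_wait(&futex, observed);
    observed = atomic_load(&futex);
}
// The state we were waiting for has arrived (or at least existed recently);
// now we act on it, typically with another atomic operation.
```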
To make it easier to get the full sequentially consistent experience, we will explicitly declare our futex to be atomic, even though the APIs we call probably won't declare it as such in their formal parameter declarations. We'll generally use explicit atomic calls for operations, but declaring the variable atomic ensures that, when you don't explicitly use an atomic operation to access it, you get one anyway.

```c
#include <stdint.h>     // for uint32_t
#include <stdatomic.h>

typedef _Atomic(uint32_t) h4x0r_futex_t;
```

On Linux, the main system call for the futex is a Swiss Army knife with many commands and many options. Some features of the API turned out to be bad ideas, but they endure because nobody wants to break code that depends on them. While the futex isn't too hard in concept, the complexity of the Linux API is pretty staggering. We'll skip over all of that complexity and just write a wrapper for the basic functionality, under the assumption that you're going to use futexes to build things within the context of a single process with threads (e.g., we won't worry about locks shared between processes in this article).

Here's the core functionality we need:

- The calling thread will wait if the state is correct.
- The calling thread can specify an optional timeout.
- We get back a result indicating success (0) or an error (which error depends on why we didn't wait, but both the race condition and the timeout are valid reasons, as are interrupts).

> 🧵😴 Being explicit: if the return value is 0, that means your thread slept, and welcome back from dreamland!

On Linux, glibc doesn't wrap the futex system call, so we'll have to use its generic system call wrapper, syscall:

```c
#include <time.h>    // For struct timespec.

#if defined(__linux__)
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

static inline int
h4x0r_futex_wait_timespec(h4x0r_futex_t *futex, uint32_t expected,
                          struct timespec *timeout_ptr)
{
    int err = syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, expected,
                      timeout_ptr, NULL, 0);

    if (err == -1) {
        return errno;
    }

    return 0;
}
#endif
```

The syscall() wrapper doesn't return the specific error, just -1, passing the details via errno. That's why we check errno, and why our wrapper returns the error code (if any) directly.

Waking a thread up involves calling the same system call; we just pass the corresponding wake operation:

```c
#include <stdbool.h>
#include <limits.h>    // For INT_MAX.

#if defined(__linux__)
static inline int
h4x0r_futex_wake(h4x0r_futex_t *futex, bool all)
{
    uint32_t n = all ? INT_MAX : 1;

    return syscall(SYS_futex, futex, FUTEX_WAKE_PRIVATE, n, NULL, NULL, 0);
}
#endif
```

When waking, we do not pass an expected value for the futex (it's irrelevant), nor do we pass a timeout (this operation can't block), and the error gets returned directly. We do need to pass the number of threads to wake, but in our API we've simplified that down to a flag that toggles between one waiter and all waiters.

My macOS wrapper is apparently a bit out of date; I use the somewhat undocumented __ulock interface (its documentation is just some comments in the Darwin source code), but TIL that they added a new, simpler interface just last year (os_sync_wait_on_address). Still, for now, we'll go old school for better compatibility and build the same two operations:

```c
#if defined(__APPLE__)
#include <stdint.h>    // for uint64_t

// These were never exposed in any headers.
extern int __ulock_wait2(uint32_t, void *, uint64_t, uint64_t, uint64_t);
extern int __ulock_wake(uint32_t, void *, uint64_t);

#define H4X0R_NSEC_PER_SEC          1000000000
#define H4X0R_LOCK_COMPARE_AND_WAIT 1
#define H4X0R_LOCK_WAKE_ALL         0x00000100
#define H4X0R_LOCK_WAKE_THREAD      0x00000200
#define H4X0R_WAKE_ALL    (H4X0R_LOCK_COMPARE_AND_WAIT | H4X0R_LOCK_WAKE_ALL)
#define H4X0R_WAKE_THREAD (H4X0R_LOCK_COMPARE_AND_WAIT | H4X0R_LOCK_WAKE_THREAD)

static inline int
h4x0r_futex_wait_timespec(h4x0r_futex_t *futex, uint32_t expected,
                          struct timespec *timeout)
{
    uint64_t timeout_ns = 0;

    if (timeout) {
        timeout_ns = timeout->tv_nsec + timeout->tv_sec * H4X0R_NSEC_PER_SEC;
    }

    return __ulock_wait2(H4X0R_LOCK_COMPARE_AND_WAIT, futex,
                         (uint64_t)expected, timeout_ns, 0);
}

static inline int
h4x0r_futex_wake(h4x0r_futex_t *futex, bool all)
{
    return __ulock_wake(all ? H4X0R_WAKE_ALL : H4X0R_WAKE_THREAD, futex, 0ULL);
}
#endif
```

## The missing mutex primitive

The book doesn't really cover the mutex well, focusing more on the term "mutual exclusion", which gets its own chapter early on and is thrown around liberally in a chapter about spin locks. Sure, locks don't need to involve waiting, and the core of a mutex doesn't require waiting. So a spin lock is definitely a form of mutex.

The book does show spin locks, and they're not hard to build:

```c
#include <stdint.h>
#include <stdatomic.h>

typedef _Atomic uint32_t h4x0r_spin_lock_t;

static inline void
h4x0r_spin_lock_init(h4x0r_spin_lock_t *lock)
{
    atomic_store(lock, 0);
}

static inline void
h4x0r_spin_lock_acquire(h4x0r_spin_lock_t *lock)
{
    while (atomic_fetch_or(lock, 1))
        /* do nothing */;
}

static inline void
h4x0r_spin_lock_release(h4x0r_spin_lock_t *lock)
{
    atomic_store(lock, 0);
}
```

It might not be intuitive why the above lock works. The atomic_fetch_or() operation performs a bit-wise OR, but returns the value from before the operation began. As implied by the "function" name, the entire operation happens atomically, and in reality it won't be a proper function; it'll most likely inline to a tiny bit of assembly.

Let's say 100 threads are contending for the lock, and let's imagine no thread ever gives up the lock, either. We use the version of this API that guarantees sequential consistency. As a result, at runtime, all threads will essentially perform the OR, but only one will see 0 as the return value. That's the thread that gets the lock. Many people are surprised that any lock can be built with only a single bit, but there you go.

Still, our simple spin lock has some potential problems:

- It doesn't deal with heavy contention (i.e., it doesn't provide a way to give up the CPU by waiting).
- If a thread mistakenly calls our unlock operation on a lock it doesn't own, it'll unlock the lock 😱
- If a thread recursively tries to acquire this lock, it'll end up blocked, waiting for a lock it already holds: deadlock!

Some people wouldn't consider the last one a real problem, as we'll discuss later. But the other two are definitely worth addressing. Anyway, let's show how we can deal with all of these issues.
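Before we do, here's the spin lock in action: a quick, hypothetical pthreads demo (the worker function, thread count, and iteration count are all mine, not from the book). Without the lock around the increment, the final total would almost certainly come up short due to lost updates:

```c
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>

static h4x0r_spin_lock_t counter_lock;
static uint64_t          counter;    // Plain variable, protected by the lock.

static void *
spin_worker(void *arg)
{
    (void)arg;

    for (int i = 0; i < 100000; i++) {
        h4x0r_spin_lock_acquire(&counter_lock);
        counter++;                   // The critical section.
        h4x0r_spin_lock_release(&counter_lock);
    }

    return NULL;
}

int
main(void)
{
    pthread_t threads[8];

    h4x0r_spin_lock_init(&counter_lock);

    for (int i = 0; i < 8; i++) {
        pthread_create(&threads[i], NULL, spin_worker, NULL);
    }
    for (int i = 0; i < 8; i++) {
        pthread_join(threads[i], NULL);
    }

    printf("counter = %llu (expected 800000)\n", (unsigned long long)counter);

    return 0;
}
```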
## When and how to block

Most libraries will have separate APIs for spin locks and mutexes. Yet good mutex implementations do tend to start with a spin lock: try some number of times, and if every attempt fails, there's too much contention, so wait. If the lock is uncontended, the spin lock will succeed on the first try and will be pretty fast: no system call. With light contention and a short critical section, we may still get the lock without having to make the system call to wait.

One thing I wanted to see from the book was guidance on how long mutexes should spin for. Surely there's not a one-size-fits-all answer, especially as hardware platforms evolve. Clearly, the overhead of a system call, the number of tasks, and the duration of the critical section could all come into play, but does anyone have any metrics here? Personally, I tend to use a hard-coded 16 iterations, but with no strong reason. I suppose I probably saw someone else use it once, but also with no explanation.

When it comes to dealing with contention, the book covers exponential back-off: when you don't get the lock due to contention, you exponentially increase the time you wait before retrying, in an attempt to spread out contention (usually, exponential back-off caps at some maximum duration, which typically comes after 5–8 failed attempts). But what do they tell you to do?? Call sleep() for the backoff period. What if contention clears right as you put yourself to sleep? You wait anyway, even though you might have had the opportunity to take the lock uncontested had you kept spinning. And you might get really unlucky, with the contention coming back right as you're waking up.

By the way, blocking is so overlooked in this book that they don't even mention the word polling (I bought the e-book and searched extensively). This issue with using sleep() to block certainly isn't mentioned, even when sleep() is casually tossed into their code.

You probably have an inkling already: the futex is how modern locks avoid these problems. If a futex is available, exponential backoff (or any polling-based approach) makes little sense. With polling, wakes are essentially arbitrary guesses. With the futex, wakes are explicitly tied to the mutex becoming available.

When there's significant contention with a polling-based approach, if we don't use exponential back-off, we could end up with many threads continuing to wake up at the same time, just to contend with each other again. With a futex, we can wake up a single thread (which might still have to contend with new threads coming in).

Let's build it, even though we still won't be dealing with recursion or accidental unlocks. Given the risk of accidental unlock, we'll call this our "unsafe" mutex.

```c
#define H4X0R_SPIN_COUNT 16

typedef h4x0r_futex_t h4x0r_mutex_unsafe_t;

static inline void
h4x0r_mutex_unsafe_init(h4x0r_mutex_unsafe_t *lock)
{
    atomic_store(lock, 0);
}

static inline void
h4x0r_mutex_unsafe_acquire(h4x0r_mutex_unsafe_t *lock)
{
    for (uint32_t i = 0; i < H4X0R_SPIN_COUNT; i++) {
        if (!atomic_fetch_or(lock, 1)) {
            return;
        }
    }

    while (true) {
        h4x0r_futex_wait_timespec((h4x0r_futex_t *)lock, 1, NULL);

        if (!atomic_fetch_or(lock, 1)) {
            return;
        }
    }
}

static inline void
h4x0r_mutex_unsafe_release(h4x0r_mutex_unsafe_t *lock)
{
    atomic_store(lock, 0);
    h4x0r_futex_wake(lock, false);
}
```

Here, we spin for a bit, and if things are still contested, we wait on the futex. Whoever holds the lock will wake up to one waiter when it unlocks. For our simple use case, it doesn't really matter why we failed if we didn't end up waiting on the futex, so we just loop.

> ‼️ You are generally not guaranteed that threads will be woken in any specific order. Still, that's usually not worth worrying about.
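To make the "unsafe" part concrete, here's a hypothetical misuse that this mutex will happily accept (the thread functions are mine): thread B releases a lock that thread A is still holding, silently letting a third thread into the critical section.

```c
static h4x0r_mutex_unsafe_t lock;

static void *
thread_a(void *arg)
{
    (void)arg;

    h4x0r_mutex_unsafe_acquire(&lock);
    // ... a long critical section, still in progress ...
    return NULL;
}

static void *
thread_b(void *arg)
{
    (void)arg;

    // Bug: B never acquired the lock, but nothing stops this call. The lock
    // is now "free" while A still believes it holds it, so any other thread
    // can waltz into A's critical section.
    h4x0r_mutex_unsafe_release(&lock);
    return NULL;
}
```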
## Minimizing system calls

The futex can help us avoid a bunch of unnecessary wake-ups, and thus a lot of system calls. However, in many cases, mutex access will be totally uncontested, and we'll be making a system call to wake up… nobody. There are ways to address this pretty simply, allowing us to avoid most of those unnecessary wake-ups.

We're only using one bit of our futex, and it doesn't actually matter which bit. So we're going to move the bit up to the top, and use the rest as a wait counter. Threads will add to it before they first try to go to sleep, and subtract from it once they finally acquire the lock. When the thread holding the lock unlocks the mutex, we'll use atomic_fetch_and() to clear the mutex's lock flag while leaving all the other bits intact, and then look at whether the wait count is zero. If it is, we'll skip the wake-up.

This can still lead to some extra system calls:

- The mutex could get released after a thread added itself to the wait count, but before it actually put itself to sleep.
- Other threads may add themselves to the count and go to sleep after we've released, and after another thread has grabbed the lock.

The first case sounds scary, but it won't risk a deadlock: if the unlock happens before the wait, the waiting thread will present the wrong expected value, and the futex call will fail. The second case leads to a spurious wake-up. The woken thread checks the futex again, sees it's locked, and goes back to sleep; that's two extra system calls. Still, this approach avoids plenty of unnecessary wakes and system calls in practice.

Using this idea, let's build our second mutex. This version is also unsafe, because we're not yet dealing with ownership, so we'll name its calls with unsafe2. But we'll reuse the h4x0r_mutex_unsafe_t type, as well as any calls that don't change from the previous implementation.

```c
// We'll re-use these constants in our last mutex.
#define H4X0R_MUTEX_LOCK_ON  (1U << 31)
#define H4X0R_MUTEX_LOCK_OFF ~(H4X0R_MUTEX_LOCK_ON)

// This function will get reused too.
static inline bool
h4x0r_mutex_value_is_unlocked(uint32_t value)
{
    return !(value & H4X0R_MUTEX_LOCK_ON);
}

static inline bool
h4x0r_mutex_unsafe2_try_lock(h4x0r_mutex_unsafe_t *lock)
{
    uint32_t value = atomic_fetch_or(lock, H4X0R_MUTEX_LOCK_ON);

    return h4x0r_mutex_value_is_unlocked(value);
}

static inline uint32_t
h4x0r_mutex_unsafe2_add_waiter(h4x0r_mutex_unsafe_t *lock)
{
    return 1 + atomic_fetch_add(lock, 1);
}

static inline void
h4x0r_mutex_unsafe2_acquire(h4x0r_mutex_unsafe_t *lock)
{
    uint32_t expected;

    for (uint32_t i = 0; i < H4X0R_SPIN_COUNT; i++) {
        if (h4x0r_mutex_unsafe2_try_lock(lock)) {
            return;
        }
    }

    expected = h4x0r_mutex_unsafe2_add_waiter(lock);

    while (true) {
        if (h4x0r_mutex_value_is_unlocked(expected)
            && h4x0r_mutex_unsafe2_try_lock(lock)) {
            atomic_fetch_add(lock, -1);
            return;
        }

        h4x0r_futex_wait_timespec((h4x0r_futex_t *)lock, expected, NULL);
        expected = atomic_load(lock);
    }
}

static inline void
h4x0r_mutex_unsafe2_release(h4x0r_mutex_unsafe_t *lock)
{
    uint32_t waiters = atomic_fetch_and(lock, H4X0R_MUTEX_LOCK_OFF);

    if (waiters != H4X0R_MUTEX_LOCK_ON) {
        h4x0r_futex_wake(lock, false);
    }
}
```

We broke the lock test and the attempt to lock out into separate inline functions for clarity. We did the same for the function that registers us as a waiter, since it's a little more complicated than just atomically bumping the wait count. Much like atomic_fetch_or(), the function atomic_fetch_add() returns the value from BEFORE our modification. But we're going to have to tell the futex routine what value we think is there when we go to wait, so we also need to add one to the value that gets returned.

Notice that, when we do add to the wait count, it doesn't matter how many times we wake from the futex; we add ourselves only once, and remove ourselves only once we acquire the lock.

We definitely don't want to count ourselves when we test whether we should make the system call to wake a waiter. Here too, atomic_fetch_and() returns the value from before our operation is applied. So in this mutex, if there are no waiters at the point of our atomic operation, the value we get back will actually be H4X0R_MUTEX_LOCK_ON, even though we will have just set the mutex's value to 0. Here, we don't redo the operation; we just make the proper comparison.
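To make that comparison concrete, here's a hypothetical walk-through of h4x0r_mutex_unsafe2_release() with made-up futex values:

```c
// Case 1: two threads are waiting.
//   futex before release: 0x80000002   (lock bit set, wait count == 2)
//   atomic_fetch_and(lock, H4X0R_MUTEX_LOCK_OFF)
//     -> futex is now 0x00000002, and we get back the OLD value, 0x80000002
//   0x80000002 != H4X0R_MUTEX_LOCK_ON, so somebody is waiting:
//     call h4x0r_futex_wake().
//
// Case 2: nobody is waiting.
//   futex before release: 0x80000000   (lock bit set, wait count == 0)
//   atomic_fetch_and() -> futex is now 0, and the old value is exactly
//   H4X0R_MUTEX_LOCK_ON, so we skip the system call entirely.
```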
## Asserting Ownership

Dealing with ownership isn't too big a deal; we just need to keep track of who owns the lock, if anyone, and check it when it's safe to do so. We'll use pthread_t for identity, which we can directly store and compare, even though it has a slight issue: the underlying data structure is implementation-dependent. That means we can't 100% reliably track the owner with just a pthread_t value, because there's no portable value that says "not a thread." The easiest thing for us to do is take up a little more space and keep a flag that tracks whether the lock is owned at all, for when we check ownership.

Once we check ownership, our mutex is safe, so this will be our default mutex type (we'll modify this code to make a recursive mutex later). Here, then, is our new mutex definition and initializer:

```c
#include <pthread.h>

typedef struct {
    h4x0r_futex_t      futex;
    _Atomic(pthread_t) owner;
    _Atomic(bool)      owned;
} h4x0r_mutex_t;

static inline void
h4x0r_mutex_init(h4x0r_mutex_t *mutex)
{
    *mutex = (h4x0r_mutex_t) {
        .futex = 0,
        .owned = false,
    };
}
```

We've declared all of the fields atomic. Yes, we're only going to change them when we own the mutex, but we can't count on sequentially consistent views of those fields if they aren't atomic, even when they live in the same struct as fields that are. There are platforms (like ARM) where you're particularly at risk of issues if you're not really careful. When in doubt, go for full sequential consistency.

The atomics do force sequential consistency when we update those fields. Without that, we might end up with, for instance, thread A yielding the futex but its ownership flag still being set when thread B quickly grabs the lock.

In the rest of this implementation, we're going to take the same basic approach we did with the previous mutex. The significant changes (besides moving from a bare uint32_t to a data structure) are:

- When we unlock a mutex, we'll double-check that we own it first. If we do NOT, that's a fatal error, and we'll print a message and abort.
- We'll check for ownership when we acquire a lock. With this check, if there's an owner and we're not it, that's also a fatal error.
- We'll also add a timeout parameter to our acquire call, which we'll pass down to the futex if we block (the time we spend spinning isn't counted).

```c
#include <stdio.h>     // for fprintf()
#include <stdlib.h>    // for abort()

static inline bool
h4x0r_mutex_try_lock(h4x0r_mutex_t *lock)
{
    uint32_t  value = atomic_fetch_or(&lock->futex, H4X0R_MUTEX_LOCK_ON);
    pthread_t self;

    if (!h4x0r_mutex_value_is_unlocked(value)) {
        return false;
    }

    // If what we read when we wrote says "unlocked", then we
    // successfully acquired the lock.
    self = pthread_self();

    if (atomic_load(&lock->owned)) {
        if (pthread_equal(self, atomic_load(&lock->owner))) {
            fprintf(stderr, "Mutex was used recursively.\n");
        }
        else {
            fprintf(stderr, "Acquired a lock the owner didn't properly yield.\n");
        }
        abort();
    }

    // We have the lock, so we make it known.
    atomic_store(&lock->owned, true);
    atomic_store(&lock->owner, self);

    return true;
}

static inline uint32_t
h4x0r_mutex_add_waiter(h4x0r_mutex_t *lock)
{
    return 1 + atomic_fetch_add(&lock->futex, 1);
}

static inline bool
h4x0r_mutex_acquire(h4x0r_mutex_t *lock, struct timespec *timeout)
{
    uint32_t expected;

    for (uint32_t i = 0; i < H4X0R_SPIN_COUNT; i++) {
        if (h4x0r_mutex_try_lock(lock)) {
            return true;
        }
    }

    expected = h4x0r_mutex_add_waiter(lock);

    while (true) {
        if (h4x0r_mutex_value_is_unlocked(expected)
            && h4x0r_mutex_try_lock(lock)) {
            atomic_fetch_add(&lock->futex, -1);
            return true;
        }

        int err = h4x0r_futex_wait_timespec((h4x0r_futex_t *)&lock->futex,
                                            expected, timeout);
        if (err == ETIMEDOUT) {
            // We're giving up, so remove ourselves from the wait count.
            atomic_fetch_add(&lock->futex, -1);
            return false;
        }

        expected = atomic_load(&lock->futex);
    }
}

static inline void
h4x0r_mutex_release(h4x0r_mutex_t *lock)
{
    if (!atomic_load(&lock->owned)
        || !pthread_equal(pthread_self(), atomic_load(&lock->owner))) {
        fprintf(stderr, "Thread unlocked a mutex it doesn't own.\n");
        abort();
    }

    atomic_store(&lock->owned, false);

    uint32_t waiters = atomic_fetch_and(&lock->futex, H4X0R_MUTEX_LOCK_OFF);

    if (waiters != H4X0R_MUTEX_LOCK_ON) {
        h4x0r_futex_wake(&lock->futex, false);
    }
}
```

We implemented our locking capability on top of h4x0r_mutex_try_lock(), which makes a single attempt to claim the lock. If it succeeds, it then performs its validity checks, and if THOSE succeed, it records the ownership info. When we're releasing the lock, we test the boolean to see if there's an owner at all; if there is, we then ensure that we're that owner. A try_lock is a common API feature for mutexes, as an alternative to a timeout.
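Here's a quick, hypothetical usage example for the finished mutex, exercising the timeout path (the function, the one-second figure, and the global lock are all mine):

```c
#include <stdbool.h>
#include <time.h>

static h4x0r_mutex_t job_lock;    // Assume h4x0r_mutex_init() ran at startup.

static bool
try_to_do_job(void)
{
    // Give up if we end up blocked in the kernel for roughly a second.
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };

    if (!h4x0r_mutex_acquire(&job_lock, &timeout)) {
        return false;    // Timed out; somebody is hogging the lock.
    }

    // ... critical section ...

    h4x0r_mutex_release(&job_lock);

    return true;
}
```

One subtlety worth noting: since the timeout gets handed to each individual futex wait, a thread that keeps getting woken but losing the race to re-acquire can end up blocked for longer than the nominal timeout in total.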
## Does it feel like we're repeating ourselves?

There's a very good question the book doesn't seem to opine on: whether you should ever be using recursive mutexes at all. For mutexes, I tend to lean towards no, as they encourage sloppy programming. But I'll admit to being on the fence, and you're mature enough to make up your own minds. So let's look at what we'd have to do.

To make our previous lock recursive, we'll keep a field that tracks the level of nesting. To avoid deadlocking with ourselves, our lock functions will need to check whether we already own the lock before their first lock attempt. For that reason, the core try-lock operation moves to a call marked internal, specifically h4x0r_mutex_recursive_internal_try_lock(). The public h4x0r_mutex_recursive_try_lock() only needs to perform the ownership check, then make the internal call.

Here's what initialization looks like now:

```c
typedef struct {
    h4x0r_futex_t      futex;
    _Atomic(uint32_t)  depth;
    _Atomic(pthread_t) owner;
    _Atomic(bool)      owned;
} h4x0r_mutex_recursive_t;

static inline void
h4x0r_mutex_recursive_init(h4x0r_mutex_recursive_t *mutex)
{
    *mutex = (h4x0r_mutex_recursive_t) {
        .futex = 0,
        .depth = 0,
        .owned = false,
    };
}
```

That's no big deal. Since two different lock calls now need to check ownership, we'll break that out into its own helper. We need to check ownership before we try to lock the first time, and the helper will return true if the calling thread already owns the lock, and false if not.

So it will return false:

- if the mutex is not owned, or
- if it's locked, but by a thread other than the caller.

If it's neither of those two cases, then it's a recursive call by the owner. The helper bumps up the depth counter before returning true; the caller can immediately bail.

```c
static inline bool
h4x0r_mutex_recursive_check_ownership(h4x0r_mutex_recursive_t *lock,
                                      pthread_t self)
{
    if (!atomic_load(&lock->owned)) {
        return false;
    }

    if (pthread_equal(self, atomic_load(&lock->owner))) {
        atomic_fetch_add(&lock->depth, 1);
        return true;
    }

    return false;
}
```

If we didn't already own the lock, then once we acquire it, we need to assert our ownership. At that point, we can double-check that the lock is not already marked as owned. If it were, that would indicate a bug in our implementation, so it could make some sense to skip the check.

```c
// Called from our internal try-lock call, once it knows we definitely
// just acquired ownership.
static inline void
h4x0r_mutex_recursive_new_ownership(h4x0r_mutex_recursive_t *lock,
                                    pthread_t self)
{
    if (atomic_load(&lock->owned)) {
        fprintf(stderr, "Acquired a lock the owner didn't properly yield.\n");
        abort();
    }

    atomic_store(&lock->owned, true);
    atomic_store(&lock->owner, self);
    atomic_store(&lock->depth, 1);
}
```

When the user unlocks the mutex, we will also need to check ownership. And we'll need to distinguish the case where we're done nesting and should actually unlock from the case where we shouldn't:

```c
// Called when definitely holding the lock.
// Returns true if we are nested, and false if we aren't (proper unlock).
static inline bool
h4x0r_mutex_recursive_nesting_check(h4x0r_mutex_recursive_t *lock)
{
    if (!pthread_equal(pthread_self(), atomic_load(&lock->owner))) {
        fprintf(stderr, "Thread is trying to unlock a lock it doesn't own.\n");
        abort();
    }

    if (atomic_fetch_add(&lock->depth, -1) > 1) {
        return true;
    }

    atomic_store(&lock->owned, false);
    return false;
}
```

The rest of the recursive lock implementation is then straightforward, given our previous locks:

```c
static inline bool
h4x0r_mutex_recursive_internal_try_lock(h4x0r_mutex_recursive_t *lock,
                                        pthread_t self)
{
    uint32_t value  = atomic_fetch_or(&lock->futex, H4X0R_MUTEX_LOCK_ON);
    bool     result = h4x0r_mutex_value_is_unlocked(value);

    if (result) {
        h4x0r_mutex_recursive_new_ownership(lock, self);
    }

    return result;
}

static inline uint32_t
h4x0r_mutex_recursive_add_waiter(h4x0r_mutex_recursive_t *lock)
{
    return 1 + atomic_fetch_add(&lock->futex, 1);
}

static inline bool
h4x0r_mutex_recursive_acquire(h4x0r_mutex_recursive_t *lock,
                              struct timespec *timeout)
{
    uint32_t  expected;
    pthread_t self = pthread_self();

    if (h4x0r_mutex_recursive_check_ownership(lock, self)) {
        // We already owned the lock, and incremented the nesting count.
        return true;
    }

    for (uint32_t i = 0; i < H4X0R_SPIN_COUNT; i++) {
        if (h4x0r_mutex_recursive_internal_try_lock(lock, self)) {
            // internal_try_lock will set up ownership.
            return true;
        }
    }

    expected = h4x0r_mutex_recursive_add_waiter(lock);

    while (true) {
        if (h4x0r_mutex_value_is_unlocked(expected)
            && h4x0r_mutex_recursive_internal_try_lock(lock, self)) {
            atomic_fetch_add(&lock->futex, -1);
            return true;
        }

        int err = h4x0r_futex_wait_timespec((h4x0r_futex_t *)&lock->futex,
                                            expected, timeout);
        if (err == ETIMEDOUT) {
            // We're giving up, so remove ourselves from the wait count.
            atomic_fetch_add(&lock->futex, -1);
            return false;
        }

        expected = atomic_load(&lock->futex);
    }
}

static inline bool
h4x0r_mutex_recursive_try_lock(h4x0r_mutex_recursive_t *lock)
{
    pthread_t self = pthread_self();

    if (h4x0r_mutex_recursive_check_ownership(lock, self)) {
        // We already owned the lock, and incremented the nesting count.
        return true;
    }

    return h4x0r_mutex_recursive_internal_try_lock(lock, self);
}

static inline void
h4x0r_mutex_recursive_release(h4x0r_mutex_recursive_t *lock)
{
    if (h4x0r_mutex_recursive_nesting_check(lock)) {
        // We were nested, and the decrement happened, so we're done.
        return;
    }

    uint32_t waiters = atomic_fetch_and(&lock->futex, H4X0R_MUTEX_LOCK_OFF);

    if (waiters != H4X0R_MUTEX_LOCK_ON) {
        h4x0r_futex_wake(&lock->futex, false);
    }
}
```

## Remaining issues

The core issue we have left is: what happens when a thread exits or dies while it holds a mutex?? It's a bit much for us to cover today, but here's a sketch of what we'd need to do:

- Each thread needs a private list of the locks it is actively holding, which we would maintain in the calls above.
- In most thread APIs, we can register a callback for when a thread exits (or is canceled). We'd need to register a handler for this, probably once per thread, when it's launched.
- That callback would need to go through the list and break any locks still held.

Many people won't worry about crashed threads, since they often crash the whole program anyway. However, you can catch the signal a crash generates and keep the overall process from terminating. On Linux, you can compare your view of which threads are live with the OS's view in /proc. On other platforms, you might not be able to identify the exact thread that crashed; but if you're managing the threads you launch, you will probably have a way to give yourself visibility into which threads stop checking in over a very short window.

Another related issue comes up once you allow mutexes to span processes via shared memory. The same basic code all works fine, modulo some changes to how you use the futex. The problem is: what happens when a process holding a lock forks?? That's also an issue you can handle if you're managing all the locks.
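For completeness, the standard POSIX hook for the fork problem is pthread_atfork(). Here's a minimal, hypothetical sketch that assumes a single global lock just for illustration; a real implementation would walk whatever registry of locks it keeps:

```c
#include <pthread.h>

static h4x0r_mutex_t global_lock;    // Stand-in for "every lock we manage".

static void
atfork_prepare(void)
{
    // Take the lock before fork() so the child can't inherit it mid-update.
    h4x0r_mutex_acquire(&global_lock, NULL);
}

static void
atfork_parent(void)
{
    h4x0r_mutex_release(&global_lock);
}

static void
atfork_child(void)
{
    // The child starts with exactly one thread; reset the lock to a known state.
    h4x0r_mutex_init(&global_lock);
}

static void
install_fork_handlers(void)
{
    pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
}
```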
But is it a problem mentioned in the book? I can't find it, after skimming over the 31 uses of the word fork (outside the context of cutlery 🍴, which gets one use). Those uses are spread across a mere 9 pages, and almost all of them are specifically about algorithms leveraging Java's ForkJoin class, which is a thread-pooling scheme within a process, not a proper POSIX fork().

## Reading / Writing The Art / Engineering of Multiprocessor Programming

The book being named with the word "Art" is apt, because it's not really about engineering best practices. To be fair, the book does mention that some locking primitives are lock-aware and some aren't. It also shows how to build a recursive lock using a non-recursive lock (in Java, of course). But it does so using a condition variable, which is horrible. If you just care about engineering best practices, you should already have easy access to a reentrant mutex (pthreads provides one). I'd expect that to be more performant than a condition variable, which itself needs a mutex; the condition variable just over-complicates the implementation. And if you're trying to learn how things work under the hood, the condition variable isn't giving you what you need to know. What you need to know is the futex, full stop.

Also, the book doesn't seem to opine on recursive locks. It just shows them. I think the conventional wisdom on using them in mutexes is important to understand. And I also think it becomes a different discussion when you're talking about other concurrency primitives. For instance, with reader-writer locks (RW locks), I think recursive locking is really appropriate, just incredibly hard to get right.

If you've never used an RW lock: they allow any number of readers to hold the lock for reading, all at the same time. If a writer is waiting to lock, it waits for all readers to clear (and in sane implementations, new readers cannot jump in ahead of a waiting writer). But writers, once they do get the lock, get exclusive access, so they can edit without fear of readers seeing an inconsistent state.

In concept, that's a great primitive to use for things like dynamic lists. If there's no mutation happening, but there could be lots of parallel reading, reads end up pretty cheap, because they only have to stop for writes.

Consider an API call like:

```c
void h4x0r_list_append_all_contents(h4x0r_list_t *dst, h4x0r_list_t *items);
```

The idea being that we combine the two lists by copying the contents of the items list onto the end of the dst list. Now, we clearly want to grab a write lock on dst, and for items, we definitely want a read lock. But what if we were given that API and wanted to double a list, by doing:

```c
h4x0r_list_append_all_contents(l1, l1);
```

I've seen APIs like this, where the same item is routinely passed through a nested set of construction operations that can mix reading and mutation. So you might need reads to nest, you might need writes to nest, all for the same item, in the same chain of calls.

We'd have no problem building an API that handles such things in a program without concurrency, as long as we anticipate the need and fix the number of items to copy before we start. But in a multi-threaded app, if we don't allow nested locks, we now also have to anticipate every situation where it might make sense for a list to be passed in twice, and then add code to explicitly test for it.

So it'd be great to have solid semantics for RW lock nesting. But unfortunately, the POSIX standard leaves recursive locking of rwlocks undefined, which means that even if a conformant library happens to provide such recursion, you definitely shouldn't rely on it. And in practice, behavior across common implementations is not remotely consistent.
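To make the nesting hazard concrete, here's roughly what that self-append looks like with a stock pthread_rwlock_t embedded in a hypothetical list type (the type, fields, and function body are mine):

```c
#include <pthread.h>

typedef struct {
    pthread_rwlock_t lock;
    // ... list contents ...
} h4x0r_list_t;

void
h4x0r_list_append_all_contents(h4x0r_list_t *dst, h4x0r_list_t *items)
{
    pthread_rwlock_wrlock(&dst->lock);

    // If items == dst, this asks for a read lock while the same thread
    // already holds the write lock. POSIX says the behavior is undefined;
    // in practice, expect a deadlock (or worse).
    pthread_rwlock_rdlock(&items->lock);

    // ... copy items onto the end of dst ...

    pthread_rwlock_unlock(&items->lock);
    pthread_rwlock_unlock(&dst->lock);
}
```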
There's a good reason this was left undefined: it's kind of hard. Since multiple threads can hold the lock at once, each thread needs its own read-nesting level. And then how do you deal with writes that might be intermingled in there? How do you keep the accounting correct so that you don't, for instance, drop write access too early?

These problems are solvable, though (and at some point soon, I'll share my RW lock). But in The Art of Multiprocessor Programming, I can't find any mention of such issues: neither a warning about the semantic challenges that can surprise you, nor a demonstration of how such things might be handled. That's not a surprise, since they don't even mention the futex.

Nor do they seem to cover async runtimes, despite their popularity (perhaps a bit more excusable, since those are generally multiplexing a single thread, but it still feels too important not to cover well).

## RTFC

The full source code is available on codeberg.

## 🏁 In conclusion 🏁

I have plenty of other problems with this book, but they'd each probably need their own rant, and I'm not going to spend the time. In short, this isn't so much a computer science textbook as it is a history book.

To other CS academics writing textbooks today, whether you're focused on theory or practice: please make sure you at least acknowledge the important concepts that were around when you were writing the book! And ideally, make sure you cover some content from the current millennium. If the only significant thing indicating the book was written (or edited) recently is the copyright date, consider that you might be doing a disservice with your book.

kthxbai,

Lee T, expert waiter 💁🏻‍♂️ (with the finest threads 🧵)