I occasionally interview candidates for engineering roles. We need people who understand concurrent programming. One of our favorite questions involves keeping track of a maximum value across multiple producer threads - a classic pattern that appears in many real-world systems. Candidates can use any language they want.

In Java (the language I know best), you might write a CAS loop, or if you're feeling functional, use `updateAndGet()` with a lambda:

```java
AtomicLong highScore = new AtomicLong(100);
// [...]
highScore.updateAndGet(current -> Math.max(current, newScore));
```

But that lambda is doing work - it's still looping under the hood, retrying if another thread interferes. You can see the loop right in `AtomicLong`'s source code.

Then one candidate chose Rust. I was following along as he started typing, expecting to see either an explicit CAS loop or some functional wrapper around one. But instead, he just wrote:

```rust
high_score.fetch_max(new_score, Ordering::Relaxed);
```

"Rust has `fetch_max` built in," he explained casually, moving on to the next part of the problem.

Hold on. This wasn't a wrapper around a loop pattern - this was a first-class atomic operation, sitting right there next to `fetch_add` and `fetch_or`. Java doesn't have this. C++ doesn't have this. How could Rust just... have this?

After the interview, curiosity got the better of me. Why would Rust provide `fetch_max` as a built-in intrinsic? Intrinsics usually exist to leverage specific hardware instructions. But x86-64 doesn't have an atomic max instruction. So there had to be a CAS loop somewhere in the pipeline. Unless... maybe some architectures do have this instruction natively? And if so, how does the same Rust code work on both?

I had to find out. Was the loop in Rust's standard library? Was it in LLVM? Was it generated during code generation for x86-64? So I started digging. What I found was a fascinating journey through five distinct layers of compiler transformations, each one peeling back another level of abstraction, until I found exactly where that loop materialized. Let me share what I discovered.

Let's start with what that candidate wrote - a simple high score tracker that can be safely updated from multiple threads:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let high_score = AtomicU64::new(100);
    // [...]

    // Another thread reports a new score of 200
    let _old_score = high_score.fetch_max(200, Ordering::Relaxed);
    // [...]
}
```

Save this snippet as `main.rs`; we are going to use it later.

This single line does exactly what it promises: atomically fetches the current value, compares it with the new one, updates it if the new value is greater, and returns the old value. It's safe, concise, and impossible to mess up. No explicit loops, no retry logic visible anywhere.
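To appreciate what that one line hides, here is roughly what we would have to write by hand without it. This is my own minimal sketch using `compare_exchange_weak`, not how the standard library implements `fetch_max`:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// A hand-rolled fetch_max: the classic compare-and-swap retry loop.
fn fetch_max_by_hand(atom: &AtomicU64, new: u64) -> u64 {
    let mut expected = atom.load(Ordering::Relaxed); // seed the expected value
    loop {
        let desired = expected.max(new);
        match atom.compare_exchange_weak(expected, desired, Ordering::Relaxed, Ordering::Relaxed) {
            Ok(old) => return old,                // swap happened; return the old value
            Err(observed) => expected = observed, // another thread interfered; retry
        }
    }
}

fn main() {
    let high_score = AtomicU64::new(100);
    assert_eq!(fetch_max_by_hand(&high_score, 200), 100);
    assert_eq!(high_score.load(Ordering::Relaxed), 200);
}
```

Keep this shape in mind - we will meet it again further down the pipeline, in forms we did not write ourselves.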
But how does `fetch_max` actually work under the hood?

Before our `fetch_max` call even reaches anywhere close to machine code generation, there's another layer of abstraction at work. The `fetch_max` method isn't hand-written for each atomic type - it's generated by a Rust macro called `atomic_int!`. If we peek into Rust's standard library source code, we find that `AtomicU64` and all its methods are actually created by this macro:

```rust
atomic_int! {
    cfg(target_has_atomic = "64"),
    // ... various configuration attributes ...
    atomic_umin, atomic_umax, // The intrinsics to use
    8, // Alignment
    u64 AtomicU64 // The type to generate
}
```

Inside this macro, `fetch_max` is defined as a template that works for any integer type:

```rust
pub fn fetch_max(&self, val: $int_type, order: Ordering) -> $int_type {
    // SAFETY: data races are prevented by atomic intrinsics.
    unsafe { $max_fn(self.v.get(), val, order) }
}
```

The `$max_fn` placeholder gets replaced with `atomic_umax` for unsigned types and `atomic_max` for signed types. This single macro definition generates `fetch_max` methods for `AtomicI8`, `AtomicU8`, `AtomicI16`, `AtomicU16`, and so on - all the way up to `AtomicU128`.
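If you haven't worked with declarative macros, here is a toy sketch of the same code-generation pattern. It is deliberately simplified (a plain function, nothing atomic) and is my own example, not the real `atomic_int!` macro:

```rust
// One template, stamped out for many integer types.
macro_rules! gen_max_fn {
    ($name:ident, $int:ty) => {
        fn $name(a: $int, b: $int) -> $int {
            if a > b { a } else { b }
        }
    };
}

gen_max_fn!(max_u8, u8);
gen_max_fn!(max_u64, u64);

fn main() {
    assert_eq!(max_u8(3, 7), 7);
    assert_eq!(max_u64(100, 200), 200);
}
```

`atomic_int!` does exactly this, just with a much larger template: one definition, a dozen concrete atomic types.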
So our simple `fetch_max` call is actually invoking generated code. But what does the `atomic_umax` function actually do? To answer that, we need to see what the Rust compiler produces next.

Now that we know `fetch_max` is macro-generated code calling `atomic_umax`, let's see what happens when the Rust compiler processes it. The compiler doesn't go straight to assembly. First, it translates the code into an intermediate representation. Rust uses the LLVM compiler project, so it generates LLVM Intermediate Representation (IR). If we peek at the LLVM IR for our `fetch_max` call, we see something like this:

```llvm
; Before the transformation
bb7:
  %0 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  ...
```

This is LLVM's language for saying: "I need an atomic read-modify-write operation. The modification I want to perform is an unsigned maximum." This is a powerful, high-level instruction within the compiler itself. But it poses a critical question: does the CPU actually have a single instruction called `umax`? For most architectures, the answer is no. So how does the compiler bridge this gap?

My goal is not to merely describe what is happening, but to give you the tools to see it for yourself. You can trace this transformation step-by-step on your own machine. First, tell the Rust compiler to stop after generating the LLVM IR:

```bash
rustc --emit=llvm-ir main.rs
```

This creates a `main.ll` file containing the LLVM IR representation of your Rust code, including our `atomicrmw umax` instruction. Keep the file around; we'll use it in the next steps.

We're still missing something important, though. How does the Rust function `atomic_umax` actually become the LLVM instruction `atomicrmw umax`? This is where compiler intrinsics come into play. If you dig into Rust's source code, you'll find that `atomic_umax` is defined like this:

```rust
/// Updates `*dst` to the max value of `val` and the old value (unsigned comparison)
#[inline]
#[cfg(target_has_atomic)]
#[cfg_attr(miri, track_caller)] // even without panics, this helps for Miri backtraces
unsafe fn atomic_umax<T: Copy>(dst: *mut T, val: T, order: Ordering) -> T {
    // SAFETY: the caller must uphold the safety contract for `atomic_umax`
    unsafe {
        match order {
            Relaxed => intrinsics::atomic_umax::<T, { AtomicOrdering::Relaxed }>(dst, val),
            Acquire => intrinsics::atomic_umax::<T, { AtomicOrdering::Acquire }>(dst, val),
            Release => intrinsics::atomic_umax::<T, { AtomicOrdering::Release }>(dst, val),
            AcqRel => intrinsics::atomic_umax::<T, { AtomicOrdering::AcqRel }>(dst, val),
            SeqCst => intrinsics::atomic_umax::<T, { AtomicOrdering::SeqCst }>(dst, val),
        }
    }
}
```

But what is this `intrinsics::atomic_umax` function? If you look at its definition, you find something slightly unusual:

```rust
/// Maximum with the current value using an unsigned comparison.
/// `T` must be an unsigned integer type.
///
/// The stabilized version of this intrinsic is available on the
/// [`atomic`] unsigned integer types via the `fetch_max` method. For example, [`AtomicU32::fetch_max`].
#[rustc_intrinsic]
#[rustc_nounwind]
pub unsafe fn atomic_umax<T: Copy, const ORD: AtomicOrdering>(dst: *mut T, src: T) -> T;
```

There is no body. This is a declaration, not a definition. The `#[rustc_intrinsic]` attribute tells the Rust compiler that this function maps directly to a low-level operation understood by the compiler itself. When the Rust compiler sees a call to `intrinsics::atomic_umax`, it knows to replace it with the corresponding LLVM instruction. So our journey actually looks like this:

1. `fetch_max` method (user-facing API)
2. Macro expands to a call to the `atomic_umax` function
3. `atomic_umax` dispatches to a compiler intrinsic
4. Rustc replaces the intrinsic with LLVM's `atomicrmw umax` ← We are here
5. LLVM processes this instruction...

LLVM runs a series of "passes" that analyze and transform the code. The one we're interested in is called the `AtomicExpandPass`. Its job is to look at high-level atomic operations like `atomicrmw umax` and ask the target architecture, "Can you do this natively?" When the x86-64 backend says "No, I can't," this pass expands the single instruction into a sequence of more fundamental ones that the CPU does understand. The result is a compare-and-swap (CAS) loop.

We can see this transformation in action by asking LLVM to emit the intermediate representation before and after this pass. To see the IR before the `AtomicExpandPass`, run:

```bash
llc -print-before=atomic-expand main.ll -o /dev/null
```

Tip: If you do not have `llc` installed, you can ask rustc to run the pass for you directly:

```bash
rustc -C llvm-args="-print-before=atomic-expand -print-after=atomic-expand" main.rs
```

The IR will be printed to your terminal. The function containing our atomic max looks like this:

```llvm
*** IR Dump Before Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                              ; preds = %start
  unreachable

bb7:                                              ; preds = %start
  %2 = atomicrmw umax ptr %self, i64 %val monotonic, align 8
  store i64 %2, ptr %_0, align 8
  br label %bb1

bb5:                                              ; preds = %start
  %3 = atomicrmw umax ptr %self, i64 %val release, align 8
  store i64 %3, ptr %_0, align 8
  br label %bb1

bb6:                                              ; preds = %start
  %4 = atomicrmw umax ptr %self, i64 %val acquire, align 8
  store i64 %4, ptr %_0, align 8
  br label %bb1

bb4:                                              ; preds = %start
  %5 = atomicrmw umax ptr %self, i64 %val acq_rel, align 8
  store i64 %5, ptr %_0, align 8
  br label %bb1

bb3:                                              ; preds = %start
  %6 = atomicrmw umax ptr %self, i64 %val seq_cst, align 8
  store i64 %6, ptr %_0, align 8
  br label %bb1

bb1:                                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}
```

You can see the `atomicrmw umax` instruction in multiple places, one for each memory ordering. This is the high-level atomic operation that the compiler backend understands, but the CPU does not.
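Why does the IR dispatch over all five orderings at runtime? Because in this unoptimized build, `fetch_max` is compiled as a standalone function and the ordering arrives as an ordinary runtime argument. Here's my own small illustration of the same situation (not code from the dump):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// `order` is a runtime value here, so without inlining the compiled
// function must handle all five orderings - hence the switch above.
fn report_score(high_score: &AtomicU64, score: u64, order: Ordering) -> u64 {
    high_score.fetch_max(score, order)
}

fn main() {
    let high_score = AtomicU64::new(100);
    report_score(&high_score, 200, Ordering::Relaxed);
    assert_eq!(high_score.load(Ordering::Relaxed), 200);
}
```

With optimizations enabled, the call gets inlined, the constant `Ordering::Relaxed` propagates, and the whole dispatch folds away, leaving only the `monotonic` path.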
Now let's run the same command, but print the IR after the pass:

```bash
llc -print-after=atomic-expand main.ll -o /dev/null
```

This is the relevant part of the output:

```llvm
*** IR Dump After Expand Atomic instructions (atomic-expand) ***
; Function Attrs: inlinehint nonlazybind uwtable
define internal i64 @_ZN4core4sync6atomic9AtomicU649fetch_max17h6c42d6f2fc1a6124E(ptr align 8 %self, i64 %val, i8 %0) unnamed_addr #1 {
start:
  %_0 = alloca [8 x i8], align 8
  %order = alloca [1 x i8], align 1
  store i8 %0, ptr %order, align 1
  %1 = load i8, ptr %order, align 1
  %_7 = zext i8 %1 to i64
  switch i64 %_7, label %bb2 [
    i64 0, label %bb7
    i64 1, label %bb5
    i64 2, label %bb6
    i64 3, label %bb4
    i64 4, label %bb3
  ]

bb2:                                              ; preds = %start
  unreachable

bb7:                                              ; preds = %start
  %2 = load i64, ptr %self, align 8               ; seed expected value
  br label %atomicrmw.start                       ; enter CAS loop

atomicrmw.start:                                  ; preds = %atomicrmw.start, %bb7
  ; on first iteration: use %2, on retries: use value observed by last cmpxchg
  %loaded = phi i64 [ %2, %bb7 ], [ %newloaded, %atomicrmw.start ]
  %3 = icmp ugt i64 %loaded, %val                 ; unsigned compare (umax semantics)
  %new = select i1 %3, i64 %loaded, i64 %val      ; desired = max(loaded, val)
  ; CAS: if *self == loaded, store new
  %4 = cmpxchg ptr %self, i64 %loaded, i64 %new monotonic monotonic, align 8
  %success = extractvalue { i64, i1 } %4, 1       ; boolean: whether the swap happened
  %newloaded = extractvalue { i64, i1 } %4, 0     ; value seen in memory before the CAS
  br i1 %success, label %atomicrmw.end, label %atomicrmw.start ; loop until CAS succeeds

atomicrmw.end:                                    ; preds = %atomicrmw.start
  store i64 %newloaded, ptr %_0, align 8
  br label %bb1

[... MORE OF THE SAME, JUST FOR DIFFERENT ORDERINGS ...]

bb1:                                              ; preds = %bb3, %bb4, %bb6, %bb5, %bb7
  %7 = load i64, ptr %_0, align 8
  ret i64 %7
}
```

We can see the pass did not change the first part - it still has the code to dispatch based on the memory ordering. But in the `bb7` block, where we originally had the `atomicrmw umax` LLVM instruction, we now see a full compare-and-swap loop. A compiler engineer would say that the `atomicrmw umax` instruction has been "lowered" into a sequence of more primitive operations that are closer to what the hardware can actually execute. Here's the simplified logic:

1. **Read (seed):** grab the current value (`expected`).
2. **Compute:** `desired = umax(expected, val)`.
3. **Attempt:** `observed, success = cmpxchg(ptr, expected, desired, [...])`.
4. If `success`, return `observed` (the old value). Otherwise set `expected = observed` and loop.

This CAS loop is a fundamental pattern in lock-free programming. The compiler just built it for us automatically.
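If the `{ i64, i1 }` pair that `cmpxchg` produces looks abstract, stable Rust exposes exactly the same two pieces of information - the observed value and the success flag - through `compare_exchange`, just packaged as a `Result`. A quick demonstration (my own snippet, using nothing beyond the standard library):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

fn main() {
    let x = AtomicU64::new(5);

    // Success: memory held 5, so 9 is stored.
    // LLVM's cmpxchg would produce { 5, true }.
    assert_eq!(x.compare_exchange(5, 9, Ordering::Relaxed, Ordering::Relaxed), Ok(5));

    // Failure: memory now holds 9, not 5, so nothing is stored.
    // LLVM's cmpxchg would produce { 9, false }.
    assert_eq!(x.compare_exchange(5, 7, Ordering::Relaxed, Ordering::Relaxed), Err(9));
}
```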
We're at the final step. To see the final machine code, you can tell rustc to emit the assembly directly:

```bash
rustc --emit=asm main.rs
```

This will produce a `main.s` file containing the final assembly code. Inside, you'll find the result of the cmpxchg loop:

```asm
.LBB8_2:
        movq    -32(%rsp), %rax       # rax = &self
        movq    (%rax), %rax          # rax = *self (seed 'expected')
        movq    %rax, -48(%rsp)       # spill expected to stack
.LBB8_3:                              # loop head
        movq    -48(%rsp), %rax       # rax = expected
        movq    -32(%rsp), %rcx       # rcx = &self
        movq    -40(%rsp), %rdx       # rdx = val
        movq    %rax, %rsi            # rsi = expected (scratch)
        subq    %rdx, %rsi            # set flags for unsigned compare: expected - val
        cmovaq  %rax, %rdx            # if (expected > val) rdx = expected; else rdx = val (compute max)
        lock cmpxchgq %rdx, (%rcx)    # CAS: if *rcx==rax then *rcx=rdx; rax <- old *rcx; ZF=success
        sete    %cl                   # cl = success
        movq    %rax, -56(%rsp)       # spill observed to stack
        testb   $1, %cl               # branch on success
        movq    %rax, -48(%rsp)       # expected = observed (for retry)
        jne     .LBB8_4               # success -> exit
        jmp     .LBB8_3               # failure -> retry
```

The syntax might look a bit different from what you're used to; that's because it's AT&T syntax, which is the default for rustc. If you prefer Intel syntax, you can use `rustc --emit=asm main.rs -C "llvm-args=-x86-asm-syntax=intel"` to get that.

I'm not an assembly expert, but you can see the key parts of the CAS loop here:

- **Seed read (first iteration):** Load `*self` once to initialize the expected value.
- **Compute umax without branching:** The pair `sub` + `cmova` implements `desired = max_u(expected, val)`.
- **CAS operation:** On x86-64, `cmpxchg` uses `RAX` as the expected value and returns the observed value in `RAX`; `ZF` encodes success.
- **Retry or finish:** If `ZF` is clear, we failed and need to retry. Otherwise, we are done.

Note we did not ask rustc to optimize the code. If we did, the compiler would generate more efficient assembly: no spills to the stack, fewer jumps, no dispatch on memory ordering, and so on. But I wanted to keep the output as close to the original IR as possible to make it easier to follow.
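If you want to see the optimized version yourself, recompile the same `main.rs` with optimizations enabled (`-O` is rustc's shorthand for `-C opt-level=2`):

```bash
rustc -O --emit=asm main.rs
```

One caveat from my own experiments: the optimizer is aggressive, and since our toy `high_score` never leaves `main`, it may inline the call or elide the atomic entirely. A slightly more realistic program, for example one that actually spawns threads, keeps the loop visible.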
And there we have it. Our journey is complete. We started with a safe, clear, single line of Rust and ended with a CAS loop written in assembly language:

Rust `fetch_max` → Macro-generated `atomic_umax` → LLVM `atomicrmw umax` → LLVM `cmpxchg` loop → Assembly `lock cmpxchg` loop

This journey is a perfect example of the power of modern compilers. We get to work at a high level of abstraction, focusing on safety and logic, while the compiler handles the messy, error-prone, and incredibly complex task of generating correct and efficient code for the hardware. So, next time you use an atomic, take a moment to appreciate the incredible, hidden journey your code is about to take.

PS: After conducting this journey I learned that C++26 adds `fetch_max` too!

PPS: We are hiring!

Out of curiosity, I also checked how this looks on Apple Silicon (AArch64). This architecture does have a native atomic max instruction, so the `AtomicExpandPass` does not need to lower it into a CAS loop. The LLVM IR before and after the pass is identical, still containing the `atomicrmw umax` instruction. The final assembly contains a variant of the `LDUMAX` instruction. This is the relevant part of the assembly:

```asm
ldr     x8, [sp, #16]     # x8 = value to compare with
ldr     x9, [sp, #8]      # x9 = pointer to the atomic variable
ldumax  x8, x8, [x9]      # atomic unsigned max (relaxed): [x9] = max(x8, [x9]), x8 = old value
str     x8, [sp, #40]     # store old value
b       LBB8_11
```

Note that AArch64 uses the Unified Assembler Language; when reading the snippet above, it's important to remember that the destination register comes first.

And that's really it. We could continue digging into the microarchitecture - how instructions are executed at the hardware level, what the effects of the `LOCK` prefix are, how memory orderings differ between architectures, and so on. But we'll leave that for another day.