
Beyond OpenMP in C++ and Rust: Taskflow, Rayon, Fork Union

TL;DR: Most C++ and Rust thread-pool libraries leave significant performance on the table - often running 10× slower than OpenMP on classic fork-join workloads and micro-benchmarks. So I’ve drafted a minimal ~300-line library called Fork Union that lands within 20% of OpenMP. It does not use advanced NUMA tricks; it uses only the C++ and Rust standard libraries and has no other dependencies.

OpenMP has been the industry workhorse for coarse-grain parallelism in C and C++ for decades. I lean on it heavily in projects like USearch, yet I avoid it in larger systems because:

Fine-grain parallelism with independent subsystems doesn’t map cleanly to OpenMP’s global runtime.

Portability of the C++ STL and the Rust standard library is better than OpenMP’s.

Meta-programming with OpenMP is a pain - mixing #pragma omp with templates quickly becomes unmaintainable, as the sketch after this list shows.
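
To make that last point concrete, here is a minimal sketch - a hypothetical sum_all helper, not from any of the libraries discussed here. The pragma’s text can’t depend on template parameters, so even toggling parallelism at compile time forces you to duplicate the loop body:

```cpp
#include <cstddef>

// Hypothetical illustration: `#pragma omp` can't be emitted conditionally
// on a template parameter, so each `if constexpr` branch must repeat the
// same loop - and this duplication grows with every tuning knob.
template <bool parallel_k, typename scalar_t>
scalar_t sum_all(scalar_t const *data, std::size_t count) {
    scalar_t result = 0;
    if constexpr (parallel_k) {
#pragma omp parallel for reduction(+:result)
        for (std::size_t i = 0; i < count; ++i) result += data[i];
    } else {
        for (std::size_t i = 0; i < count; ++i) result += data[i];
    }
    return result;
}
```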

So I went looking for ready-made thread pools in C++ and Rust — only to realize most of them implement asynchronous task queues, a much heavier abstraction than OpenMP’s fork-join model. Those extra layers introduce what I call the four horsemen of low performance:

Locks & mutexes with syscalls in the hot path.

Heap allocations in queues, tasks, futures, and promises.

Compare-and-swap (CAS) stalls in the pessimistic path.

False sharing - unaligned counters thrashing cache lines (sketched below).
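
To illustrate the last horseman, a minimal sketch with hypothetical struct names, assuming 64-byte cache lines: two logically independent atomic counters that land on the same cache line will ping-pong that line between cores, while padding each counter to its own line removes the contention.

```cpp
#include <atomic>
#include <cstddef>

// Both counters fit in one 64-byte cache line: every increment by one
// thread invalidates the line in every other thread's cache, even though
// the two counters are logically independent.
struct contended_counters_t {
    std::atomic<std::size_t> produced {0};
    std::atomic<std::size_t> consumed {0};
};

// Aligning each counter to its own cache line (64 bytes assumed here)
// avoids the false sharing, at the cost of some padding bytes.
struct padded_counters_t {
    alignas(64) std::atomic<std::size_t> produced {0};
    alignas(64) std::atomic<std::size_t> consumed {0};
};
```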

With today’s dual-socket AWS machines pushing 192 physical cores, I needed something leaner than Taskflow, Rayon, or Tokio. Enter Fork Union.

Hardware: AWS Graviton 4 metal (single NUMA node, 96× Arm v9 cores, 1 thread/core).

Workload: “ParallelReductionsBenchmark” - summing single-precision floats in parallel. In this case, just one cache line (float[16], i.e., 16 × 4 bytes = 64 bytes) per core - small enough to stress the synchronization cost of the thread pool rather than the arithmetic throughput of the CPU. In other words, we are benchmarking kernels similar to:

```cpp
#include <array>
#include <cstddef>

float parallel_sum(std::array<float, 96 * 16> const &data) {
    float result = 0.0f;
#pragma omp parallel for reduction(+:result) // Not how we profile OpenMP
    for (std::size_t i = 0; i < data.size(); ++i)
        result += data[i];
    return result;
}
```
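
For completeness, a tiny driver one might compile with -fopenmp on GCC or Clang, assuming the parallel_sum definition above is in scope; for all-ones input the expected sum is 96 × 16 = 1536:

```cpp
#include <array>
#include <cstdio>

int main() {
    std::array<float, 96 * 16> data {};
    data.fill(1.0f);                                  // one cache line per core
    std::printf("sum = %.1f\n", parallel_sum(data)); // expect 1536.0
    return 0;
}
```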
