TLDR
On commercial hardware, RwLock was ~5× slower than Mutex for a read-heavy cache workload due to atomic contention and cache-line ping-pong.
Introduction
This is a story about how “obvious” optimizations can backfire.
While building Redstone, a high-performance tensor cache in Rust, I hit a wall with write-lock contention. I thought a read lock would mitigate this (boy, was I wrong :( ). I expected throughput to go through the roof, since multiple threads could finally read simultaneously. It was not even close: the write lock outperformed the read lock by around 5×.
Here is why read locks may perform worse (and destroy expectations).
System Context & Workload
The experiment was simple: I was benchmarking a Least Recently Used (LRU) tensor cache.
Hardware: Apple Silicon M4 (10 cores, 16GB RAM).
Software: Rust 1.92.0, using parking_lot::RwLock.
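To make the shape of this comparison concrete, here is a minimal sketch of a read-only benchmark that pits an exclusive lock against a read lock. It is not the Redstone benchmark: a plain HashMap stands in for the LRU tensor cache, std::sync::Mutex and std::sync::RwLock are swapped in for parking_lot so the example has no dependencies, and every name and constant in it (THREADS, READS_PER_THREAD, KEYS, run, bench_mutex, bench_rwlock) is hypothetical. Timings will differ by machine and lock implementation, so treat the output as illustrative only.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical workload parameters, chosen only for illustration.
const THREADS: usize = 8;
const READS_PER_THREAD: usize = 200_000;
const KEYS: u64 = 1024;

// A plain map standing in for the cache (assumption: the real cache is an LRU).
fn build_map() -> HashMap<u64, u64> {
    (0..KEYS).map(|k| (k, k * 2)).collect()
}

/// Spawns THREADS workers that each call `read` READS_PER_THREAD times.
/// Returns the elapsed wall-clock time and the wrapping sum of all values read.
fn run<F>(read: F) -> (Duration, u64)
where
    F: Fn(u64) -> u64 + Send + Sync + 'static,
{
    let read = Arc::new(read);
    let start = Instant::now();
    let handles: Vec<_> = (0..THREADS)
        .map(|t| {
            let read = Arc::clone(&read);
            thread::spawn(move || {
                let mut sum = 0u64;
                for i in 0..READS_PER_THREAD {
                    // Deterministic key sequence so both variants do identical work.
                    let key = (t * 31 + i) as u64 % KEYS;
                    sum = sum.wrapping_add(read(key));
                }
                sum
            })
        })
        .collect();
    let total = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .fold(0u64, u64::wrapping_add);
    (start.elapsed(), total)
}

/// Every lookup takes the exclusive lock.
fn bench_mutex() -> (Duration, u64) {
    let map = Mutex::new(build_map());
    run(move |k| *map.lock().unwrap().get(&k).unwrap())
}

/// Every lookup takes the shared (read) lock.
fn bench_rwlock() -> (Duration, u64) {
    let map = RwLock::new(build_map());
    run(move |k| *map.read().unwrap().get(&k).unwrap())
}

fn main() {
    let (mutex_time, mutex_sum) = bench_mutex();
    let (rwlock_time, rwlock_sum) = bench_rwlock();
    // Both variants must have read exactly the same values.
    assert_eq!(mutex_sum, rwlock_sum);
    println!("Mutex:  {:?}", mutex_time);
    println!("RwLock: {:?}", rwlock_time);
}
```

The critical section here is a single hash lookup, so the cost of taking and releasing the lock dominates the measurement. That is the regime where lock overhead, rather than time spent holding the lock, decides which primitive wins.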