
How much linear memory access is enough?

Why This Matters

This article challenges the traditional emphasis on large contiguous memory regions in high-performance computing: experimentally, blocks of around 1 MB are already enough to reach peak performance. Knowing the block size at which returns diminish helps both hardware designers and software developers, since near-peak throughput from smaller blocks enables more flexible and scalable data structures and processing strategies.

Key Takeaways

For basically any high-performance computation, memory layout and access pattern are critical. Common wisdom is that linear, contiguous memory performs best and should almost always be preferred. However, it should be intuitively clear that this has diminishing returns: processing a single 32 GB block vs processing two 16 GB blocks will not meaningfully differ in performance. Working with smaller blocks enables some interesting data structures, so I've set out to experimentally determine what block size is needed to effectively capture the full performance.

Findings

Setup and detailed analysis below, but my personal takeaway is:

1 MB blocks are enough for basically any workload of this kind

128 kB blocks suffice once you have at least ~1 cycle per processed byte

4 kB blocks are already enough once you're above roughly 10 cycles per processed byte

(for raw data processing, not necessarily if there are other per-block costs)

This is the full results chart for my Ryzen 9 7950X3D, effectively showing the block sizes needed for peak performance across different workloads. The rest of this post will go over the setup and discuss a few isolated graphs.

Code and results are available here: github.com/solidean/bench-linear-access

Setup
