For basically any high-performance computation, memory layout and access pattern are critical. Common wisdom says that linear, contiguous memory performs best and should almost always be preferred. However, it should be intuitively clear that this has diminishing returns: processing a single 32 GB block vs. processing two 16 GB blocks will not meaningfully differ in performance. Since working with smaller blocks enables some interesting data structures, I set out to determine experimentally what block size is needed to capture essentially the full performance of linear access.
Findings
Setup and detailed analysis below, but my personal takeaway is:
- 1 MB blocks are enough for basically any workload of this kind.
- 128 kB blocks suffice once the workload spends at least ~1 cycle per processed byte.
- 4 kB blocks are already enough above ~10 cycles per processed byte (for raw data processing; other per-block costs may change this).
This is the full results chart for my Ryzen 9 7950X3D, effectively showing the block sizes needed for peak performance across different workloads. The rest of this post will go over the setup and discuss a few isolated graphs.
Code and results are available here: github.com/solidean/bench-linear-access
Setup