Every Byte Matters

I have spent a large portion of my career working in Java. In that time, you get used to huge classes. New functionality? Just add a new method and field to the class. The cost of each new field is rarely considered. Performance is often considered from a classic computer science perspective by considering asymptotic analysis of the algorithms and data structures in-use.

Turns out that even within a growth scale for your algorithm, such as a simple for-loop O(N) , time can vary dramatically if we have a little deeper understanding of the underlying hardware.

First, let’s understand our current machine. Let’s take a peek at our cache line and page sizes.

$ lscpu | grep -i cache L1d cache: 352 KiB ( 10 instances ) L1i cache: 640 KiB ( 10 instances ) L2 cache: 10 MiB ( 5 instances ) L3 cache: 12 MiB ( 1 instance ) $ getconf LEVEL1_DCACHE_LINESIZE 64

The instances number is a reflection of how the caches are shared amongst CPUs. If I had 10 CPUs, each one has their own L1d cache, whereas two of them would share an L2 cache.

Our cache line size is 64 bytes.

┌─────────────────────────────────────────────┐ │ 64 bytes │ │ byte 0 byte 1 byte 2 ... byte 63 │ └─────────────────────────────────────────────┘

When you read a single byte from memory, the hardware will fill the surrounding 64 bytes into the cache line. The idea being that data is often temporal and spatially located, meaning data is often accessed near each other and close in time to each other.

We can reference Jeff Dean’s famous “Latency numbers every programmer should know”, however a quick recap with the values from our particular machine is the following:

┌──────────────────────────────────────────────────────────────┐ │ CPU Core │ │ ┌───────────┐ │ │ │ Registers │ < 1 ns │ │ └─────┬─────┘ │ │ ▼ │ │ ┌───────────┐ │ │ │ L1d Cache │ ~35 KiB/core ~4-5 cycles ~1-2 ns │ │ │ │ ~560 cache lines │ │ └─────┬─────┘ │ │ ▼ │ │ ┌───────────┐ │ │ │ L2 Cache │ ~2 MiB/core-pair ~12-15 cycles ~4-5 ns │ │ │ │ ~32,000 cache lines │ │ └─────┬─────┘ │ │ ▼ │ │ ┌───────────┐ │ │ │ L3 Cache │ 12 MiB shared ~30-40 cycles ~10-15 ns │ │ │ │ ~196,000 cache lines │ │ └─────┬─────┘ │ │ ▼ │ │ ┌───────────┐ │ │ │ DRAM │ ~100-200 cycles ~60-100 ns │ │ │ │ │ │ └───────────┘ │ └──────────────────────────────────────────────────────────────┘

... continue reading