Should you pass data by value, or by reference?
When you’re a certain kind of perfectionist, the kind that prevents you from being productive, this is a question that might plague you every time you write a function.
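To make the question concrete, here are the two options in C, using a made-up struct and function names of my own (nothing from the benchmark):

    /* A hypothetical 56-byte struct; the exact layout doesn't matter. */
    typedef struct {
        double position[3];
        double velocity[3];
        double mass;
    } Particle;

    /* By value: the callee receives its own copy of all 56 bytes. */
    double speed_by_value(Particle p);

    /* By reference: the callee receives an 8-byte pointer; no copy,
       but every field access goes through an indirection. */
    double speed_by_ref(const Particle *p);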
I’ve heard a few opinions floating around on this. I’ve heard “stack-to-stack copies are super cheap”, but… how cheap are they?
Fine, I’ll write a benchmark. Can’t take too long.
3… 2… (several months pass) 1… 🪄
And that’s how I ended up writing a graphing library, and a benchmark.
If you want to run these benchmarks on your own machine, or dump the assembly, you can check out the benchmark code.
This part requires JS. Sorry. I promise all it's doing is creating a graph.
This graph shows the overhead of passing structs of different sizes by value, on different machines.
In general, passing any struct by reference incurs the same overhead as passing a pointer-sized struct by value.
Okay, on a macro level, passing a function parameter by value takes time proportional to the size of the data.
Makes sense.
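For reference, each point on the graph comes from a loop that repeatedly calls a do-nothing function with a struct argument of the given size. The sketch below is my own simplification in C, not the actual harness (which generates one struct type and one callee per size); the names are made up:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* A 32-byte struct; the real benchmark generates one type per size. */
    typedef struct { uint8_t bytes[32]; } S32;

    /* Defined in a separate translation unit so the copy can't be optimized away. */
    void pass_32_by_value(S32 s);

    int main(void) {
        enum { ITERS = 100000000 };
        S32 s = {0};
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++)
            pass_32_by_value(s);            /* the copy being measured */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/call\n", ns / ITERS);
        return 0;
    }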
[graph]

asm - 1 byte:

╭─➤ xor    edi, edi
│   call   feed30 <pass_1_fields_by_value>
│   sub    ebx, 0x1
╰── jne    4786f0 <bench_1+0x20>

asm - 32 bytes:

╭─➤ sub    rsp, 0x20
│   movdqa xmm0, XMMWORD PTR [rsp+0x20]
│   movups XMMWORD PTR [rsp], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x30]
│   movups XMMWORD PTR [rsp+0x10], xmm0
│   call   fef080 <pass_32_fields_by_value>
│   add    rsp, 0x20
│   sub    ebx, 0x1
╰── jne    479510 <bench_32+0x40>
For smaller struct sizes, it doesn’t look like there’s a huge difference between passing 8 bytes and passing 32.
In the tabs above, you can inspect the assembly for the benchmark loops of 1-byte and 32-byte structs.
The assembly says there’s a difference, but seemingly not one you’d notice. Technically the 32-byte value is copied onto the stack using vector registers, whereas the 1-byte value is passed directly in a register, but the speed difference is negligible.
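In source terms, the ABI split responsible for that difference looks roughly like this (a sketch with my own names; on x86-64 System V, aggregates up to 16 bytes can travel in registers, while larger ones are passed in memory):

    #include <stdint.h>

    /* Small enough to be classified into a register (here, dil/edi). */
    typedef struct { uint8_t a; } S1;

    /* Larger than 16 bytes, so the caller copies it onto the stack --
       in the dump above, with two 16-byte movdqa/movups pairs. */
    typedef struct { uint8_t a[32]; } S32;

    void pass_1_by_value(S1 s);
    void pass_32_by_value(S32 s);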
[graph]

asm - 256 bytes:

╭─➤ sub    rsp, 0x100
│   movdqa xmm0, XMMWORD PTR [rsp+0x100]
│   movups XMMWORD PTR [rsp], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x110]
│   movups XMMWORD PTR [rsp+0x10], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x120]
│   movups XMMWORD PTR [rsp+0x20], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x130]
│   movups XMMWORD PTR [rsp+0x30], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x140]
│   movups XMMWORD PTR [rsp+0x40], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x150]
│   movups XMMWORD PTR [rsp+0x50], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x160]
│   movups XMMWORD PTR [rsp+0x60], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x170]
│   movups XMMWORD PTR [rsp+0x70], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x180]
│   movups XMMWORD PTR [rsp+0x80], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x190]
│   movups XMMWORD PTR [rsp+0x90], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1a0]
│   movups XMMWORD PTR [rsp+0xa0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1b0]
│   movups XMMWORD PTR [rsp+0xb0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1c0]
│   movups XMMWORD PTR [rsp+0xc0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1d0]
│   movups XMMWORD PTR [rsp+0xd0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1e0]
│   movups XMMWORD PTR [rsp+0xe0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1f0]
│   movups XMMWORD PTR [rsp+0xf0], xmm0
│   call   fefe80 <pass_256_fields_by_value>
│   add    rsp, 0x100
│   sub    ebx, 0x1
╰── jne    4927c0 <bench_256+0x100>

asm - 257 bytes:

╭─➤ sub    rsp, 0x110
│   mov    ecx, 0x20
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   movzx  eax, BYTE PTR [rsi]
│   mov    BYTE PTR [rdi], al
│   call   fefe90 <pass_257_fields_by_value>
│   add    rsp, 0x110
│   sub    ebx, 0x1
╰── jne    492a00 <bench_257+0x110>
There’s a beautiful, clean cliff here at 257 bytes. It seems to mark the difference between an unrolled, vectorized memcpy routine and one using rep movs (i.e. a microcoded for loop).
In the tabs above, you can inspect the assembly for 256 and 257 bytes.
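If you want to poke at that threshold without the full benchmark, a couple of structs and a gcc -O2 -S (or Compiler Explorer) is enough. This is a sketch with my own names; the exact cutoff is a compiler tuning decision, so other compilers or -march settings may switch strategies at a different size:

    #include <stdint.h>

    typedef struct { uint8_t bytes[256]; } S256;
    typedef struct { uint8_t bytes[257]; } S257;

    void sink_256(S256 s);
    void sink_257(S257 s);

    /* Compile with optimizations and compare the copies emitted in the
       caller: the 256-byte argument is copied with a run of 16-byte
       vector moves, while the 257-byte one switches to rep movs. */
    void call_both(const S256 *a, const S257 *b) {
        sink_256(*a);
        sink_257(*b);
    }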
[graph]

asm - 1672 bytes:

╭─➤ sub    rsp, 0x690
│   mov    ecx, 0xd1
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_1672_fields_by_value>
│   add    rsp, 0x690
│   sub    ebx, 0x1
╰── jne    <bench_1672+0x120>

asm - 1696 bytes:

╭─➤ sub    rsp, 0x6a0
│   mov    ecx, 0xd4
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_1696_fields_by_value>
│   add    rsp, 0x6a0
│   sub    ebx, 0x1
╰── jne    <bench_1696+0x100>
Well this is interesting.
The period is 32 bytes, with 8 fast and 24 slow struct widths per period.
Now, you might think this has something to do with having to copy the leftover memory after the rep movs, but that doesn’t seem to be the case. I’ve dumped the assembly of two functions above, one from the valley and one from the hill, and their assembly is near-identical.
Instead, it looks like the performance of rep movs itself is periodic.
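One way to convince yourself of that is to time rep movs on its own, with no struct passing involved. This is a minimal GCC/Clang inline-assembly sketch of mine, assuming x86-64; it isn't part of the benchmark:

    #include <stddef.h>

    /* Copy n quadwords with a bare rep movsq, bypassing memcpy entirely,
       so any periodicity in the timings belongs to the instruction itself.
       The "+D"/"+S"/"+c" constraints pin dst, src and the count to
       rdi, rsi and rcx, which is what rep movs expects. */
    static inline void rep_movsq(void *dst, const void *src, size_t n_qwords) {
        asm volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n_qwords)
                     :
                     : "memory");
    }

Sweep n_qwords across a few periods, time each size, and the 32-byte pattern should show up (or not) independently of the calling convention.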
[graph]

asm - 4064 bytes:

╭─➤ sub    rsp, 0xfe0
│   mov    ecx, 0x1fc
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_4064_fields_by_value>
│   add    rsp, 0xfe0
│   sub    ebx, 0x1
╰── jne    <bench_4064+0x100>

asm - 4072 bytes:

╭─➤ sub    rsp, 0xff0
│   mov    ecx, 0x1fd
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_4072_fields_by_value>
│   add    rsp, 0xff0
│   sub    ebx, 0x1
╰── jne    <bench_4072+0x120>
Alright, what’s going on here?
Passing a struct of 4064 bytes takes 53ns, but passing one of 4065 bytes takes 222ns, roughly four times as long.
And yes, these results are reproducible. I can run the benchmark as many times as I want, and always get the same spikes.
Again, I’ve dumped what is essentially matching assembly, one function from off the hill and one from on it, in the tabs above.
Since the spike doesn’t persist for greater struct widths, this probably isn’t a case of hitting a CPU cache limit.
It would appear that the rep movs instruction, as implemented in AMD Zen* CPU microcode, has a serious performance bug for these very specific size ranges. If any AMD engineers know what’s going on, please shoot me an email :)
If you work on GCC, Clang, or another compiler, and want to add an absolutely disgusting hack that removes 1-2 rep movs iterations and adds a few extra manual movs on my CPU (and probably other AMD CPUs), I’d also love to hear about it.
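For what it's worth, here is a C-level caricature of that hack, sketched by me rather than taken from any compiler. A real version would live in the compiler's copy-expansion code, where the size is a compile-time constant, and slow_band_excess() is a hypothetical predicate built from the measured 32-byte period (8 fast, 24 slow widths):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical: how many trailing bytes to peel off so the remaining
       rep movs count lands in a fast band (0 if it's already fast). */
    size_t slow_band_excess(size_t bytes);

    static void copy_struct_arg(void *dst, const void *src, size_t bytes) {
        size_t peel = slow_band_excess(bytes);
        size_t bulk = bytes - peel;
        /* In the compiler version, bulk is a known constant and is lowered
           to rep movs; the small remainder below would be emitted as a few
           explicit movs instead of a second copy call. */
        memcpy(dst, src, bulk);
        memcpy((char *)dst + bulk, (const char *)src + bulk, peel);
    }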