Should you pass data by value, or by reference?
When you’re a certain kind of perfectionist, the kind that prevents you from being productive, this is a question that might plague you every time you write a function.
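To make the question concrete, here are the two options in C, using a made-up struct and function names of my own (nothing from the benchmark):

    /* A hypothetical 56-byte struct; the exact layout doesn't matter. */
    typedef struct {
        double position[3];
        double velocity[3];
        double mass;
    } Particle;

    /* By value: the callee receives its own copy of all 56 bytes. */
    double speed_by_value(Particle p);

    /* By reference: the callee receives an 8-byte pointer; no copy,
       but every field access goes through an indirection. */
    double speed_by_ref(const Particle *p);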
I’ve heard a few opinions floating around on this. I’ve heard “stack-to-stack copies are super cheap”, but… how cheap are they?
Fine, I’ll write a benchmark. Can’t take too long.
3… 2… (several months pass) 1… 🪄
And that’s how I ended up writing a graphing library, and a benchmark.
If you want to run these benchmarks on your own machine, or dump the assembly, you can check out the benchmark code.
This part requires JS. Sorry. I promise all it's doing is creating a graph.
This graph shows the overhead of passing structs of different sizes by value, on different machines.
In general, passing any struct by reference incurs the same overhead as passing a pointer-sized struct by value.
Okay, on a macro level, passing a function parameter by value takes time proportional to the size of the data.
Makes sense.
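For reference, each point on the graph comes from a loop that repeatedly calls a do-nothing function with a struct argument of the given size. The sketch below is my own simplification in C, not the actual harness (which generates one struct type and one callee per size); the names are made up:

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* A 32-byte struct; the real benchmark generates one type per size. */
    typedef struct { uint8_t bytes[32]; } S32;

    /* Defined in a separate translation unit so the copy can't be optimized away. */
    void pass_32_by_value(S32 s);

    int main(void) {
        enum { ITERS = 100000000 };
        S32 s = {0};
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++)
            pass_32_by_value(s);            /* the copy being measured */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/call\n", ns / ITERS);
        return 0;
    }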
[graph]

asm - 1 byte:

╭─➤ xor    edi, edi
│   call   feed30 <pass_1_fields_by_value>
│   sub    ebx, 0x1
╰── jne    4786f0 <bench_1+0x20>

asm - 32 bytes:

╭─➤ sub    rsp, 0x20
│   movdqa xmm0, XMMWORD PTR [rsp+0x20]
│   movups XMMWORD PTR [rsp], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x30]
│   movups XMMWORD PTR [rsp+0x10], xmm0
│   call   fef080 <pass_32_fields_by_value>
│   add    rsp, 0x20
│   sub    ebx, 0x1
╰── jne    479510 <bench_32+0x40>
For smaller struct sizes, it doesn’t look like there’s a huge difference between passing 8 bytes and passing 32.
In the tabs above, you can inspect the assembly for the benchmark loops of 1-byte and 32-byte structs.
The assembly says there’s a difference, but seemingly not one you’d notice. Technically the 32-byte value is copied onto the stack using vector registers, whereas the 1-byte value is passed directly in a register, but the speed difference is negligible.
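In source terms, the ABI split responsible for that difference looks roughly like this (a sketch with my own names; on x86-64 System V, aggregates up to 16 bytes can travel in registers, while larger ones are passed in memory):

    #include <stdint.h>

    /* Small enough to be classified into a register (here, dil/edi). */
    typedef struct { uint8_t a; } S1;

    /* Larger than 16 bytes, so the caller copies it onto the stack --
       in the dump above, with two 16-byte movdqa/movups pairs. */
    typedef struct { uint8_t a[32]; } S32;

    void pass_1_by_value(S1 s);
    void pass_32_by_value(S32 s);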
[graph]

asm - 256 bytes:

╭─➤ sub    rsp, 0x100
│   movdqa xmm0, XMMWORD PTR [rsp+0x100]
│   movups XMMWORD PTR [rsp], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x110]
│   movups XMMWORD PTR [rsp+0x10], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x120]
│   movups XMMWORD PTR [rsp+0x20], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x130]
│   movups XMMWORD PTR [rsp+0x30], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x140]
│   movups XMMWORD PTR [rsp+0x40], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x150]
│   movups XMMWORD PTR [rsp+0x50], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x160]
│   movups XMMWORD PTR [rsp+0x60], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x170]
│   movups XMMWORD PTR [rsp+0x70], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x180]
│   movups XMMWORD PTR [rsp+0x80], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x190]
│   movups XMMWORD PTR [rsp+0x90], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1a0]
│   movups XMMWORD PTR [rsp+0xa0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1b0]
│   movups XMMWORD PTR [rsp+0xb0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1c0]
│   movups XMMWORD PTR [rsp+0xc0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1d0]
│   movups XMMWORD PTR [rsp+0xd0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1e0]
│   movups XMMWORD PTR [rsp+0xe0], xmm0
│   movdqa xmm0, XMMWORD PTR [rsp+0x1f0]
│   movups XMMWORD PTR [rsp+0xf0], xmm0
│   call   fefe80 <pass_256_fields_by_value>
│   add    rsp, 0x100
│   sub    ebx, 0x1
╰── jne    4927c0 <bench_256+0x100>

asm - 257 bytes:

╭─➤ sub    rsp, 0x110
│   mov    ecx, 0x20
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   movzx  eax, BYTE PTR [rsi]
│   mov    BYTE PTR [rdi], al
│   call   fefe90 <pass_257_fields_by_value>
│   add    rsp, 0x110
│   sub    ebx, 0x1
╰── jne    492a00 <bench_257+0x110>
There’s a beautiful, clean cliff here at 257 bytes. It seems to mark the difference between an unrolled, vectorized memcpy routine and one using rep movs (i.e. a microcoded for loop).
In the tabs above, you can inspect the assembly for 256 and 257 bytes.
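If you want to poke at that threshold without the full benchmark, a couple of structs and a gcc -O2 -S (or Compiler Explorer) is enough. This is a sketch with my own names; the exact cutoff is a compiler tuning decision, so other compilers or -march settings may switch strategies at a different size:

    #include <stdint.h>

    typedef struct { uint8_t bytes[256]; } S256;
    typedef struct { uint8_t bytes[257]; } S257;

    void sink_256(S256 s);
    void sink_257(S257 s);

    /* Compile with optimizations and compare the copies emitted in the
       caller: the 256-byte argument is copied with a run of 16-byte
       vector moves, while the 257-byte one switches to rep movs. */
    void call_both(const S256 *a, const S257 *b) {
        sink_256(*a);
        sink_257(*b);
    }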
[graph]

asm - 1672 bytes:

╭─➤ sub    rsp, 0x690
│   mov    ecx, 0xd1
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_1672_fields_by_value>
│   add    rsp, 0x690
│   sub    ebx, 0x1
╰── jne    <bench_1672+0x120>

asm - 1696 bytes:

╭─➤ sub    rsp, 0x6a0
│   mov    ecx, 0xd4
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_1696_fields_by_value>
│   add    rsp, 0x6a0
│   sub    ebx, 0x1
╰── jne    <bench_1696+0x100>
Well this is interesting.
The period is 32 bytes, with 8 fast and 24 slow struct widths per period.
Now, you might think this has something to do with having to copy the leftover memory after the rep movs, but that doesn’t seem to be the case. I’ve dumped the assembly of two functions above, one from the valley and one from the hill, and their assembly is near-identical.
Instead, it looks like the performance of rep movs itself is periodic.
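One way to convince yourself of that is to time rep movs on its own, with no struct passing involved. This is a minimal GCC/Clang inline-assembly sketch of mine, assuming x86-64; it isn't part of the benchmark:

    #include <stddef.h>

    /* Copy n quadwords with a bare rep movsq, bypassing memcpy entirely,
       so any periodicity in the timings belongs to the instruction itself.
       The "+D"/"+S"/"+c" constraints pin dst, src and the count to
       rdi, rsi and rcx, which is what rep movs expects. */
    static inline void rep_movsq(void *dst, const void *src, size_t n_qwords) {
        asm volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n_qwords)
                     :
                     : "memory");
    }

Sweep n_qwords across a few periods, time each size, and the 32-byte pattern should show up (or not) independently of the calling convention.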
[graph]

asm - 4064 bytes:

╭─➤ sub    rsp, 0xfe0
│   mov    ecx, 0x1fc
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_4064_fields_by_value>
│   add    rsp, 0xfe0
│   sub    ebx, 0x1
╰── jne    <bench_4064+0x100>

asm - 4072 bytes:

╭─➤ sub    rsp, 0xff0
│   mov    ecx, 0x1fd
│   mov    rsi, rbp
│   mov    rdi, rsp
│   rep movs QWORD PTR es:[rdi], QWORD PTR ds:[rsi]
│   call   <pass_4072_fields_by_value>
│   add    rsp, 0xff0
│   sub    ebx, 0x1
╰── jne    <bench_4072+0x120>
Alright, what’s going on here?
Passing a struct of 4064 bytes takes 53ns, but passing one of 4065 bytes takes 222ns, roughly four times as long.
And yes, these results are reproducible. I can run the benchmark as many times as I want, and always get the same spikes.
Again, I’ve dumped what is essentially matching assembly, one function from off the hill and one from on it, in the tabs above.
Since the spike doesn’t persist for greater struct widths, this probably isn’t a case of hitting a CPU cache limit.
It would appear that the rep movs instruction, as implemented in AMD Zen* CPU microcode, has a serious performance bug for these very specific size ranges. If any AMD engineers know what’s going on, please shoot me an email :)
If you work on GCC, Clang, or another compiler, and want to add an absolutely disgusting hack that removes 1-2 rep movs iterations and adds a few extra manual movs on my CPU (and probably other AMD CPUs), I’d also love to hear about it.
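For what it's worth, here is a C-level caricature of that hack, sketched by me rather than taken from any compiler. A real version would live in the compiler's copy-expansion code, where the size is a compile-time constant, and slow_band_excess() is a hypothetical predicate built from the measured 32-byte period (8 fast, 24 slow widths):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical: how many trailing bytes to peel off so the remaining
       rep movs count lands in a fast band (0 if it's already fast). */
    size_t slow_band_excess(size_t bytes);

    static void copy_struct_arg(void *dst, const void *src, size_t bytes) {
        size_t peel = slow_band_excess(bytes);
        size_t bulk = bytes - peel;
        /* In the compiler version, bulk is a known constant and is lowered
           to rep movs; the small remainder below would be emitted as a few
           explicit movs instead of a second copy call. */
        memcpy(dst, src, bulk);
        memcpy((char *)dst + bulk, (const char *)src + bulk, peel);
    }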