Performance Improvements in Libffi

libffi is a function call interpreter. You hand it a description of a function’s signature at runtime, and it works out, on the spot, how to place each argument and make the call. It interprets the calling convention the way a bytecode VM interprets instructions. Nothing is compiled ahead of time, because the whole point is that you don’t know the signature ahead of time.

An interpreter is not what you reach for when you want speed. The usual answer is to JIT: compile a bespoke call stub for each signature, native code that drops the arguments into their registers and jumps, with nothing left to interpret at runtime. It’s quicker, but it gets there by writing fresh machine code into memory that’s both writable and executable, which is exactly what modern systems are trying to stamp out.

So libffi stays an interpreter, on purpose. The question I set out to answer was how much faster it could get that way, by reusing what it already knows instead of generating code at runtime or mapping any page writable and executable.

The waste

When you call a function through libffi, the work splits across two places. ffi_prep_cif runs once per signature. It classifies the whole thing, but it keeps only two results: the size of the stack frame the call will need, and a small code for how the return value comes back. The frame size has to be known before the call is built, because any argument that doesn’t fit in a register spills to the stack, and that space is reserved up front. The return code is for afterward, because the result comes back in rax, or xmm0, or memory depending on the type, and something has to know where to read it from. Both are small and fixed-size, so they live in the ffi_cif. What prep throws away is the part it spent most of its time on: where each individual argument goes.

So on every ffi_call, the marshalling code walks the argument list again and re-derives that placement from scratch before copying the values into place. For a three-argument call on x86-64 that’s around 650 instructions of bookkeeping, and it produces the identical answer every single time.

Most of those instructions aren’t moving argument bytes. They’re deciding where the bytes go. The System V AMD64 ABI classifies every argument by a fixed procedure, and running that procedure on a single argument means walking its type, recursing into a struct’s fields and chasing the pointers in its type descriptor, sorting each 8-byte chunk into an INTEGER or SSE register class, and checking whether it still fits in the registers that are left or has to spill to the stack. That is branch-heavy, pointer-chasing work, the sort a CPU runs slowly, and it reruns on every call to compute a placement that never changes.
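To make the shape of that work concrete, here is a minimal sketch of the register-assignment step for scalar arguments only. It is not libffi's code: the type names and the classify_args function are hypothetical, and struct handling (the recursive, pointer-chasing part) is left out entirely. The only facts taken from the System V AMD64 ABI are the six integer argument registers (rdi, rsi, rdx, rcx, r8, r9) and the eight SSE argument registers (xmm0 through xmm7).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified argument kinds: scalars only, no structs. */
typedef enum { ARG_INT, ARG_PTR, ARG_DOUBLE } arg_kind;

/* Where one argument ends up. */
typedef enum { LOC_GPR, LOC_SSE, LOC_STACK } loc_kind;

typedef struct {
    loc_kind kind;
    int      index;  /* register number within its class, or 8-byte stack slot */
} placement;

/* System V AMD64: rdi, rsi, rdx, rcx, r8, r9 and xmm0..xmm7. */
enum { MAX_GPR = 6, MAX_SSE = 8 };

/* Assign each argument a location, left to right.
 * Returns the number of 8-byte stack slots the call needs. */
static int classify_args(const arg_kind *args, size_t n, placement *out)
{
    int gpr = 0, sse = 0, stack = 0;

    assert(args != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (args[i] == ARG_DOUBLE) {
            /* SSE class: next free xmm register, else spill. */
            if (sse < MAX_SSE) out[i] = (placement){ LOC_SSE, sse++ };
            else               out[i] = (placement){ LOC_STACK, stack++ };
        } else {
            /* INTEGER class: next free general-purpose register, else spill. */
            if (gpr < MAX_GPR) out[i] = (placement){ LOC_GPR, gpr++ };
            else               out[i] = (placement){ LOC_STACK, stack++ };
        }
    }
    return stack;
}
```

Even in this stripped-down form, the output depends only on the list of argument kinds, never on the argument values, which is why recomputing it on every call is wasted effort.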

But function argument placement is a pure function of the signature. We can compute it once, remember it, and skip the work on every later call.

A plan

The fix is a “plan”: the placement compiled into a flat list of moves, a tiny bytecode for one signature. If ffi_call re-deriving the placement on every call is like interpreting a program by re-walking its syntax tree each time, the plan is the compiled bytecode: the tree-walk happens once, and every later call just runs the flat list. build_plan walks the argument types once, classifies each one the way the ABI rules say, and emits a move per piece: this 8-byte word goes in rdi, that 32-bit int gets sign-extended into rsi, this double lands in an SSE slot, that oversized thing spills to the stack. With the plan in hand, making the call is just running the moves. No re-classification.
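A rough sketch of the idea, under loud assumptions: the move encoding, the plan and frame structs, and run_plan are all invented here for illustration, and this build_plan only shares its name with the real one. It handles three scalar types, one move per argument, and writes into a simulated register file instead of making an actual call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical argument types; the real type descriptors are richer. */
typedef enum { T_SINT32, T_UINT64, T_DOUBLE } arg_type;

typedef enum { LOAD_64, LOAD_SEXT32 } load_op;          /* how to read the source */
typedef enum { DST_GPR, DST_SSE, DST_STACK } dst_class; /* where to put it */

/* One move: read an argument, widen if needed, store to a fixed slot. */
typedef struct { uint8_t load, dst, index; } move;

enum { MAX_ARGS = 16, N_GPR = 6, N_SSE = 8 };

typedef struct { size_t n, stack_slots; move moves[MAX_ARGS]; } plan;

/* Simulated destination: stands in for real registers and the stack. */
typedef struct {
    uint64_t gpr[N_GPR];
    uint64_t sse[N_SSE];       /* raw bits of each double */
    uint64_t stack[MAX_ARGS];
} frame;

/* Runs once per signature: all classification decisions happen here. */
static void build_plan(plan *p, const arg_type *types, size_t n)
{
    size_t gpr = 0, sse = 0, stack = 0;

    assert(n <= MAX_ARGS);
    for (size_t i = 0; i < n; i++) {
        uint8_t load = (types[i] == T_SINT32) ? LOAD_SEXT32 : LOAD_64;
        if (types[i] == T_DOUBLE && sse < N_SSE)
            p->moves[i] = (move){ load, DST_SSE, (uint8_t)sse++ };
        else if (types[i] != T_DOUBLE && gpr < N_GPR)
            p->moves[i] = (move){ load, DST_GPR, (uint8_t)gpr++ };
        else
            p->moves[i] = (move){ load, DST_STACK, (uint8_t)stack++ };
    }
    p->n = n;
    p->stack_slots = stack;
}

/* Runs on every call: no type inspection, just the recorded moves. */
static void run_plan(const plan *p, void *const *avalue, frame *f)
{
    for (size_t i = 0; i < p->n; i++) {
        const move *m = &p->moves[i];
        uint64_t word;

        if (m->load == LOAD_SEXT32) {
            int32_t v;
            memcpy(&v, avalue[i], sizeof v);
            word = (uint64_t)(int64_t)v;   /* sign-extend to 64 bits */
        } else {
            memcpy(&word, avalue[i], sizeof word);
        }
        switch (m->dst) {
        case DST_GPR:   f->gpr[m->index]   = word; break;
        case DST_SSE:   f->sse[m->index]   = word; break;
        default:        f->stack[m->index] = word; break;
        }
    }
}
```

In this sketch, a signature of (uint64, int32, double) builds a three-move plan once; each later call passes an array of pointers to the argument values, in the style of ffi_call's avalue parameter, and run_plan fills gpr[0], gpr[1], and sse[0] without looking at a type again.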
