Why is calling my asm function from Rust slower than calling it from C?

This is a follow-up to making the rav1d video decoder 1% faster, where we compared profiler snapshots of rav1d (the Rust implementation) and dav1d (the C baseline) to find specific functions that were slower in the Rust implementation.

Today, we are going to pay off a small debt from that post: since dav1d and rav1d share the same hand-written assembly functions, we used them as anchors to navigate the different implementations - they, at least, should match exactly! And they did. Well, almost all of them did.

This, dear reader, is the story of the one function that didn’t.

An Overview

We’ll need to ask - and answer! - three ‘Whys’ today:

Using the same techniques from last time, we’ll see that a specific assembly function is, indeed, slower in the Rust version.

1. But why? ➡️ Because loading data in the Rust version is slower, which we discover using samply’s special asm view.
2. But why? ➡️ Because the Rust version stores much more data on the stack, which we find by playing with some arguments and looking at the generated LLVM IR.
3. But why? ➡️ Because the compiler cannot optimize away a specific Rust abstraction across function pointers!
4. Which we fix by switching to a more compiler-friendly version (PR).
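To make the last ‘Why’ a bit more concrete before we dive in, here is a minimal sketch of the general effect. It is not code from rav1d or dav1d: `Args`, `Kernel`, `sum_kernel`, `call_direct` and `call_indirect` are all hypothetical names, and the wrapper struct is only a stand-in for "some Rust abstraction that travels with the call". The idea is that when the callee is known, the compiler is free to inline it and keep the wrapper in registers, but when the call goes through a function pointer, the callee is opaque and anything passed by pointer typically has to be written to the stack first.

```rust
/// Hypothetical wrapper that is passed alongside the raw data pointer.
/// (Illustration only, not the abstraction used in rav1d.)
#[repr(C)]
pub struct Args {
    pub len: usize,
    pub bias: u32,
}

/// Hypothetical signature shared by all implementations of the kernel,
/// similar in spirit to how an asm routine is reached via a function pointer.
pub type Kernel = unsafe extern "C" fn(data: *const u8, args: *const Args) -> u32;

/// A plain Rust stand-in for the hand-written asm function.
pub unsafe extern "C" fn sum_kernel(data: *const u8, args: *const Args) -> u32 {
    // SAFETY: the caller guarantees that `args` points to a valid `Args`
    // and that `data` points to at least `args.len` readable bytes.
    let args = unsafe { &*args };
    let bytes = unsafe { std::slice::from_raw_parts(data, args.len) };
    bytes.iter().map(|&b| u32::from(b)).sum::<u32>() + args.bias
}

/// Direct call: the callee is visible, so the optimizer may inline it and
/// never materialize `args` in memory at all.
pub fn call_direct(data: &[u8], bias: u32) -> u32 {
    let args = Args { len: data.len(), bias };
    // SAFETY: `args` is a live local and `data` is a valid slice.
    unsafe { sum_kernel(data.as_ptr(), &args) }
}

/// Indirect call: the callee is only known at runtime, so `args` generally
/// has to be stored to the stack before the call so the pointer is valid.
pub fn call_indirect(kernel: Kernel, data: &[u8], bias: u32) -> u32 {
    let args = Args { len: data.len(), bias };
    // SAFETY: same contract as above; `kernel` must uphold `Kernel`'s contract.
    unsafe { kernel(data.as_ptr(), &args) }
}
```

Both wrappers return the same value (for example, `call_indirect(sum_kernel, &[1, 2, 3, 4], 10)` and `call_direct(&[1, 2, 3, 4], 10)` agree), so any difference only shows up in the generated code, which is exactly where we'll have to look.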

Side note: again, we’ll be running all these benchmarks on a MacBook, so our tools are a tad limited and we’ll have to resort to some guesswork. Leave a comment if you know more - or, even better, write an article about profiling on macOS 🍎💨.
