Notes from Optimizing CPU-Bound Go Hot Paths
Go does a lot of things right. And I love go because of that. But while porting Brotli to pure Go for go-brrr, I kept hitting the same pattern: idiomatic abstractions made hot paths slower, and the fastest version was often hand-duplicated and specialized.
Lack of zero-cost abstractions
In the hot loops I was optimizing, generics, polymorphic dispatch (via interface), and closures often prevented the compiler from producing the same code as the concrete version. The reason is that go doesn't inline these calls in the shapes I was using (we will see problems with inlining very often in this post because inlining is quite important). Yes, the compiler can sometimes inline a direct closure call or devirtualize an interface call, but in the patterns I actually ran into it didn't, and I ate the call overhead. It's clear why interface calls are not inlined. They enable swapping of implementation at runtime rather than compile time. But generics allow to swapping implementation at compile time. If you are coming from languages like c++ or rust you'd expect the generic functions to be monomorphized (all variants pre-generated as concrete functions at compile time) but in go it doesn't happen,at least not in that form. Go uses approach they called GC Shape Stenciling where some parts are pre-generated at compile time but method calls on type parameters end up going through interface-style dispatch (technically the itab is reached via a generics dictionary rather than an ordinary interface argument, but the effect on the hot path is the same). The impossibility of inlining is addressed in the proposal:
The one exception is that method calls won't be fully resolvable at compile time... inlining won't happen in situations where it could happen with a fully stenciled implementation.
So what do we do? Actually, no problem. We just don't use abstractions like generics and duplicate concrete functions. We just take a concrete function duplicate it completely with parts that we wanted to parametrize changed. Needless to say, this will cause lots of duplication. In the Brotli port there were 16 almost identical functions and the only difference between them was that they were calling different versions of the hash function. The 16 variants couldn't be collapsed into one via usage of abstractions because the function is used in hot path.
So performance problem is solved by duplication but this introduces a potentially big problem of maintenance. This can be somewhat mitigated by code generation of course but it's very likely that you will have big number of occurrences where you have only 2-3 duplicating variants which will not justify introduction of codegen.
The next section is a deep dive into benchmarking of the concrete-vs-generic-vs-interface approaches with some exploration of underlying assembly which you can happily skip.
Deep dive
Let's illustrate all described above by example. Here is the function I used in a real codebase reduced from unimportant stuff.
... continue reading