PivCo-Huffman "Merge" Operations

There’s a new paper out called “PivCo-Huffman” (HTML version with annotations here) and it’s very interesting. Normal Huffman decoding (and, to a lesser extent, encoding) is inherently quite serial. We can get explicit parallelism by using multiple streams, which scales just fine to moderate numbers of streams – something like 4-8 is usually not an issue. Not very suitable for vectorization or wide vector machines like GPUs, though: every extra stream adds signaling overhead in the bitstream, and decoding from many streams at once is a gather-heavy operation. GPUs can cope with gathers okay (although spread-out data accesses still consume a lot more cycles than more compact memory access patterns do), most everything else is very unhappy with extremely gather-heavy patterns.

Especially since our usual table-based decoders will then immediately follow up with another gather, although that part is more easily fixed: canonical codes admit reasonably easy table-less decoding (at least for the critical “length determination” part), if desired. These approaches are not particularly juicy in a scalar one-at-a-time decoder, where a single load is very cheap, but if you’re working on 32-wide vectors, doing a bunch of integer math that saves you a gather is a more interesting proposition.

You can also use ANS-style interleaving to turn many logical bitstreams into a single physical bitstream. This is what GDeflate does to get a bitstream that has more potential parallelism without gathering from all over the place. Interleaving solves the “reads all over memory” problem and works great on GPUs and CPUs that provide suitable facilities (e.g. AVX-512 “EXPAND” family instructions, or just fast gathers from the same cache line), but is slow without that kind of hardware support. It also forces you to pick a magic number (the interleave factor) as part of the bitstream definition. Too low, and wide vector machines like GPUs get bad utilization when decoding. Too high, and scalar machines can’t efficiently decode the bitstream at all. Worse, numbers that are just right for say AVX-512 or GPUs overlap, but the smallest number of interleaved streams most GPUs or even CPUs with something like AVX-512 can get decent utilization with is around 32, the largest number scalar or narrower vector machines with typical register counts can comfortably decode is around 8-16 (depending on how exactly you do it), and in the middle between 16 and 32 is a valley of tears where nobody’s happy.

That’s why you don’t see much of this style of bitstream outside of GDeflate; it is inherently reliant on you picking a magic number that has a huge influence on every single implementation, yet at the same time is subject to ever-shifting hardware details. Yes, NVidia GPUs have been built around warps with 32 “threads” (more like SIMD lanes) using 32 bits per register each for a good while, but AMD used to prefer 64 (before switching to mostly-32 a few generations ago), Intel GPUs were designed around 8, some mobile GPUs used to do as little as 4, and 4 is also the number of 32-bit lanes you get for something like ARM NEON or SSE4.2 on the x86 side, later widened to 8 with AVX2. Finally, ARM SVE and the RISC-V V extension leave it implementation defined, which is one thing when the model is that you’re running a simple vector math kernel like a SAXPY, but quite another when the physical vector width ends up being a magic constant that is baked into some persistent wire format, potentially for decades.

The final major popular solution to the “this is very serial” conundrum is to completely brute-force it. Fetch a bunch of bits from the input. In parallel, start decoding at bit 0, and at a bit 1, at bit 2, 3, and so forth. Do this for a lot of sequential candidate positions. I’m talking something like 16 or more.

Eventually, we discover that the code starting at bit 0 was 6 bits long. Cool! That means we discard all the work we did at bit offsets 1 through 6, and look at what our already pre-decoded code starting at bit 6 was. That one is 5 bits long, so we discard the speculative work we did for bits 7 through 10, and find that starting at bit 11, there’s another code that’s 5 bits long. Congratulations, we’ve just decoded 3 values for the price of 16.

This works just fine. Believe it or not, this is actually the preferred method of decoding such a sequential bitstream in hardware if a throughput of more than one value per cycle is required. Especially with canonical codes and not-too-large length limits, this can be made reasonably cheap. Chasing down the sequence of actual values to emit at the end is a potential scalability problem, but 3 or 4 at a time isn’t too bad, and if that’s not enough, there’s better algorithms to chase these links that give yet more parallelism. In short, this can be done, and if you build dedicated hardware for it, it’s not bad, but it’s not something you ever want to do in SW on general-purpose hardware. I’ve implemented it out of curiosity, and even on a GPU, it turns out routinely throwing away more than 80% of the work you do is not a recipe for fast code, and let’s not even talk about wasted energy when you try to pull this kind of stunt on a power-constrained battery-powered device.

So far, that’s been our main options: we can decode sequentially one at a time, but that doesn’t scale. We can use multiple separate bitstreams, which theoretically gives us something like thread-level or perhaps instruction-level parallelism, but that amount of parallelism is baked into the bitstream and adds some amount of overhead (since each decoder threads needs to know where to start). It’s also not friendly to SIMD-style implementations because the divergent memory access patterns make us sad. We can use interleaved bitstreams which have much better memory access locality and are a good match to SIMD and GPU architectures in principle, but need us to pick a magic number at bitstream specification time that will forever limit the amount of decoder parallelism we can get, but also ruin our day if the number we pick is high enough to make some of our target HW run out of registers. And even within the same class of device (say, GPUs), there’s not anything resembling agreement on what that number ought to be. Finally, we can shrug, be extremely wasteful, and discard most of the work we did immediately. It’s parallelism all right, but it’s hardly work-efficient.

Enter PivCo

Marcin Żukowski’s “Pivco-Huffman” turns this on its side. Literally. I’ll go through this rather quickly to get to the meat of this post, refer to the paper for more details. For concreteness, suppose we want to encode the string “abracadabra”. Building the standard Huffman tree (it’s a textbook algorithm that I’m not going to explain here) yields something like this picture:

... continue reading