Once upon a time I worked in the field of Java Optimizations.
The target system was a distributed data processing platform that ran across hundreds of thousands of machines. At such scales, a 0.5% improvement would easily make up my salary going forward, and 2% was a good result for the half.
That doesn’t mean it was easy. Never have I ever seen such a highly optimized Java codebase. Not before, not since.
Every low hanging fruit had long since been picked clean. For example, there was minimal use of java.lang.String because ain’t nobody got time for a second allocation of a char[] with associated indirection and GC churn.
In other words, we were looking at increasingly heroic optimizations to justify our own existence.
Through careful, large scale profiling, I discovered that data serialization was a small but worthwhile place to focus. In particular, VarInt encoding used enough CPU time to warrant investigation.
If you’re not familiar, a VarInt, or more specifically ULEB128, is a byte efficient way to encode arbitrary integers. You use however many groups of 7 bits you need for your number, and use the 8th bit to specify whether there is another byte coming.
Numbers needing 7-bits or less (< 128) are encoded as 0nnnnnn
14-bit (<16384), 1nnnnnnn 0nnnnnnn
21bit (<2097152), 1nnnnnnn 1nnnnnnn 0nnnnnnn
... continue reading