
Two studies in compiler optimisations

Why This Matters

This article highlights the importance of understanding compiler optimisations, especially LLVM, to improve performance and troubleshoot unexpected issues. As compilers become more sophisticated, developers need deeper insights into their inner workings to fully leverage their capabilities and avoid performance pitfalls.

Introduction

While many performance-oriented programmers are intimately acquainted with the almost preternatural ability of modern compilers to optimise their code, and many of us have spent countless hours on Compiler Explorer examining the differences between the Assembly generated by different versions of gcc and clang, most have likely not looked under the hood to see how the magic happens. It is a testament to their quality that most of us simply treat compilers as black boxes: more or less readable code goes in, fast binaries come out. Sometimes, however, seemingly innocuous changes—perhaps even meant to help the compiler—can cause surprising performance issues which we are hard-pressed to explain without a deeper understanding of the underlying machinery.

In this post we’ll dive into the implementation of some of the LLVM optimisation passes using two simple examples which, nonetheless, will help us pull back the veil on the complexity involved in producing highly-optimised code. We will see how small source changes can trigger different paths in the compiler’s internal processing with unexpected consequences, demonstrating how achieving high performance can be as much an art as it is a science for both compiler developers and users. I have also included a few exercises for those interested in getting their hands dirty, but they are not required to follow along with the main text.

I use LLVM 22.1.0 as the reference implementation throughout this post. The examples are written in (very basic) C++23 and target x86-64, and the Assembly code uses Intel syntax. Prior knowledge of LLVM IR is not required but it can be helpful (I recommend A Gentle Introduction to LLVM IR).

Case 1: Modular increment

The scenario

Consider the following C++ function to get the next index into an array or vector of elements accessed in a round-robin fashion, with cur being the current index and count the number of elements:

```cpp
unsigned next_naive(unsigned cur, unsigned count) {
    return (cur + 1) % count;
}
```

As written, this code requires an expensive 32-bit division instruction (a reciprocal throughput of 6 cycles and a latency of 12 cycles on an Intel Ice Lake core, worse in previous generations). There are, of course, numerous tricks to replace division by a constant with cheaper arithmetic operations—powers of two being the best-known case—but since count is here a dynamic runtime value, the compiler cannot help us:

```asm
next_naive(unsigned int, unsigned int):
        lea     eax, [rdi + 1]
        xor     edx, edx
        div     esi
        mov     eax, edx
        ret
```
