Illuminating the processor core with llvm-mca

Performance Tip of the Week #99: Illuminating the processor core with llvm-mca

Originally posted as Fast TotW #99 on September 29, 2025

By Chris Kennelly

Updated 2025-10-07

Quicklink: abseil.io/fast/99

The RISC versus CISC debate ended in a draw: Modern processors decompose instructions into micro-ops handled by backend execution units. Understanding how these units execute instructions can give us insight into optimizing key functions that are backend bound. In this episode, we walk through using llvm-mca to analyze functions and identify performance insights from its simulation.

Preliminaries: Varint optimization

llvm-mca, short for Machine Code Analyzer, is a tool within LLVM. It uses the same datasets that the compiler uses for making instruction scheduling decisions. This ensures that improvements made to compiler optimizations automatically flow towards keeping llvm-mca representative. The flip side is that the tool is only as good as LLVM’s internal modeling of processor designs, so certain quirks of individual microarchitecture generations might be omitted. It also models the processor behavior statically, so cache misses, branch mispredictions, and other dynamic properties aren’t considered.
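As a minimal sketch of how code reaches the tool: llvm-mca consumes assembly, and it recognizes `# LLVM-MCA-BEGIN` and `# LLVM-MCA-END` assembly comments as markers for the region to simulate. The function `ScaleAfterOffset`, the region name `offset`, the file name `example.cc`, and the `-mcpu=skylake` choice below are all hypothetical, picked only for illustration. The sketch assumes GCC or Clang inline-assembly syntax and a target whose assembler treats `#` as a comment character (x86-64, for example).

```cpp
// Hypothetical example function; not taken from Protobuf or the article.
//
// One way to feed it to the tool (file name and CPU are assumptions):
//   clang++ -O2 -S -o - example.cc | llvm-mca -mcpu=skylake
int ScaleAfterOffset(int a, int b) {
  // Emits an assembly comment that opens a named analysis region.
  // The "memory" clobber keeps the compiler from moving memory accesses
  // across the marker; it emits no instructions of its own.
  __asm volatile("# LLVM-MCA-BEGIN offset" ::: "memory");
  a += 42;
  // Closes the region; llvm-mca simulates only what lies between markers.
  __asm volatile("# LLVM-MCA-END" ::: "memory");
  a *= b;
  return a;
}
```

The markers are not free of side effects: because they act as barriers, they can change what the optimizer generates, so it is worth comparing the marked assembly with the unmarked build before trusting the numbers.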

Consider Protobuf’s VarintSize64 method:

size_t CodedOutputStream::VarintSize64(uint64_t value) {
#if PROTOBUF_CODED_STREAM_H_PREFER_BSR
  // Explicit OR 0x1 to avoid calling absl::countl_zero(0), which
  // requires a branch to check for on platforms without a clz instruction.
  uint32_t log2value = (std::numeric_limits<uint64_t>::digits - 1) -
                       absl::countl_zero(value | 0x1);
  return static_cast<size_t>((log2value * 9 + (64 + 9)) / 64);
#else
  uint32_t clz = absl::countl_zero(value);
  return static_cast<size_t>(
      ((std::numeric_limits<uint64_t>::digits * 9 + 64) - (clz * 9)) / 64);
#endif
}
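Both branches compute the same quantity: a varint carries 7 payload bits per byte, so a value whose highest set bit is at index log2value needs ceil((log2value + 1) / 7) bytes, and multiplying by 9 then dividing by 64 approximates that division by 7 closely enough to be exact across the 64-bit range. The sketch below checks this against a straightforward loop. It is not Protobuf's code: `CountLeadingZeros64` is a portable stand-in for `absl::countl_zero` so the example has no dependencies, and `VarintSizeLoop`, `VarintSizeBsr`, `VarintSizeClz`, and `CheckVarintFormulas` are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for absl::countl_zero (an assumption made to keep this
// sketch dependency-free). Returns 64 when v is 0.
inline int CountLeadingZeros64(uint64_t v) {
  int n = 0;
  for (uint64_t mask = uint64_t{1} << 63; mask != 0 && (v & mask) == 0;
       mask >>= 1) {
    ++n;
  }
  return n;
}

// Reference implementation: one byte per 7 bits, at least one byte.
inline size_t VarintSizeLoop(uint64_t value) {
  size_t bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// Mirrors the PROTOBUF_CODED_STREAM_H_PREFER_BSR branch.
inline size_t VarintSizeBsr(uint64_t value) {
  uint32_t log2value = 63 - CountLeadingZeros64(value | 0x1);
  return static_cast<size_t>((log2value * 9 + (64 + 9)) / 64);
}

// Mirrors the #else branch; 64 * 9 + 64 is digits * 9 + 64 for uint64_t.
inline size_t VarintSizeClz(uint64_t value) {
  uint32_t clz = CountLeadingZeros64(value);
  return static_cast<size_t>(((64 * 9 + 64) - (clz * 9)) / 64);
}

// Compares both formulas with the loop just below, at, and at the top of
// every power-of-two range, which is where the byte count can change.
inline void CheckVarintFormulas() {
  for (int bit = 0; bit < 64; ++bit) {
    const uint64_t lo = uint64_t{1} << bit;  // smallest value with this top bit
    const uint64_t hi = lo | (lo - 1);       // largest value with this top bit
    const uint64_t samples[] = {lo - 1, lo, hi};
    for (uint64_t v : samples) {
      assert(VarintSizeBsr(v) == VarintSizeLoop(v));
      assert(VarintSizeClz(v) == VarintSizeLoop(v));
    }
  }
}
```

Calling `CheckVarintFormulas()` exercises the zero case as well (the `lo - 1` sample when `bit` is 0), which is the input the `value | 0x1` in the first branch exists to handle.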
