
How much do amd64 microarchitecture levels help in Go?


Our 64-bit Intel and AMD processors have evolved over decades. Yet when you compile a Go program for one of them, the compiler targets, by default, a nearly 20-year-old instruction set. The resulting binary runs on essentially any x64 chip, but it leaves on the table every instruction that has been added since 2003.

We often refer to microarchitecture levels. Each level bundles a set of instruction-set extensions that you can assume are present:

Level   Adds (roughly)
v1      the original AMD64 baseline (SSE2)
v2      popcnt, SSE4.2
v3      AVX2
v4      AVX-512 (F/BW/DQ/VL)

In my view, this ladder is already slightly obsolete. It was frozen around 2020, and the hardware has moved on. We would need to add the latest AVX-512 sub-extensions (VBMI, VBMI2, VNNI, BF16, FP16, VPOPCNTDQ, and so on), which recent server and consumer chips support but which v4 does not require. While v1 through v4 are a useful common language, a realistic “use everything this CPU offers” target today would need at least a v5, and arguably the whole scheme should be replaced by finer-grained feature detection.

In any case, the Go toolchain exposes this v1 through v4 ladder via the GOAMD64 environment variable. Setting GOAMD64=v3 tells the compiler it may use everything up to and including AVX2. The default is v1, the lowest common denominator.
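To make the effect of the level concrete, here is a minimal sketch (not code from the benchmark; the helper name countBits is hypothetical). It counts set bits with the standard-library function math/bits.OnesCount64. To my understanding, the Go compiler treats this call as an intrinsic on amd64: at GOAMD64=v1 it has to guard the popcnt instruction behind a runtime CPU check with a software fallback, whereas at v2 and above it can emit the instruction unconditionally.

```go
package main

import (
	"fmt"
	"math/bits"
)

// countBits returns the total number of set bits across all words.
// Hypothetical helper, used here only for illustration.
func countBits(words []uint64) int {
	total := 0
	for _, w := range words {
		// OnesCount64 is the population count of one 64-bit word.
		total += bits.OnesCount64(w)
	}
	return total
}

func main() {
	words := []uint64{0b1011, 0, ^uint64(0)}
	fmt.Println(countBits(words)) // 3 + 0 + 64 = 67
}
```

The source is identical at every level; only the environment changes, for example GOAMD64=v1 go build versus GOAMD64=v3 go build. Running go version -m on the resulting binary should list the GOAMD64 setting it was built with.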

This raises an obvious question. If I take a real, performance-sensitive library and recompile it at each level, how much do I actually gain? I picked Roaring Bitmaps, a compressed bitset data structure used in databases and search engines.

A Roaring Bitmap stores a set of 32-bit integers. It splits the 32-bit space into chunks of 65,536 values, keyed by the high 16 bits, and stores each chunk in a container that holds only the low 16 bits. A container comes in one of three shapes, and the library always keeps whichever is smallest:

an array container: a sorted list of 16-bit values, used when the chunk is sparse (a few thousand elements at most);

a bitmap container: a flat 8 KB bit vector (65,536 bits, one per possible value), used when the chunk is dense;

a run container: a list of [start, length] intervals, used when the set bits cluster into consecutive runs.
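The layout described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's actual code: the names split and smallestContainer are hypothetical, the byte counts ignore any per-container headers, and ties are resolved in favor of the bitmap. The arithmetic follows from the description: an array costs 2 bytes per value, a bitmap is always 8192 bytes, and a run costs 4 bytes (a 16-bit start and a 16-bit length).

```go
package main

import "fmt"

// A bitmap container has one bit per possible low-16-bit value.
const bitmapBytes = 65536 / 8 // 8192 bytes

// split separates a 32-bit value into its chunk key (high 16 bits)
// and the value stored inside the container (low 16 bits).
func split(x uint32) (key, low uint16) {
	return uint16(x >> 16), uint16(x)
}

// smallestContainer names the cheapest representation for a chunk
// holding `cardinality` values grouped into `runs` consecutive runs.
// Simplified size model: no headers, ties go to the bitmap.
func smallestContainer(cardinality, runs int) string {
	best, size := "bitmap", bitmapBytes
	if s := 2 * cardinality; s < size { // sorted 16-bit values
		best, size = "array", s
	}
	if s := 4 * runs; s < size { // [start, length] pairs
		best, size = "run", s
	}
	return best
}

func main() {
	key, low := split(70000) // 70000 = 1*65536 + 4464
	fmt.Println(key, low)
	fmt.Println(smallestContainer(100, 100))     // sparse, scattered values
	fmt.Println(smallestContainer(60000, 30000)) // dense, fragmented
	fmt.Println(smallestContainer(60000, 3))     // dense, a few long runs
}
```

Under this model the array stops being the smaller choice once a chunk reaches 4096 values, since 4096 × 2 bytes equals the 8192 bytes of a bitmap, which matches the “few thousand elements at most” bound given for array containers.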
