Faster C software with Dynamic Feature Detection


I've been building some software recently whose performance is very sensitive to the capabilities of the CPU on which it's running. A portable version of the code does not perform all that well, but we cannot guarantee the presence of the optional Instruction Set Architecture (ISA) extensions that would speed it up. What to do? That's what we'll be looking at today, mostly for the wildly popular x86-64 family of processors (but the general techniques apply anywhere).

Make it the compiler's problem.

Compilers are very good at optimising for a particular target CPU microarchitecture when you name one, for example with -march=native (or e.g. -march=znver3). They know, amongst other things, the ISA capabilities of these CPUs, and they will quietly take advantage of them at the cost of portability.

So the first way to speed up C software is to build for a more recent architecture where the compiler has the tools to speed the code up for you. This won't work for every problem or scenario, but if it's an option for you, it's very easy.
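To see what the compiler has been allowed to assume for a given build, you can look at the feature macros it predefines. What follows is a minimal sketch, assuming GCC or Clang, which define macros such as __AVX2__ when -march (or a flag like -mavx2) permits those instructions. The names append_feature and compiled_isa_features are hypothetical, made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append "name " to buf if it still fits. */
void append_feature(char *buf, size_t len, const char *name)
{
    size_t used = strlen(buf);
    /* Room needed: the name, one space, and the terminating NUL. */
    if (used + strlen(name) + 2 <= len) {
        strcat(buf, name);
        strcat(buf, " ");
    }
}

/* Fills buf with the ISA extensions the compiler was told it may use.
   This reflects build-time flags only, not the CPU the binary runs on. */
const char *compiled_isa_features(char *buf, size_t len)
{
    if (len == 0)
        return buf;
    buf[0] = '\0';
#ifdef __POPCNT__
    append_feature(buf, len, "popcnt");
#endif
#ifdef __SSE4_2__
    append_feature(buf, len, "sse4.2");
#endif
#ifdef __AVX2__
    append_feature(buf, len, "avx2");
#endif
#ifdef __BMI2__
    append_feature(buf, len, "bmi2");
#endif
#ifdef __AVX512F__
    append_feature(buf, len, "avx512f");
#endif
    return buf;
}
```

Printing the result from a build with default flags and again from a build with -march=native should show the list growing on a modern machine. Anything in that list is an instruction set the compiler may use anywhere in the program, which is exactly why the resulting binary can fail on an older CPU.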

This works surprisingly well on x86-64 because it's now a very mature architecture. But this also means that there's a wide span of capabilities between the original x86-64 CPUs and the CPUs you can buy nowadays. To help make things a bit more digestible, AMD, Intel, Red Hat and SUSE jointly defined microarchitecture levels, with each level including all the features of its predecessors:

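| Level     | Contains e.g.  | Intel                    | AMD              |
|-----------|----------------|--------------------------|------------------|
| x86-64-v1 | (base)         | All 64 bit               | All 64 bit       |
| x86-64-v2 | POPCNT, SSE4.2 | 2008 (Nehalem/Westmere)  | 2011 (Bulldozer) |
| x86-64-v3 | AVX2, BMI2     | 2013 (Haswell/Broadwell) | 2015 (Excavator) |
| x86-64-v4 | AVX-512[1]     | 2017 (Skylake)           | 2022 (Zen 4)     |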

[1] AVX-512 is not actually one feature, but v4 includes the most useful parts of it.
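The levels can also be approximated at run time. Here is a minimal sketch, assuming GCC or Clang on x86-64, where the __builtin_cpu_supports builtin reports what the running CPU offers. It is a simplification: for v2 and v3 it tests only the example features from the table, whereas the real level definitions require more than that. For v4 it tests the F, BW, CD, DQ and VL subsets of AVX-512. The function name detected_level is hypothetical.

```c
/* Returns an approximate x86-64 level (1 to 4) for the running CPU,
   or 0 when not built for x86-64 with GCC or Clang.
   Assumption: v2 and v3 are judged only by the table's example features. */
int detected_level(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();   /* harmless if the runtime already did this */

    /* v2 examples from the table: POPCNT and SSE4.2 */
    if (!(__builtin_cpu_supports("popcnt") &&
          __builtin_cpu_supports("sse4.2")))
        return 1;

    /* v3 examples from the table: AVX2 and BMI2 */
    if (!(__builtin_cpu_supports("avx2") &&
          __builtin_cpu_supports("bmi2")))
        return 2;

    /* v4: the AVX-512 subsets the level includes */
    if (!(__builtin_cpu_supports("avx512f") &&
          __builtin_cpu_supports("avx512bw") &&
          __builtin_cpu_supports("avx512cd") &&
          __builtin_cpu_supports("avx512dq") &&
          __builtin_cpu_supports("avx512vl")))
        return 3;

    return 4;
#else
    return 0;   /* not x86-64, or a compiler without these builtins */
#endif
}
```

Because each level includes its predecessors, the checks run from the lowest level upwards and stop at the first one the CPU fails.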

There are some gotchas I won't dwell on: not all kit released after these dates has these capabilities, or implements them well. In particular, there have been:

Slow implementations of some instructions (e.g. PEXT/PDEP from BMI2 on AMD CPUs before Zen 3)
