AVX2 is slower than SSE2-4.x under Windows ARM emulation

If you compile your app for AVX2 and it runs on Windows ARM under Prism emulation, is it faster or slower than compiling for SSE2-4.x?

I assumed it would be roughly the same — maybe slightly slower due to emulation overhead, but AVX2's wider operations would compensate. The headline gives it away: I was wrong.

💡 TLDR: AVX2 code runs at 2/3 the speed of equivalent SSE2-SSE4.x optimised code under emulation on Windows 11 ARM.

'Should I compile for AVX2 if my app might run on Windows ARM?' has a clear answer: No. At least if performance matters.

This post explains how I came to ask the question, what I measured and how, the benchmark results, and why AVX2 comes out slower.

Curiosity

A few weeks ago, in a Hacker News thread on the emulated performance of WoW (the game) on Windows ARM, I wondered:

I’ve been testing some math benchmarks on ARM emulating x64, and saw very little performance improvement with the AVX2+FMA builds, compared to the SSE4.x level. (X64 v2 to v3.) ... I’ve found very little info online about this.

Well, I nerdsniped myself: those math benchmarks are now complete, so we have the perfect framework for testing the performance overhead of AVX2+FMA emulation on ARM Windows. I have no technical reason to do so: if you use our compiler and want to run your app on Windows ARM, we encourage you to just compile it for Windows ARM. I simply want to know.
