If you compile your app for AVX2 and it runs on Windows ARM under Prism emulation, is it faster or slower than compiling for SSE2-4.x?
I assumed it would be roughly the same — maybe slightly slower due to emulation overhead, but AVX2's wider operations would compensate. The headline gives it away: I was wrong.
💡 TLDR: AVX2 code runs at 2/3 the speed of equivalent SSE2-SSE4.x optimised code under emulation on Windows 11 ARM.
'Should I compile for AVX2 if my app might run on Windows ARM?' has a clear answer: No. At least if performance matters.
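To put the 2/3 figure in concrete terms, here is a minimal sketch that converts a relative speed into a runtime. The 60-second workload and the function name are made up for illustration; only the 2/3 ratio comes from the measurements in this post.

```python
from fractions import Fraction


def emulated_runtime(baseline_seconds, relative_speed):
    """Runtime of a build that runs at `relative_speed` times the baseline's speed."""
    return baseline_seconds / relative_speed


# From the TLDR: the AVX2 build runs at 2/3 the speed of the SSE2-SSE4.x build.
avx2_relative_speed = Fraction(2, 3)

# Hypothetical workload: assume the SSE build takes 60 seconds under emulation.
sse_seconds = 60
avx2_seconds = emulated_runtime(sse_seconds, avx2_relative_speed)

print(f"SSE build:  {sse_seconds} s")
print(f"AVX2 build: {avx2_seconds} s ({avx2_seconds / sse_seconds}x the runtime)")
```

In other words, running at 2/3 the speed means the same work takes one and a half times as long.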
This post covers how I found out, what I measured and how I measured it, the benchmark results, and why AVX2 comes out slower.
Curiosity
A few weeks ago, in a Hacker News thread on the emulated performance of WoW (the game) on Windows ARM, I wondered:
I’ve been testing some math benchmarks on ARM emulating x64, and saw very little performance improvement with the AVX2+FMA builds, compared to the SSE4.x level. (X64 v2 to v3.) ... I’ve found very little info online about this.
Well, I nerdsniped myself: those math benchmarks are now complete, so we have the perfect framework for testing the performance overhead of emulating AVX2+FMA on Windows ARM. I have no technical reason to do so. If you use our compiler and want to run your app on Windows ARM, we encourage you to just compile it for Windows ARM. It's simply that I want to know.