What’s SIMD? Why SIMD?
Hardware that does arithmetic is cheap, so any CPU made this century has plenty of it. But you still only have one instruction-decoding block, and it's hard to make it go much faster, so the arithmetic hardware is vastly underutilized.
To get around the instruction decoding bottleneck, you can feed the CPU a batch of numbers all at once for a single arithmetic operation like addition. Hence the name: “single instruction, multiple data,” or SIMD for short.
Instead of adding two numbers together, you can add two batches or “vectors” of numbers and it takes about the same amount of time.
On recent x86 chips these batches can be up to 512 bits in size, so in theory you can get an 8x speedup for math on u64 or a 64x speedup on u8!
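As a rough illustration, here's what that batch addition might look like using Rust's nightly portable SIMD API (std::simd, behind the portable_simd feature); the element type and lane count are just picked for the example, not prescribed by anything above.

```rust
// A minimal sketch of "single instruction, multiple data":
// one addition operates on eight u64 lanes (512 bits) at once.
// Requires a nightly compiler with the portable_simd feature.
#![feature(portable_simd)]
use std::simd::u64x8;

fn main() {
    // Two 512-bit batches, eight u64 lanes each.
    let a = u64x8::from_array([1, 2, 3, 4, 5, 6, 7, 8]);
    let b = u64x8::from_array([10, 20, 30, 40, 50, 60, 70, 80]);

    // A single SIMD addition adds all eight pairs of numbers.
    let sum = a + b;

    assert_eq!(sum.to_array(), [11, 22, 33, 44, 55, 66, 77, 88]);
}
```

On a target with 512-bit vectors this can compile down to a single vector add; on narrower hardware the compiler splits it into a few smaller vector operations.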
Instruction sets
Historically, SIMD instructions were added after the CPU architecture was already designed, so SIMD is an extension with its own marketing name on each architecture.
ARM calls theirs “NEON”, and all 64-bit ARM CPUs have it.
WebAssembly doesn’t have a marketing department, so they just call theirs “WebAssembly 128-bit packed SIMD extension”.
64-bit x86 shipped with one called “SSE2”, which provides basic instructions for 128-bit vectors, but a whole menagerie of extensions was later added on top of that, with SSE4.2 adding more operations, AVX and AVX2 adding 256-bit vectors, and AVX-512 adding 512-bit vectors.
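Since each of these x86 extensions may or may not be present on a given CPU, programs typically check for them at runtime. A hedged sketch of what that looks like in Rust, using the standard is_x86_feature_detected! macro (the feature names here are the standard detection strings, and the program simply prints what it finds):

```rust
// Report which SIMD extensions the current CPU supports at runtime.
// The detection macro only exists on x86/x86_64 targets.
#[cfg(target_arch = "x86_64")]
fn main() {
    // SSE2 is part of the x86-64 baseline, so this is always true on x86-64.
    println!("sse2:    {}", is_x86_feature_detected!("sse2"));
    println!("sse4.2:  {}", is_x86_feature_detected!("sse4.2"));
    println!("avx:     {}", is_x86_feature_detected!("avx"));
    println!("avx2:    {}", is_x86_feature_detected!("avx2"));
    println!("avx512f: {}", is_x86_feature_detected!("avx512f"));
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {
    println!("not an x86-64 target");
}
```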