Faster substring search with SIMD in Zig
I’ve been learning a lot about low-level programming languages lately, and one thing has interested me for a long time: SIMD (or ‘single instruction, multiple data’) code. I’ve seen plenty of articles describing massive performance gains from SIMD and wanted to learn how to achieve them myself.
This article is a journey into implementing ~60% faster substring searching compared to Zig’s std.mem.indexOf using a SIMD-friendly algorithm.
This is the baseline function that we will be comparing against:
fn find_substr(needle: []const u8, haystack: []const u8) ?usize {
    return std.mem.indexOf(u8, haystack, needle);
}
It’s the closest thing to a substring search function in Zig’s standard library. It returns the index of the first occurrence of the needle – or null if it isn’t found.
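As a quick sanity check (my own test snippet, not from the article’s repository), this is how the baseline behaves on the example sentence used in the next section:

const std = @import("std");

test "find_substr returns the index of the first match" {
    const haystack = "It was a beautiful, bounteous, blue day";
    // "blue" starts at byte offset 31; a needle that never occurs gives null.
    try std.testing.expectEqual(@as(?usize, 31), find_substr("blue", haystack));
    try std.testing.expectEqual(@as(?usize, null), find_substr("green", haystack));
}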
SIMD algorithm #
This algorithm is taken directly from Wojciech Muła’s fantastic article SIMD-friendly algorithms for substring searching, which seems to have the best algorithms for finding substrings in a large body of text.
Here’s how the algorithm works: say that we want to find the index of the word “blue” (the needle ) in “It was a beautiful, bounteous, blue day” (the haystack ). First, we extract the first and last character of the needle (‘b’ and ’e’) and store them in a variable.
Then we will loop through all of the characters in haystack , loading the next 32 characters (bytes) from memory into a SIMD register and comparing each character (byte) in the register with ‘b’. This will result in a mask containing 32 bytes, 1 if the character is ‘b’ and 0 in all other cases.
We will do the same with the last character, but load the characters with an offset ( needle.len - 1 ).
The offset lines the two masks up: a bit in Last is set when the character needle.len - 1 positions further along matches the last character of the needle. Without it, any match that starts in one 32‑byte chunk and ends in the next would be missed. This method also handles needles that are longer than 32 characters.
The result will be two bit masks, First and Last , where we can use bit-wise AND ( Result = First & Last ) to figure out potential substring occurrences.
Result will be 1 only when there is a ‘b’ at index i followed by an ’e’ at index i+3 . We still need to check if those positions actually contain the value “blue”, but this still dramatically reduces the number of checks (= individual memory accesses) that are necessary. We’ll see how this works in practice in the next section.
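Before diving into code, here is the example worked out by hand (my own trace; offsets count bytes from the start of the haystack):

- ‘b’ occurs at offsets 9, 20 and 31, so First has bits 9, 20 and 31 set.
- ‘e’ occurs at offsets 10, 25 and 34; since Last is loaded with an offset of 3, it has bits 7, 22 and 31 set.
- First & Last leaves only bit 31 – the actual match – so only one candidate has to be verified byte by byte.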
Implementation in Zig #
First, to properly use SIMD, let’s assume that the CPU supports AVX2 (Advanced Vector Extensions 2) and has 256-bit wide registers.
Most x86-64 desktop processors from the last ten years support AVX2, and some newer ones (for example AMD Zen 4) also support AVX-512 with 512-bit wide registers.
This allows us to use Zig’s @Vector function to make a type:
const Block = @Vector(32, u8); // number of elements, element type (32 * 8 = 256 bits)
By using Block , we are telling the compiler that the operations on this datatype should use SIMD instructions where possible.
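@Vector code compiles for any target – the compiler falls back to narrower registers or scalar code when wide SIMD isn’t available – so if you want a hard guarantee that the build really uses AVX2, one option (my own addition, not part of the article’s code) is a compile-time feature check:

const std = @import("std");
const builtin = @import("builtin");

// Refuse to build unless the target is x86-64 with AVX2 enabled.
comptime {
    if (builtin.cpu.arch != .x86_64 or
        !std.Target.x86.featureSetHas(builtin.cpu.features, .avx2))
    {
        @compileError("this code assumes x86-64 with AVX2 (e.g. build with -mcpu=x86_64_v3)");
    }
}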
Next, we take the first and last letters of the search word (the ’needle’) and load them into two SIMD registers, so that every byte of each register is filled with the corresponding character. This is handled by another built-in function, @splat:
const first_letter: Block = @splat(needle[0]);
const last_letter: Block = @splat(needle[needle.len - 1]);
In the main loop, we check that there are enough characters left in haystack to read the next 32 + needle.len characters. Inside the loop body, we load the two blocks that we’re going to compare against first_letter and last_letter:
const n = haystack.len;
const k = needle.len;

var i: usize = 0;
while (i + k + 32 <= n) : (i += 32) {
    const first_block: Block = haystack[i..][0..32].*;
    const last_block: Block = haystack[i + k - 1 ..][0..32].*;
Now we can make the comparisons and combine them into a mask:
    // ...
    const eq_first = first_letter == first_block;
    const eq_last = last_letter == last_block;

    var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
Here we can use an IntegerBitSet from Zig’s standard library. We construct it by casting the result of eq_first & eq_last into a 32-bit integer. If the resulting mask is non-zero, there are candidates in the current block.
    // ...
    while (mask.findFirstSet()) |bitpos| {
        if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
            return i + bitpos;
        }
        _ = mask.toggleFirstSet();
    }
The first and last characters are already known to match, so the verification skips the first one and only compares the rest of the needle.
Finally, if there are leftover characters at the end of the haystack, we fall back to std.mem.indexOf.
    // ...

    // Fallback to scalar search for the tail
    if (i < n) {
        if (std.mem.indexOf(u8, haystack[i..], needle)) |rel_idx| {
            return i + rel_idx;
        }
    }
    return null; // no substring found
}
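For readability, here is the whole function assembled from the snippets above (a sketch; the guard for empty or over-long needles is my own addition, and names may differ slightly from the repository):

const std = @import("std");

const Block = @Vector(32, u8);

fn find_substr_simd(needle: []const u8, haystack: []const u8) ?usize {
    // Degenerate needles: let the standard library handle them.
    if (needle.len == 0 or needle.len > haystack.len) {
        return std.mem.indexOf(u8, haystack, needle);
    }

    const n = haystack.len;
    const k = needle.len;

    const first_letter: Block = @splat(needle[0]);
    const last_letter: Block = @splat(needle[k - 1]);

    var i: usize = 0;
    while (i + k + 32 <= n) : (i += 32) {
        const first_block: Block = haystack[i..][0..32].*;
        const last_block: Block = haystack[i + k - 1 ..][0..32].*;

        const eq_first = first_letter == first_block;
        const eq_last = last_letter == last_block;

        var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
        while (mask.findFirstSet()) |bitpos| {
            if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
                return i + bitpos;
            }
            _ = mask.toggleFirstSet();
        }
    }

    // Scalar fallback for the tail that didn't fill a whole block.
    if (i < n) {
        if (std.mem.indexOf(u8, haystack[i..], needle)) |rel_idx| {
            return i + rel_idx;
        }
    }
    return null;
}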
To properly show the effects of our SIMD algorithm, we’re going to need a large haystack. For this, I’ve chosen the entirety of Moby Dick in plain text, with the search word ’newsletter’, which appears at the very end of the text.
The code is available on GitHub
To compile the code, I ran zig build with -Doptimize=ReleaseFast :
> zig build -Doptimize=ReleaseFast
Support for bitwise operations on boolean vectors was added in Zig 0.15, which is unreleased as of now (August 2025). If you want to run the code on your system, you need to build Zig from the master branch.
To measure performance and compare against baseline, I’ll use one of my favorite CLI tools, poop:
> poop -d 10000 "./zig-out/bin/substr" "./zig-out/bin/substr --simd"
Benchmark 1 (6361 runs): ./zig-out/bin/substr
  measurement          mean ± σ            min … max            outliers     delta
  wall_time            1.22 ms ± 185 us    903 us … 5.33 ms     242 ( 4%)    0%
  peak_rss             1.20 MB ± 290       1.18 MB … 1.20 MB      2 ( 0%)    0%
  cpu_cycles           2.15 M ± 40.5 K     2.10 M … 2.71 M      312 ( 5%)    0%
  instructions         1.85 M ± 0.75       1.85 M … 1.85 M       56 ( 1%)    0%
  cache_references     43.8 K ± 620        38.3 K … 44.9 K        9 ( 0%)    0%
  cache_misses         19.0 K ± 10.3 K     4.08 K … 33.6 K        0 ( 0%)    0%
  branch_misses        48.1 ± 17.4         20 … 104              97 ( 2%)    0%

Benchmark 2 (10000 runs): ./zig-out/bin/substr --simd
  measurement          mean ± σ            min … max            outliers     delta
  wall_time            500 us ± 96.9 us    397 us … 4.23 ms     840 ( 8%)    ⚡ -58.9% ± 0.4%
  peak_rss             1.20 MB ± 164       1.18 MB … 1.20 MB      1 ( 0%)    +0.0% ± 0.0%
  cpu_cycles           369 K ± 36.1 K      340 K … 1.10 M      1167 (12%)    ⚡ -82.8% ± 0.1%
  instructions         578 K ± 0.53        578 K … 578 K          6 ( 0%)    ⚡ -68.8% ± 0.0%
  cache_references     38.8 K ± 545        34.1 K … 40.5 K        6 ( 0%)    ⚡ -11.4% ± 0.0%
  cache_misses         5.62 K ± 4.97 K     2.11 K … 27.9 K     1529 (15%)    ⚡ -70.3% ± 1.2%
  branch_misses        2.88 K ± 23.4       2.81 K … 3.09 K      453 ( 5%)    💩 +5879.8% ± 1.4%
As you can see, for a large body of text the speedup is noticeable: 59% faster with over 80% fewer CPU cycles!
The SIMD version only took 500 microseconds to complete on average, including the overhead of loading the program into memory and printing the result. 500 microseconds is half a millisecond. That’s how fast my laptop searches through a whole book, 200 000 words, cover to cover. This is why computers are so powerful! How long would it take for a human to do that?
This is quite a large improvement, and proves that the SIMD code is actually working (otherwise the reduction in CPU cycles wouldn’t be so massive). Can we do even better though?
Character selection #
You may notice from the output of poop that the number of branch misses has absolutely blown up – from 48 on average to 2.88k !
Why does this happen? To find out, let’s count how many times the body of the inner while loop runs – that is, how many candidate positions the combined mask flags before the real match is found:
var i: usize = 0;
var count: usize = 0;
while (i + k + 32 <= n) : (i += 32) {
    const block_first: Block = haystack[i..][0..32].*;
    const block_last: Block = haystack[i + k - 1 ..][0..32].*;

    const eq_first = first == block_first;
    const eq_last = last == block_last;

    var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_first & eq_last) };
    while (mask.findFirstSet()) |bitpos| {
        count += 1;
        if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
            std.debug.print("found match with count: {}\n", .{count});
            return i + bitpos;
        }
        _ = mask.toggleFirstSet();
    }
}
found match with count: 2792
The fact that count (2792) is so close to the average number of branch misses (2.88 K) suggests that nearly every candidate flagged by the mask costs us a branch misprediction.
Unfortunately, there is no obvious way to prevent this with the current algorithm. The state of the art seems to be to choose the two bytes of the needle that occur least frequently in typical text, according to a pre-calculated frequency distribution. This is what the memchr crate in Rust does, as explained by the author in this comment on Hacker News.
For example, the needle newsletter has the rarest characters w at index 2 and l at index 4 .
The function in memchr can be found here. I’ve ported it into Zig, and you can see it here.
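The exact port is in the repo, but the core idea fits in a few lines. Here is a rough sketch (my own simplification, not the actual ported function), assuming a precomputed 256-entry rank table where a lower value means a rarer byte:

const std = @import("std");

// Returns the indices of the two rarest bytes of the needle, rarest first.
// `rank` is a hypothetical precomputed table (lower value = rarer byte).
fn find_rarest_sketch(needle: []const u8, rank: *const [256]u8) [2]usize {
    std.debug.assert(needle.len >= 2);

    var rare1: usize = 0; // index of the rarest byte seen so far
    var rare2: usize = 1; // index of the second-rarest byte
    if (rank[needle[rare2]] < rank[needle[rare1]]) {
        const tmp = rare1;
        rare1 = rare2;
        rare2 = tmp;
    }

    for (needle[2..], 2..) |byte, i| {
        if (rank[byte] < rank[needle[rare1]]) {
            rare2 = rare1;
            rare1 = i;
        } else if (rank[byte] < rank[needle[rare2]]) {
            rare2 = i;
        }
    }
    return .{ rare1, rare2 };
}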
const needle_index_pair = find_rarest(needle);

const first_letter: Block = @splat(needle[needle_index_pair[0]]);
const first_offset = needle_index_pair[0];

const second_letter: Block = @splat(needle[needle_index_pair[1]]);
const second_offset = needle_index_pair[1];
The algorithm is exactly the same, but the offsets used for first_letter and second_letter now come from the pre-calculated frequency distribution instead of always being 0 and needle.len - 1.
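The adjusted main loop then looks roughly like this (my sketch of the change, not the exact code from the repo). Note that the candidate now has to be verified against the whole needle, since the two checked bytes no longer sit at its ends:

while (i + k + 32 <= n) : (i += 32) {
    const block_a: Block = haystack[i + first_offset ..][0..32].*;
    const block_b: Block = haystack[i + second_offset ..][0..32].*;

    const eq_a = first_letter == block_a;
    const eq_b = second_letter == block_b;

    var mask: std.bit_set.IntegerBitSet(32) = .{ .mask = @bitCast(eq_a & eq_b) };
    while (mask.findFirstSet()) |bitpos| {
        // Bit `bitpos` means both rare bytes matched for a candidate
        // starting at i + bitpos, so compare the full needle there.
        if (std.mem.eql(u8, haystack[i + bitpos ..][0..k], needle)) {
            return i + bitpos;
        }
        _ = mask.toggleFirstSet();
    }
}

Benchmarking this variant (--simdv2) against the previous SIMD version gives: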
Benchmark 1 (10000 runs): ./zig-out/bin/substr --simd
  measurement          mean ± σ            min … max            outliers     delta
  wall_time            472 us ± 62.9 us    400 us … 1.62 ms     735 ( 7%)    0%
  peak_rss             1.20 MB ± 0         1.20 MB … 1.20 MB      0 ( 0%)    0%
  cpu_cycles           376 K ± 44.7 K      347 K … 1.46 M      1213 (12%)    0%
  instructions         578 K ± 0.54        578 K … 578 K         10 ( 0%)    0%
  cache_references     38.7 K ± 715        28.3 K … 40.6 K       96 ( 1%)    0%
  cache_misses         7.37 K ± 5.83 K     2.78 K … 27.7 K     1608 (16%)    0%
  branch_misses        2.88 K ± 23.4       2.82 K … 3.08 K      415 ( 4%)    0%

Benchmark 2 (10000 runs): ./zig-out/bin/substr --simdv2
  measurement          mean ± σ            min … max            outliers     delta
  wall_time            429 us ± 75.5 us    369 us … 3.85 ms     393 ( 4%)    ⚡ -9.1% ± 0.4%
  peak_rss             1.20 MB ± 0         1.20 MB … 1.20 MB      0 ( 0%)    -0.0% ± 0.0%
  cpu_cycles           304 K ± 28.4 K      282 K … 1.07 M      1140 (11%)    ⚡ -19.2% ± 0.3%
  instructions         561 K ± 0.52        561 K … 561 K          5 ( 0%)    ⚡ -2.9% ± 0.0%
  cache_references     38.7 K ± 610        29.9 K … 40.3 K       25 ( 0%)    -0.1% ± 0.0%
  cache_misses         5.21 K ± 3.53 K     2.57 K … 27.3 K     1306 (13%)    ⚡ -29.3% ± 1.8%
  branch_misses        1.07 K ± 14.0       1.02 K … 1.17 K      275 ( 3%)    ⚡ -62.8% ± 0.0%
Compared to the previous SIMD version, the number of branch misses has dropped by over 60%, and it’s 9% faster too. Nice!
The lower number of branch misses can explain some of the speedup, but I suspect a much bigger factor is that there are fewer false positives, which means fewer byte-by-byte memory accesses and comparisons.
Starting with Intel Cannon Lake and AMD Zen 4, some CPUs also support a newer SIMD instruction set, AVX-512, with 512-bit wide registers – double the width of AVX2. I don’t have a computer with AVX-512 right now, but I suspect that changing the Zig code to process 64 characters at once would lead to even better results.
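The change would be mostly mechanical – something like the sketch below (untested, since I have no AVX-512 machine to try it on; first_letter and last_letter would be splatted into the wider block in the same way):

const Block = @Vector(64, u8); // 64 * 8 = 512 bits

// The loop is identical apart from the wider block and the 64-byte stride.
while (i + k + 64 <= n) : (i += 64) {
    const first_block: Block = haystack[i..][0..64].*;
    const last_block: Block = haystack[i + k - 1 ..][0..64].*;

    const eq_first = first_letter == first_block;
    const eq_last = last_letter == last_block;

    var mask: std.bit_set.IntegerBitSet(64) = .{ .mask = @bitCast(eq_first & eq_last) };
    while (mask.findFirstSet()) |bitpos| {
        if (std.mem.eql(u8, haystack[i + bitpos + 1 ..][0 .. k - 1], needle[1..])) {
            return i + bitpos;
        }
        _ = mask.toggleFirstSet();
    }
}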
A smaller haystack #
It’s clear that with a very large haystack, the SIMD version is much faster. But what about a tiny input, say fewer than a hundred characters?
I did a bit of benchmarking with poop, but I couldn’t measure the speed accurately, since both versions finish extremely quickly. Instead, I used zBench to run a microbenchmark, with a short snippet from Moby Dick (as seen here) as the haystack.
benchmark              runs      total time    time/run (avg ± σ)    (min ... max)               p75        p99        p995
----------------------------------------------------------------------------------------------------------------------------
find_substr            100000    424.368ms     4.243us ± 740ns       (3.964us ... 107.923us)     4.187us    7.075us    7.245us
find_substr_simd_v2    100000    147.883ms     1.478us ± 186ns       (1.417us ... 21.354us)      1.483us    1.539us    1.548us
I was surprised to see that even when processing less than a hundred characters, the SIMD algorithm is still faster! In absolute terms the gap between roughly 4 μs and 1.5 μs is tiny, but the SIMD version still comes out ahead.
As you can see, SIMD can be used to make substring searching dramatically faster, for both very large and very small strings.
But if it’s so much better, then why haven’t I made a pull request to change std.mem.indexOf to use SIMD? Well, there are a couple of reasons:

1. std.mem.indexOf is generic over the element size, and element sizes larger than u8 make this algorithm much slower.
2. The algorithm used in std.mem.indexOf is cross-platform, while the SIMD code wouldn’t be (not all platforms have SIMD registers at all, and Arm only has 128-bit ones).
Substring searching is rarely the bottleneck in programs, especially ones written in a fast language like Zig. That’s why I don’t personally think it would be worth it to add it to the standard library.
Still, it was great to learn about this advanced optimization technique and see some concrete performance measurements from it!
The full code is available on GitHub here.
Further reading #