ICU gets Unicode right and pays for it. This post shows a different approach: fold-safe windows, SIMD probes, and verifiers for fast UTF‑8 search.
This article is about the ugliest, but potentially most useful piece of open-source software I’ve written this year. It’s messy, because UTF-8 is messy. The world’s most widely used text encoding was introduced in 1992, and the Unicode standard behind it now assigns over 150,000 characters across most of the world’s writing systems, with a code space of more than a million code points, so it’s not exactly trivial to work with.
The example above contains multiple confusable characters: the German Eszett variants 'ß' (U+00DF, UTF-8 0xC3 0x9F) and 'ẞ' (U+1E9E, 0xE1 0xBA 0x9E); the Kelvin sign 'K' (U+212A, 0xE2 0x84 0xAA) vs ASCII 'k' (U+006B, 0x6B); and Greek mu 'μ' (U+03BC, 0xCE 0xBC) vs the micro sign 'µ' (U+00B5, 0xC2 0xB5). Try guessing which is which and how they are encoded in UTF-8!
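If you want to verify these yourself, Python’s standard library is a handy reference: it exposes both the raw UTF-8 bytes and Unicode’s full case-folding rules. A minimal sketch, using only the stdlib (no StringZilla involved):

```python
import unicodedata

# The confusables from the example, written as escapes so they are unambiguous
chars = ["\u00DF", "\u1E9E", "\u212A", "\u006B", "\u03BC", "\u00B5"]

for ch in chars:
    print(
        f"U+{ord(ch):04X}  {unicodedata.name(ch):30} "
        f"UTF-8: {ch.encode('utf-8').hex(' ').upper():10} "
        f"casefold: {ch.casefold()!r}"
    )

# Full case folding maps both Eszett variants to "ss", the Kelvin sign to
# ASCII "k", and the micro sign to Greek mu - so each pair compares equal
# under a correct case-insensitive search:
assert "\u00DF".casefold() == "\u1E9E".casefold() == "ss"
assert "\u212A".casefold() == "\u006B"
assert "\u00B5".casefold() == "\u03BC".casefold()
```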
That’s why ICU exists: it’s pretty much the only comprehensive open-source library for Unicode and UTF-8 handling, powering Chrome/Chromium and probably every OS out there. It’s feature-rich, battle-tested, and freaking slow. Now StringZilla makes some of the most common operations much faster, leveraging AVX-512 on Intel and AMD CPUs!
Namely:
- Tokenizing text into lines or whitespace-separated tokens, handling 25 different whitespace characters and 9 newline variants; available since v4.3; 10× faster than alternatives.
- Case-folding text into lowercase form, handling all 1,400+ rules and edge cases of Unicode 17’s locale-agnostic expansions; available since v4.4; 10× faster than alternatives.
- Case-insensitive substring search that bypasses case-folding, for both European and Asian languages; available since v4.5; 20–150× faster than alternatives. Or 20,000× faster, if we compare to the PCRE2 regex engine with its case-insensitive flag!
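To make the semantics of those three operations concrete, here is the same behavior expressed with Python’s standard library, which follows the same Unicode rules. This is only the reference baseline that StringZilla accelerates; the library’s own API is deliberately not shown here:

```python
text = "Stra\u00DFe\u2028KELVIN\u00A0SIGN"

# 1. Tokenizing: str.splitlines() understands Unicode newline variants like
#    U+2028 LINE SEPARATOR, and argument-less str.split() understands all
#    Unicode whitespace, including U+00A0 NO-BREAK SPACE.
print(text.splitlines())  # ['Straße', 'KELVIN\xa0SIGN']
print(text.split())       # ['Straße', 'KELVIN', 'SIGN']

# 2. Case-folding: str.casefold() applies Unicode full case folding,
#    including one-to-many expansions like "ß" -> "ss".
assert "Stra\u00DFe".casefold() == "strasse"

# 3. Case-insensitive search, the naive way: fold both sides, then search.
#    Folding may change string lengths, which is exactly what makes a fast,
#    fold-free search so hard to get right.
haystack, needle = "Ma\u00DFe und Gewichte", "MASSE"
assert haystack.casefold().find(needle.casefold()) == 0
```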
I’d like to stress that this is not just about “throughput” or “speed”; it’s about “correctness” as well! Some experimental projects try applying vectorization to broader text-processing tasks, but the vast majority are limited to ASCII or ignore the edge cases of Unicode to achieve their speedups. StringZilla, however, is tested against a synthetic suite generated on the fly from the most recent Unicode specs, so when version 18.0 of the standard comes out, updating the library to support it should be trivial. It’s also tested against ICU on real-world data to keep it correct in typical cases. Let’s dive into the details of how this was achieved.
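For a taste of what “generated on the fly from the Unicode specs” can mean in practice, here is a sketch of the idea (my own illustration, not StringZilla’s actual test harness): the Unicode Character Database ships machine-readable rules, so a full case-folding table, and test cases derived from it, can be built straight from CaseFolding.txt rather than hand-written:

```python
# CaseFolding.txt format: <code>; <status>; <mapping>; # <name>
# Statuses "C" (common) and "F" (full) together give full case folding.
from urllib.request import urlopen

URL = "https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt"

fold_table: dict[str, str] = {}
with urlopen(URL) as response:
    for raw in response.read().decode("utf-8").splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and headers
        if not line:
            continue
        code, status, mapping, _ = (field.strip() for field in line.split(";"))
        if status in ("C", "F"):
            fold_table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )

assert fold_table["\u00DF"] == "ss"  # Eszett expands to two code points
assert fold_table["\u212A"] == "k"   # Kelvin sign folds to ASCII k
```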
Today, almost all of the Internet is UTF-8. Its share grew from 50% in 2010 to 98% in 2024. The remaining ~2% is mostly legacy content in:
“ISO-8859-1” or “Latin-1” — older Western European sites
“Windows-1252” — legacy Windows encoding
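For the curious, here is a tiny illustration of why those legacy encodings still cause trouble: the very same byte means three different things in Windows-1252, Latin-1, and UTF-8.

```python
raw = b"\x9f"  # one byte, three interpretations

print(raw.decode("windows-1252"))  # 'Ÿ': a printable letter in Windows-1252
print(raw.decode("latin-1"))       # '\x9f': an unused C1 control code in Latin-1

try:
    raw.decode("utf-8")            # 0x9F is a continuation byte, invalid on its own
except UnicodeDecodeError as error:
    print(error)
```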