
Processing Strings 109x Faster Than Nvidia on H100


I’ve just shipped StringZilla v4, the first CUDA-capable release of my SIMD-first string processing library. Which in English means that it is now fast not only on CPUs, but also on GPUs!

I’ve wanted to add ROCm-acceleration for AMD GPUs 🤦‍♂️

I’ve wanted to include a parallel multi-pattern search algorithm 🤦‍♂️

I’ve wanted to publish it back in December 2024 🤦‍♂️

So not everything went to plan, but "StringZilla 4 CUDA" is finally here, bringing 500+ GigaCUPS (billions of DP cell updates per second) of edit-distance throughput in a pip-installable package, with a few more tricks up its sleeve, aimed at large-scale Information Retrieval, Database, and Datalake systems, as well as Bioinformatics workloads. All under a permissive Apache 2.0 open-source license, free for commercial use. So in this post, we'll cover some of the most interesting parts of this release, including:

Fast evaluation of dynamic-programming algorithms on GPUs,

Hashing beyond CRC32, MurMurHash, xxHash, and aHash, and

Fingerprinting biological sequences with 52-bit integers?!
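To make the "dynamic-programming algorithms" above concrete: Levenshtein edit distance is the canonical example. Each cell of the DP matrix depends on its left, top, and top-left neighbors, which is what makes the workload hard to parallelize naively. Here is a minimal pure-Python reference sketch of the row-by-row recurrence; it is not StringZilla's implementation, which computes the same recurrence with SIMD and CUDA kernels:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic O(len(a) * len(b)) dynamic program,
    keeping only the previous row to use O(len(b)) memory."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]
```

A GPU implementation cannot walk the matrix row by row like this; the cells that are independent of each other lie on the anti-diagonals, which is what a massively parallel evaluation has to exploit.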

Background & Inspiration

Historically, StringZilla started as conference talk material in the late 2010s, showcasing the power of AVX-512 and the intricacies of vectorizing non-data-parallel workloads (… pretty much the opposite of my SimSIMD). Over the years, it expanded from a few substring search kernels into a beast competing with glibc for the fastest memcpy (yes, I know it's a popular claim). It later added support for little- & big-endian platforms; several generations of AVX-512 on x86, Arm NEON, SVE, and SVE2 extensions; dynamic dispatch; and first-party bindings for Python, Rust, JavaScript, and even Swift, all wrapping the same underlying C99 implementation.
