
NumKong: 2'000 Mixed Precision Kernels for All

Why This Matters

NumKong is a significant expansion in SIMD kernel collections: over 2,000 mixed-precision numeric kernels optimized across multiple CPU architectures and packaged for seven programming languages. That breadth makes fast, high-accuracy numerics more accessible for high-performance computing, AI, and scientific applications alike.

Key Takeaways

Over 2'000 SIMD kernels for mixed-precision BLAS-like numerics packaged for 7 programming languages — from Float6 to Float128, on RISC-V, Intel AMX, AVX2 & AVX-512 on x86, Arm SME & SVE, and Relaxed WASM SIMD in 5 MB or less.

These are a few lines of celebratory “proud-dad” ramblings and highlights from my largest open-source release to date. I’m killing my SimSIMD project and re-launching it under a new name — NumKong — StringZilla’s big brother. Over 2'000 SIMD kernels for mixed-precision numerics, spread across 200'000 lines of code & docstrings, in 7 languages. It’s one of the largest such collections online — pretty much the same size as OpenBLAS, the default BLAS (Basic Linear Algebra Subprograms) backend for NumPy (detailed comparison below).

What’s inside?

RISC-V Vector Extensions, Intel AMX & Arm SME Tiles

From Vectors to Matrices and Higher-rank Tensors

From BFloat16 and Float16 to Float6 — E3M2 & E2M3 on any CPU
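To make the tiny formats concrete, here is a toy decoder for the 6-bit E2M3 layout, assuming the OCP MX-style convention (1 sign, 2 exponent, 3 mantissa bits; exponent bias 1; subnormals; no infinities or NaNs). NumKong's exact encoding may differ:

```python
def decode_e2m3(bits: int) -> float:
    """Decode a 6-bit E2M3 value (toy reference, OCP MX-style layout assumed)."""
    sign = -1.0 if (bits >> 5) & 1 else 1.0
    exp = (bits >> 3) & 0b11   # 2 exponent bits
    man = bits & 0b111         # 3 mantissa bits
    if exp == 0:               # subnormal: 0.mmm * 2^(1 - bias)
        return sign * (man / 8.0)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 1)
```

With this layout the representable range tops out at `decode_e2m3(0b011111) == 7.5`, which is why such formats only make sense with per-block scaling.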

Native Int4 & UInt4 Dot Products via Nibble Algebra

Neumaier & Dot2 for higher-than-BLAS precision
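Neumaier summation is the classic trick behind such kernels: a second accumulator captures the low-order bits that naive summation drops. A minimal scalar Python reference of the idea (NumKong's kernels vectorize the same recurrence):

```python
def neumaier_sum(values) -> float:
    """Kahan-Neumaier compensated summation: a running compensation term
    recovers the rounding error of each floating-point addition."""
    total = 0.0
    compensation = 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x   # low-order bits of x were lost
        else:
            compensation += (x - t) + total   # low-order bits of total were lost
        total = t
    return total + compensation
```

On the classic stress case `[1.0, 1e100, 1.0, -1e100]`, naive summation returns `0.0` while the compensated version recovers the exact `2.0`.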

Ozaki Scheme for Float64 GEMMs via Float32 Tile Hardware
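The idea behind the Ozaki scheme is to split each Float64 operand into low-precision slices, run the matrix products in fast Float32 hardware, and re-accumulate in Float64. The toy two-way split below illustrates only the shape of the trick; the real scheme masks significand bits so that every partial product is exact in the low precision, which this sketch does not do:

```python
import numpy as np

def split_f32(a: np.ndarray):
    """Split a float64 array into high + low float32 slices, so a ≈ hi + lo."""
    hi = a.astype(np.float32)
    lo = (a - hi.astype(np.float64)).astype(np.float32)
    return hi, lo

def ozaki_like_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Toy split GEMM: float32 matmuls, accumulated in float64.
    Unlike the real Ozaki scheme, each float32 matmul still rounds,
    and the second-order lo @ lo term is dropped."""
    a_hi, a_lo = split_f32(a)
    b_hi, b_lo = split_f32(b)
    return (a_hi @ b_hi).astype(np.float64) \
         + (a_hi @ b_lo).astype(np.float64) \
         + (a_lo @ b_hi).astype(np.float64)
```

On tile hardware like Intel AMX, each of those partial matmuls maps to one pass over the tiles, so the extra precision costs a small constant factor rather than a fallback to scalar Float64 code.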

Haversine & Vincenty for Geospatial — 5'300x faster than GeoPy
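For reference, the Haversine great-circle distance those kernels accelerate looks like this in plain Python (spherical Earth with a mean radius of 6371 km assumed; Vincenty's ellipsoidal formulae are more accurate but iterative):

```python
from math import radians, sin, cos, asin, sqrt

def haversine(lat1: float, lon1: float, lat2: float, lon2: float,
              radius_km: float = 6371.0) -> float:
    """Great-circle distance in km between two (latitude, longitude) points,
    given in degrees, on a sphere of the given radius."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = radians(lat2 - lat1)
    d_lambda = radians(lon2 - lon1)
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2.0 * radius_km * asin(sqrt(a))
```

The speedup over GeoPy comes from batching many coordinate pairs through SIMD trigonometry rather than calling scalar math functions one pair at a time.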

... continue reading