When we shipped Auto Embeddings — the feature that turns any text column into a vector automatically, with no separate model service to run — the most common piece of feedback was about speed. The previous path went through SentenceTransformers on top of Candle , Hugging Face's pure-Rust ML inference runtime, and it left a lot of CPU on the floor: most workloads sat in the low-double-digits of docs/sec no matter how we fed them, and concurrent calls serialised on a single model session.
So we spent a few weeks rebuilding how Manticore runs ONNX models. The new ONNX Runtime backend shipped in Manticore Search 27.1.5 . ONNX (Open Neural Network Exchange) is the portable model format that most of the popular open-source embedding models — MiniLM, BGE, E5, and friends — already publish. The result is a backend that's ~14× faster on average than the previous SentenceTransformers/Candle path on the same hardware (average cheap 16 cores / 32 threads server), same model, same weights, averaged over the full threads × batch workload grid — and that advantage holds whether you run 1 client thread or 32. The old path stayed in the 5–11 docs/sec range across the entire grid; the new one lives in the 70–230 docs/sec band.
This post is the engineering log: what we tried, what surprised us, what we threw away, and what the final design looks like.
TL;DR
~14× faster on average than the previous SentenceTransformers/Candle path , averaged across the full threads × batch workload grid (1 / 2 / 4 / 8 / 16 / 32 threads × batch sizes 1…128) on the same box (16 cores / 32 threads), same model, same weights.
, averaged across the full workload grid (1 / 2 / 4 / 8 / 16 / 32 threads × batch sizes 1…128) on the same box (16 cores / 32 threads), same model, same weights. Released in Manticore Search 27.1.5 , the new ONNX path is now the default fast path for any HuggingFace model that ships an .onnx file.
file. On all-MiniLM-L12-v2 , the old Candle path sat at 5–11 docs/sec across every configuration we tried. The new ONNX path lands in the 70–230 docs/sec range — the same ~14× margin holds whether you run 1 client thread or 32 .
, the old Candle path sat at across every configuration we tried. The new ONNX path lands in the range — the . Single-insert latency on our test box: ~14 ms with a single client, ~56 ms under 8-way concurrent load — both well below the 200+ ms Candle was hitting.
with a single client, under 8-way concurrent load — both well below the 200+ ms Candle was hitting. Want maximum bulk ingest throughput? Use a high batch size (32–128) on a single client thread . The new backend parallelises inside the call, so client-side fan-out just piles coordination overhead on top — peak on our box was 233 docs/sec at 1 thread + batch=64 .
Use a (32–128) on a . The new backend parallelises inside the call, so client-side fan-out just piles coordination overhead on top — peak on our box was . The two changes that mattered most: turning intra_op_spinning off , and giving up on batching documents inside the worker.
... continue reading