We recently used Qwen3-Embedding-0.6B to embed millions of text documents while sustaining near-100% GPU utilization the whole way.
That’s usually the gold standard that machine learning engineers aim for… but here’s the twist: in the time it took to write this blog post, we found a way to make the same workload 3× faster, and it didn’t involve maxing out GPU utilization at all. That story’s for another post, but first, here’s the recipe that got us to near-100%.
The workload
Here at the Daft kitchen, the same order keeps coming in: “One fast, painless pipeline to get my documents into a vector database for retrieval!”
Heard.
We whipped up a sample workload that:
1. Reads millions of text documents from S3
2. Chunks them into sentences using spaCy
3. Computes embeddings with the state-of-the-art model Qwen3-Embedding-0.6B
4. Writes the results to turbopuffer
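Steps 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline: it uses spaCy's rule-based `sentencizer` (rather than the `en_core_web_sm` model installed below) so it runs without a model download, and it wraps the `sentence-transformers` embedding call in a function so the Qwen model only loads when invoked.

```python
# Sketch of the chunk-then-embed steps (assumptions noted above).
import spacy

# Lightweight sentence splitter; the post's pipeline uses en_core_web_sm instead.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

def chunk_into_sentences(text: str) -> list[str]:
    """Split a document into sentence-level chunks."""
    return [sent.text.strip() for sent in nlp(text).sents]

def embed(sentences: list[str]):
    """Embed sentence chunks with Qwen3-Embedding-0.6B (downloads the model on first use)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
    return model.encode(sentences)  # one embedding vector per sentence

chunks = chunk_into_sentences("Daft reads documents. spaCy chunks them. Qwen embeds them.")
print(chunks)  # → ['Daft reads documents.', 'spaCy chunks them.', 'Qwen embeds them.']
```

In the real workload these functions run distributed over millions of documents; the embedding batch size and GPU placement are what the rest of the post is about.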
Mise en place
Before starting, let’s install the required dependencies:
```shell
pip install "daft[ray]" turbopuffer torch sentence-transformers spacy accelerate transformers
python -m spacy download en_core_web_sm
```