We recently used Qwen3-Embedding-0.6B to embed millions of text documents while sustaining near-100% GPU utilization the whole way.
That’s usually the gold standard that machine learning engineers aim for… but here’s the twist: in the time it took to write this blog post, we found a way to make the same workload 3× faster, and it didn’t involve maxing out GPU utilization at all. That story’s for another post, but first, here’s the recipe that got us to near-100%.
The workload
Here at the Daft kitchen, the same order keeps coming in: “One fast, painless pipeline to get my documents into a vector database for retrieval!”
Heard.
We whipped up a sample workload that:
1. Reads millions of text documents from S3
2. Chunks them into sentences using spaCy
3. Computes embeddings with the state-of-the-art model Qwen3-Embedding-0.6B
4. Writes the results to turbopuffer
Mise en place
Before starting, let’s install the required dependencies:
pip install "daft[ray]" turbopuffer torch sentence-transformers spacy accelerate transformers
python -m spacy download en_core_web_sm
You’ll also need to configure access for the object store where you’ll read data from. We prepared a sample dataset on AWS S3.
Import Dependencies and Configure Constants
We’ll then set the workload parameters:
import torch
import daft
from daft import col

NUM_GPU_NODES = 8
NLP_MODEL_NAME = "en_core_web_sm"
CHUNKING_PARALLELISM = 8
EMBEDDING_MODEL_NAME = "Qwen/Qwen3-Embedding-0.6B"
ENCODING_DIM = 1024
BATCH_SIZE = 512
SENTENCE_TRANSFORMER_BATCH_SIZE = 16
These parameters control resource allocation and processing efficiency. Adjust NUM_GPU_NODES based on your cluster size, and modify batch sizes based on your data and available GPU memory.
Step 1: Chunk Text
When creating embeddings, it's useful to split your text into meaningful chunks. Text is hierarchical and can be broken down at different levels: Document → Sections → Paragraphs → Sentences → Words → Characters. The chunking strategy to use depends on your use case.
Chunking Strategies
• Sentence-level chunking works well for most use cases, especially when the document structure is unclear or inconsistent.
• Paragraph-level chunking is good for RAG (Retrieval-Augmented Generation) applications where maintaining context across sentences is important.
• Section-level chunking is useful for long documents that have clear structural divisions.
• Fixed-size chunks are simple to implement but may break semantic meaning at arbitrary boundaries.
When to Use Each Approach
• Sentence splitting is the default choice when you're unsure about the document structure or when working with diverse content types.
• Paragraph splitting is preferred for RAG systems where maintaining context across multiple sentences matters for retrieval quality (a minimal paragraph-splitter sketch appears after this list).
• Custom splitting is necessary for specialized content like tweets, text messages, or code that doesn't follow standard paragraph structures.
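If paragraph-level chunks fit your data better, the same pattern we use below still applies. Here is a minimal, hypothetical splitter that assumes paragraphs are separated by blank lines (adjust the delimiter to your documents):

# Hypothetical paragraph-level splitter; assumes blank-line-separated paragraphs.
def split_paragraphs(text: str) -> list[dict]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"text": paragraph, "chunk_id": i}
        for i, paragraph in enumerate(paragraphs)
    ]

# split_paragraphs("Intro paragraph.\n\nSecond paragraph.")
# -> [{'text': 'Intro paragraph.', 'chunk_id': 0},
#     {'text': 'Second paragraph.', 'chunk_id': 1}]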
Implementation
We'll use sentence-level chunking in this example.
We'll also use spaCy, a natural language processing library whose sentence boundary detection handles edge cases better than simple punctuation-based splitting.
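To see what that buys you, here's a tiny standalone check (the example sentence is purely illustrative): spaCy typically treats abbreviations like "Dr." and "U.S." as non-terminal, where a naive split on periods would not.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith moved to the U.S. in 2019. She now leads the team.")

# Prints the detected sentences; a naive split on "." would produce four fragments.
print([sent.text for sent in doc.sents])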
chunked_type = daft.DataType.list(
    daft.DataType.struct({
        "text": daft.DataType.string(),
        "chunk_id": daft.DataType.int32()
    })
)

@daft.udf(
    return_dtype=chunked_type,
    concurrency=NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1),
    batch_size=BATCH_SIZE // CHUNKING_PARALLELISM // 2
)
class ChunkingUDF:
    def __init__(self):
        import spacy
        self.nlp = spacy.load(NLP_MODEL_NAME)

    def __call__(self, text_col):
        results = []
        for text in text_col:
            doc = self.nlp(text)
            sentence_texts = [
                {"text": sentence.text, "chunk_id": i}
                for i, sentence in enumerate(doc.sents)
            ]
            results.append(sentence_texts)
        return results
This User-Defined Function (UDF):
• Loads the spaCy model once per UDF instance during initialization for efficiency
• Processes batches of text (text_col) to minimize overhead
• Returns a list of sentence chunks, each tagged with a per-document chunk ID
• Runs many instances in parallel (concurrency = NUM_GPU_NODES * (CHUNKING_PARALLELISM + 1) = 72) for distributed processing
Step 2: GPU-Accelerated Embedding Generation
Choosing a Text Embedding Model
The quality of your embeddings depends heavily on the model you choose. Here are some key considerations:
Model Performance
• MTEB Leaderboard: Check the Massive Text Embedding Benchmark (MTEB) leaderboard for the latest performance rankings across various tasks
• Task-specific performance: Different models excel at different tasks (semantic search, clustering, classification, etc.)
• Multilingual support: Consider whether you need to process text in multiple languages
• Language-specific tasks: If you only need to support a single language, it could be helpful to look at model performance for that specific language instead of multilingual benchmarks
Some Popular Models
• Qwen3-Embedding-0.6B: Good performance-to-size ratio, state-of-the-art, and the model used in this example
• all-MiniLM-L6-v2: The default in the Sentence Transformers documentation, often used in tutorials
• gemini-embedding-001: The current top multilingual model on MTEB. Requires Gemini API access
• Seed1.6-Embedding: The current top model on the Chinese MTEB leaderboard. Requires Volcengine API access
With open models available on HuggingFace, you can easily swap models by changing the EMBEDDING_MODEL_NAME constant in the code below.
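Before a full cluster run, it can be worth sanity-checking a candidate model locally. A minimal sketch (the query and document strings are placeholders):

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Swap in any SentenceTransformer-compatible model ID from Hugging Face.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

embeddings = model.encode([
    "How do I generate embeddings with Daft?",
    "Daft is a distributed dataframe library.",
])
print(embeddings.shape)                       # (2, 1024) for Qwen3-Embedding-0.6B
print(cos_sim(embeddings[0], embeddings[1]))  # rough similarity check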
We'll create a UDF to generate embeddings from the chunked text:
# Define the return type for embeddings
embedding_type = daft.DataType.embedding(daft.DataType.float32(), ENCODING_DIM)

@daft.udf(
    return_dtype=embedding_type,
    concurrency=NUM_GPU_NODES,
    num_gpus=1,
    batch_size=BATCH_SIZE
)
class EncodingUDF:
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(EMBEDDING_MODEL_NAME, device=device)
        self.model.compile()

    def __call__(self, text_col):
        embeddings = self.model.encode(
            text_col.to_pylist(),
            batch_size=SENTENCE_TRANSFORMER_BATCH_SIZE,
            convert_to_tensor=True,
            torch_dtype=torch.bfloat16,
        )
        return embeddings.cpu().numpy()
This UDF:
• Loads the SentenceTransformer model on GPU if available
• Uses bfloat16 precision to reduce memory usage
• Processes text in batches (SENTENCE_TRANSFORMER_BATCH_SIZE = 16) for optimal GPU utilization
• Returns NumPy arrays, which are compatible with Daft
Step 3: Configure Distributed Processing
You can run this script locally, but if you're interested in running this pipeline on a cluster, check out our guide on scaling up. In this example, we ran on a Ray cluster with 8 g5.2xlarge workers (each with an A10G GPU). To configure our Daft script to use the Ray cluster, we added:
# Use the Ray runner for distributed execution
daft.context.set_runner_ray()

# Pick up S3 credentials from the environment for reads
daft.set_planning_config(
    default_io_config=daft.io.IOConfig(
        s3=daft.io.S3Config.from_env()
    )
)
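If the Ray cluster isn't launched from the same machine, you can also point Daft at it by address. A hedged sketch (the address is a placeholder, and the address parameter assumes a reasonably recent Daft release):

# Connect to an existing Ray cluster instead of starting one locally.
daft.context.set_runner_ray(address="ray://<head-node-address>:10001")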
Step 4: Execute the Pipeline
Now we'll execute the complete data processing pipeline:
(
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("chunk_id", col("sentences")["chunk_id"])
    .exclude("sentences")
    .with_column("embedding", EncodingUDF(col("text")))
    .with_column(
        "id",
        col("url").str.right(50) + "-" + col("chunk_id").cast(daft.DataType.string())
    )
    .select("id", "url", "language", "source", "text", "embedding")
    .write_turbopuffer(
        namespace="desmond-scale-experiment6",
        region="aws-us-west-2",
        id_column="id",
        vector_column="embedding",
        distance_metric="cosine_distance"
    )
)
Pipeline steps explained:
1. Read data: Load the Parquet dataset from S3
2. Chunk text: Apply the sentence-splitting UDF
3. Explode: Flatten the list of sentences into separate rows
4. Extract fields: Pull text and chunk_id out of the sentence structs
5. Generate embeddings: Apply the embedding UDF to the text
6. Create IDs: Generate unique IDs by combining the URL and chunk_id
7. Select columns: Keep only the necessary columns
8. Write to Turbopuffer: Store the text and vectors in Turbopuffer
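While iterating, it can help to preview a small slice of the pipeline before paying for the full write. A minimal sketch that reuses the UDFs above but swaps the write_turbopuffer call for a show:

# Sanity-check the first few embedded rows without writing to Turbopuffer.
(
    daft.read_parquet("s3://desmond-demo/text-embedding-dataset.parquet")
    .limit(100)
    .with_column("sentences", ChunkingUDF(col("text")))
    .explode("sentences")
    .with_column("text", col("sentences")["text"])
    .with_column("embedding", EncodingUDF(col("text")))
    .select("url", "text", "embedding")
    .show(5)
)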
If all goes well, when you run this script on your cluster you should see network I/O, CPU work, and GPU work pipelined in parallel, along with high GPU utilization :)
Customization Tips
• Adjust batch sizes: Increase SENTENCE_TRANSFORMER_BATCH_SIZE for better throughput, or decrease it for lower GPU memory usage
• Scale workers: Modify NUM_GPU_NODES and CHUNKING_PARALLELISM based on your cluster size and the cores available per node
• Change models: Replace EMBEDDING_MODEL_NAME with other SentenceTransformer models
• Different chunking: Modify ChunkingUDF to use different text chunking strategies
• Alternative vector databases: Replace the final write with other vector databases like Lance, Pinecone, or Chroma (see the sketch after this list)
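For instance, to land the results in Lance instead of Turbopuffer, or to keep a Parquet copy for bulk-loading into another store later, a hedged sketch (here df stands for the dataframe built in Step 4 up to, but not including, the write_turbopuffer call, the paths are placeholders, and write_lance assumes a Daft version with Lance support):

# Write to a Lance dataset instead of Turbopuffer.
df.write_lance("s3://your-bucket/embeddings.lance")

# Or keep a Parquet copy for loading into another vector database later.
df.write_parquet("s3://your-bucket/embeddings/")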
Performance Considerations
• GPU memory: Monitor GPU memory usage and adjust batch sizes accordingly. If your GPUs fail to allocate sufficient memory or you exceed the max sequence length of your embedding model, SENTENCE_TRANSFORMER_BATCH_SIZE may be too large
• Model loading: UDFs load models once per worker, so initialization time is amortized
• Quantization: Use bfloat16 or float16 quantization for lower GPU memory utilization and higher throughput (see the sketch after this list)
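As a sketch of the quantization tip: recent sentence-transformers releases accept a model_kwargs argument that is forwarded to the underlying transformers loader, so you can load the weights directly in bfloat16 (availability depends on your installed version):

import torch
from sentence_transformers import SentenceTransformer

# Load model weights in bfloat16, roughly halving GPU memory versus float32.
model = SentenceTransformer(
    "Qwen/Qwen3-Embedding-0.6B",
    device="cuda",
    model_kwargs={"torch_dtype": torch.bfloat16},
)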
This pipeline can efficiently process millions of text documents while automatically scaling across your available compute resources.
What’s next on the menu?
With this recipe, we hit near-100% GPU utilization—a benchmark that’s the holy grail for many.
But the Daft kitchen never stops cooking. Since then, we’ve been experimenting with new ingredients and techniques—custom GPU pipelining, swapping Sentence Transformers for vLLM—that have made the whole meal cook 3× faster.
We’re still plating that next dish, and trust us, it’s worth the wait. Keep an eye out for the upcoming blog where we’ll share how we turned up the heat and pushed throughput beyond the peak-utilization grind.
Until then, happy embedding! And remember: we don’t sell the GPUs, we sell the sizzle.