tl;dr: Multimodal models are promising, but inference engines haven't been optimized for them yet. We profiled SGLang’s scheduler on a multimodal workload and identified an opportunity to replace expensive book-keeping around shared GPU memory with a simple cache lookup. Throughput and latency both improved by over 10% on our target workload. The improvement is merged in SGLang v0.5.10.
| Metric | Handle Cache OFF | Handle Cache ON | Improvement |
|---|---|---|---|
| Throughput (req/s) | 22.2 | 25.7 | +16.2% |
| TTFT mean (ms) | 965 | 838 | -13.2% |
| TPOT mean (ms) | 72 | 60 | -17.2% |
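In code terms, the idea is simply memoization: instead of redoing an expensive piece of book-keeping for the same shared GPU buffer over and over, compute it once and look it up afterwards. The sketch below is illustrative only -- `export_handle` and `HandleCache` are invented names standing in for the real SGLang code path, which differs in its details.

```python
import time

import torch

# Hypothetical stand-in for the expensive per-request book-keeping around
# shared GPU memory (the real code path in SGLang differs); what matters is
# that the work is slow and its result depends only on the buffer itself.
def export_handle(tensor: torch.Tensor) -> bytes:
    time.sleep(0.001)  # pretend this costs ~1 ms of host time per call
    return str(tensor.untyped_storage().data_ptr()).encode()


class HandleCache:
    """Memoize handle exports, keyed by the buffer's device pointer."""

    def __init__(self) -> None:
        self._handles: dict[tuple[int, int], bytes] = {}

    def get(self, tensor: torch.Tensor) -> bytes:
        key = (tensor.untyped_storage().data_ptr(), tensor.device.index or 0)
        handle = self._handles.get(key)
        if handle is None:                # slow path, taken once per buffer
            handle = export_handle(tensor)
            self._handles[key] = handle
        return handle                     # fast path: a plain dict lookup
```

Because the same GPU buffers are touched on every scheduler iteration, the slow path runs a handful of times and every subsequent request pays only for a dictionary lookup.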
Multimodal vision-language models (VLMs) give artificial intelligence eyes. Our users deploy smaller VLMs for efficient parsing of unstructured documents and large ones to power multimodal coding agents that can see the apps they are designing.
These new input types and new models pose new challenges for open-source inference engines like SGLang and vLLM. And one of the most stubborn challenges is maximizing performance -- solved here, as always, only by a relentless grind, one small improvement at a time.
This blog post tells the story of one of those humble changes.
While working with a customer, we were benchmarking Qwen2.5-VL-3B-Instruct on H100s, and we noticed that SGLang’s throughput had plateaued well below what the GPU could handle. The solution was to remember the "golden rule" of inference performance engineering: never block the GPU.
Identifying host overhead
When you notice an inference performance issue, stop yourself from going CUDA MODE and scrutinizing warp stall reasons in Nsight Compute -- you can even put down the Torch Profiler! Check the easy things first: what is happening on the host and why isn't it faster?
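Answering that question rarely needs heavy tooling. Sampling the scheduler process with py-spy (`py-spy top --pid <scheduler PID>`) shows where host-side Python time goes without stopping the process, or you can bracket suspect sections with coarse wall-clock timers. The helper below is a generic sketch we might use for that, not code from SGLang:

```python
import time
from collections import defaultdict

# Coarse host-side timing: enough to spot a section of the loop that eats
# milliseconds per iteration, with no GPU profiler involved.
_totals: dict[str, float] = defaultdict(float)
_counts: dict[str, int] = defaultdict(int)


def timed(name: str):
    """Decorator that accumulates wall-clock time for the wrapped function."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                _totals[name] += time.perf_counter() - start
                _counts[name] += 1
        return inner
    return wrap


def report() -> None:
    """Print per-section totals, worst offenders first."""
    for name, total in sorted(_totals.items(), key=lambda kv: -kv[1]):
        n = _counts[name]
        print(f"{name}: {total * 1e3:.1f} ms total, "
              f"{total / n * 1e3:.3f} ms/call over {n} calls")
```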
In (V)LM inference engines like SGLang, the scheduler is the key host-side component and potential bottleneck -- a single-threaded loop that gates submission of work to the GPU.
Every millisecond spent in the scheduler is a millisecond during which prefill and decode iterations are stalled for all in-flight requests. We’ve said it before: host overhead will kill your inference efficiency.
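To make the failure mode concrete, here is a deliberately simplified sketch of the shape of such a loop. The engine methods (`build_next_batch`, `run_batch`, `process_outputs`) are invented names, not SGLang's actual API; the point is that any host-side book-keeping in steps 1 and 3 is time during which no new work reaches the GPU.

```python
def scheduler_loop(engine, waiting_queue):
    """Deliberately simplified single-threaded scheduler loop (illustrative only)."""
    while True:
        # 1. Host-side work: admit new requests, pick the next batch, and do
        #    any per-request book-keeping. Every millisecond here delays the
        #    kernel launches in step 2 for every in-flight request.
        batch = engine.build_next_batch(waiting_queue)

        # 2. Submit GPU work for the batch (a prefill or decode step).
        if batch is not None:
            engine.run_batch(batch)

        # 3. More host-side work: detokenize, stream outputs back to clients,
        #    and retire finished requests. Also on the critical path.
        engine.process_outputs(batch)
```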