Tokasaurus: An LLM inference engine for high-throughput workloads
Published on: 2025-06-09 05:27:07
TL;DR
We’re releasing Tokasaurus, a new LLM inference engine optimized for throughput-intensive workloads. For small models, Tokasaurus benefits from very low CPU overhead and dynamic Hydragen grouping to exploit shared prefixes. For larger models, it supports async tensor parallelism on GPUs with NVLink and a fast implementation of pipeline parallelism on GPUs without it. On throughput-focused benchmarks, Tokasaurus can outperform vLLM and SGLang by more than 3x.
Intro
As LLMs get smarter, faster, and cheaper, the community keeps finding new ways to use them. Our own recent work has explored using models to scan every file in a codebase, sample 10,000 attempts for math and code problems, and collaborate with other models to minimize cloud costs. Inference is now also an important part of the training process, where we use models to generate synthetic data or as part of RL pipelines that generate rollouts.
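Workloads like these, where thousands of completions are sampled for the same problem, mean many sequences in a batch share a long prompt prefix, which is what the Hydragen grouping mentioned in the TL;DR exploits. As a rough illustration (not Tokasaurus's actual code; the function and parameter names here are hypothetical), bucketing requests by a shared prefix might look like this:

```python
# A minimal, illustrative sketch of shared-prefix grouping. If many sequences
# in a batch start with the same tokens, work on that prefix (e.g. attention
# over its KV cache) can be done once per group instead of once per sequence.
# All names below are hypothetical, chosen for this example only.
from collections import defaultdict

def group_by_shared_prefix(sequences: list[list[int]], min_prefix_len: int = 16):
    """Bucket token sequences by their first `min_prefix_len` tokens.

    Returns a mapping from shared prefix -> the sequences that carry it,
    so a batched kernel could process each distinct prefix exactly once.
    """
    groups: dict[tuple[int, ...], list[list[int]]] = defaultdict(list)
    for seq in sequences:
        if len(seq) >= min_prefix_len:
            groups[tuple(seq[:min_prefix_len])].append(seq)
        else:
            # Too short to share a full prefix; bucket by the whole sequence.
            groups[tuple(seq)].append(seq)
    return groups

# Example: four samples for the same prompt land in a single prefix group.
prompt = list(range(100))  # stand-in for a tokenized prompt
batch = [prompt + [s] for s in (7, 8, 9, 10)]
groups = group_by_shared_prefix(batch)
assert len(groups) == 1  # one shared prefix, so prefix work is done once
```

Once sequences are bucketed this way, attention over the shared prefix can in principle be computed once per group and reused by every member, which is where the throughput win on repeated-sampling workloads comes from.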