Day-zero model performance optimization is a mix of experimentation, bug fixing, and benchmarking, guided by intuition and experience. This writeup outlines the process we followed to achieve SOTA latency and throughput for GPT OSS 120B on NVIDIA GPUs at launch with the Baseten Inference Stack.
The day an open-source model like OpenAI’s new gpt-oss-120b is released, we race to make it as performant as possible for our customers. As a launch partner for OpenAI’s first open-source LLM since 2019, we wanted to give developers a great experience with the new models.
By the end of launch day, we were the clear leader in both latency and throughput among providers running on NVIDIA GPUs, per public data from real-world usage on OpenRouter.
What matters is having the inference optimization muscle to immediately push on latency and throughput.
Optimizing performance on a new model is a substantial engineering challenge. Thanks to our flexible inference stack and the collective expertise of our model performance engineering team, we are able to roll out performance improvements by the hour on new models.
In fact, in the time it took to write this blog post, we added another 100 tokens per second while maintaining 100% uptime.
OpenRouter performance for GPT OSS, 6:45 PM August 6, 2025
And we added another 100 tokens per second in the time it took the post to hit #1 on Hacker News.
OpenRouter performance for GPT OSS, 9:45 PM August 6, 2025
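Tokens-per-second figures like these come from timing a streamed response end to end. As a rough illustration (not our actual benchmarking harness), here is a minimal sketch that measures time to first token and post-first-token throughput against an OpenAI-compatible streaming endpoint; the base URL, API key, and model name below are placeholders.

```python
import time

from openai import OpenAI

# Placeholder endpoint and credentials, not Baseten's actual API.
client = OpenAI(base_url="https://example.com/v1", api_key="PLACEHOLDER")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model="openai/gpt-oss-120b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize the attention mechanism."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    # Some chunks carry no content (e.g., role headers or finish signals).
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # latency: time to first token
        chunks += 1  # chunk count approximates token count

end = time.perf_counter()
if first_token_at is not None and chunks > 1:
    ttft = first_token_at - start
    tps = (chunks - 1) / (end - first_token_at)  # throughput after first token
    print(f"TTFT: {ttft:.3f}s, ~{tps:.1f} tokens/s")
```

Chunk counts only approximate token counts; exact throughput requires counting with the model’s tokenizer, and leaderboards like OpenRouter aggregate these measurements across real production traffic.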
Model performance efforts included: