How fast is N tokens per second really?

2026-05-18

Why This Matters

Benchmarks for local and cloud-hosted language models report raw throughput in tokens per second, but those numbers are hard to judge without context. The article streams sample output at selectable rates in several modes and shows that content type changes how fast a given rate feels. For developers and users, this means judging a model by how its output actually streams, not by the benchmark figure alone. Trying different rates and modes is the quickest way to calibrate expectations.

Key Takeaways

The same token rate can feel faster or slower depending on content type and mode.
Benchmark numbers alone can mislead because they do not capture perceived speed.
Trying different rates and modes shows how a model's throughput feels in practice.

How fast is 10 tokens per second really?

[Interactive demo. Modes: code (c), text (t), think (h), agent (a). Rate presets in tokens per second, keys 1 to 9: 5, 10, 20, 30, 60, 100, 200, 400, 800. Think length: 5 sentences.]

Every local-LLM benchmark reports throughput: "47 tok/s on an M3," "180 tok/s on a 4090," "500 tok/s on Groq." Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
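To make the arithmetic behind these rates concrete, here is a minimal Python sketch that turns a tokens-per-second figure into a per-token delay and a total streaming time, then replays some text at a chosen rate. It is an illustration, not the page's own renderer. The function names, the 500-token response length, and the use of whitespace-split words as stand-ins for real tokens are all assumptions. The rates are the ones listed in the demo's preset row.

```python
import time


def token_interval(tokens_per_second: float) -> float:
    """Seconds between consecutive tokens at a fixed rate."""
    if tokens_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / tokens_per_second


def stream_duration(n_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to emit n_tokens at a fixed rate."""
    return n_tokens * token_interval(tokens_per_second)


def _emit(token: str) -> None:
    # Flush on every token so a terminal shows output as it arrives.
    print(token, end="", flush=True)


def stream(tokens, tokens_per_second, write=_emit, sleep=time.sleep):
    """Write tokens one at a time, pausing after each to hold the rate."""
    delay = token_interval(tokens_per_second)
    for token in tokens:
        write(token)
        sleep(delay)


if __name__ == "__main__":
    # Assumed response length of 500 tokens, chosen only for illustration.
    for rate in (5, 10, 20, 30, 60, 100, 200, 400, 800):
        seconds = stream_duration(500, rate)
        print(f"{rate:>4} tok/s: 500 tokens take {seconds:7.2f} s")

    # Hypothetical sample: words stand in for tokens (real tokenizers differ).
    words = ("lorem ipsum dolor sit amet " * 4).split()
    stream((word + " " for word in words), tokens_per_second=20)
    print()
```

At 10 tokens per second each token arrives 100 ms after the previous one, so a 500-token answer takes 50 seconds. At 800 tokens per second the same answer takes 0.625 seconds. The `write` and `sleep` parameters are injectable so the pacing logic can be checked without waiting on a real clock.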

Four modes

code: syntax-highlighted pseudo-code, the most common thing you watch stream out of an LLM.

text: lorem ipsum prose, for the chat/answer case.

... continue reading
