How fast is N tokens per second really?

Why This Matters

Raw throughput figures for local and cloud-hosted language models are hard to interpret without context. The article presents several token-streaming modes and shows that the type of content being streamed changes how fast a given rate feels. Developers and users should therefore judge speed by watching real output, not by benchmark numbers alone. Trying different rates and modes gives a more realistic sense of what end users will experience.

[Interactive demo: "How fast is 10 tokens per second really?" Keys c, t, h, and a select the code, text, think, and agent modes; keys 1 through 9 select 5, 10, 20, 30, 60, 100, 200, 400, and 800 tokens per second. Think length: 5 sentences.]

Every local-LLM benchmark reports throughput: "47 tok/s on an M3," "180 tok/s on a 4090," "500 tok/s on Groq." Unless you've actually watched tokens stream at those rates, the numbers are hard to internalize. This is the rendering.
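One way to get a feel for these figures without the interactive page is to do the arithmetic and then watch a stream in a terminal. The sketch below is not the article's implementation. The names `seconds_to_stream` and `stream_at_rate` are hypothetical, the 500-token answer length is an assumed example, and treating each whitespace-separated word as one token is a simplification (real tokenizers split text differently).

```python
import sys
import time
from typing import Callable, Iterable, Iterator


def seconds_to_stream(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock seconds needed to emit num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


def stream_at_rate(
    tokens: Iterable[str],
    tokens_per_second: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield tokens one at a time, pausing 1/rate seconds between them."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    delay = 1.0 / tokens_per_second
    for index, token in enumerate(tokens):
        if index > 0:
            sleep(delay)  # no pause before the first token
        yield token


if __name__ == "__main__":
    # Rates quoted in the paragraph above; 500 tokens is an assumed answer length.
    for rate in (47, 180, 500):
        duration = seconds_to_stream(500, rate)
        print(f"{rate:>3} tok/s: {duration:.1f} s for a 500-token answer")

    # Assumption: one word stands in for one token.
    words = "the quick brown fox jumps over the lazy dog".split()
    for word in stream_at_rate(words, tokens_per_second=10):
        sys.stdout.write(word + " ")
        sys.stdout.flush()
    print()
```

Under these assumptions, a 500-token answer takes about 10.6 seconds at 47 tok/s, about 2.8 seconds at 180 tok/s, and 1 second at 500 tok/s. The `sleep` parameter is injectable so the pacing logic can be checked without actually waiting.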

Four modes

code: syntax-highlighted pseudo-code, the most common thing you watch stream out of an LLM.

text: lorem ipsum prose, for the chat/answer case.

... continue reading