
Prompt caching for cheaper LLM tokens


Sam Rose is a Senior Developer Educator at ngrok, focusing on creating content that helps developers get the most out of ngrok.

As I write this post, cached input tokens are 10x cheaper per token than regular input tokens on both OpenAI's and Anthropic's APIs.
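To put a number on that, here's a back-of-the-envelope sketch of what the discount means for a request whose prompt is mostly cached. The prices are hypothetical placeholders; only the 10x ratio comes from the providers' pricing.

```python
# Illustrative cost comparison for a request whose prompt is mostly cached.
# The per-token prices are hypothetical placeholders, not real provider rates.
PRICE_PER_INPUT_TOKEN = 1.00 / 1_000_000             # hypothetical: $1.00 per million input tokens
PRICE_PER_CACHED_TOKEN = PRICE_PER_INPUT_TOKEN / 10  # cached input tokens at the 10x discount

prompt_tokens = 50_000   # e.g. a large system prompt plus conversation history
cached_tokens = 48_000   # the portion of the prompt the provider reports as cached

uncached_cost = prompt_tokens * PRICE_PER_INPUT_TOKEN
cached_cost = (
    (prompt_tokens - cached_tokens) * PRICE_PER_INPUT_TOKEN
    + cached_tokens * PRICE_PER_CACHED_TOKEN
)

print(f"without caching: ${uncached_cost:.4f} per request")
print(f"with caching:    ${cached_cost:.4f} per request")
```

The closer the cached portion is to the full prompt, the closer the whole request gets to that 10x saving.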

Anthropic even claim that prompt caching can reduce latency "by up to 85% for long prompts", and in my own testing I found that, for a long enough prompt, this is true. I sent hundreds of requests to both Anthropic and OpenAI and noticed a substantial reduction in time-to-first-token latency for prompts where every input token was cached.
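If you want to reproduce that kind of measurement yourself, a minimal sketch looks something like the following. It assumes the official openai Python SDK, uses "gpt-5" as a placeholder model name, and relies on the prompt being long enough to qualify for caching at all (OpenAI documents a minimum prompt length of roughly 1,024 tokens).

```python
import time

from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A prompt long enough to exceed the minimum cacheable prefix length.
LONG_PROMPT = "Here is some filler document text to pad the prompt out. " * 1000

def time_to_first_token(prompt: str, model: str = "gpt-5") -> float:
    """Stream a completion and return the seconds elapsed until the first content chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return time.perf_counter() - start

# The first request warms the cache; later identical requests should report
# cached input tokens and show a noticeably lower time-to-first-token.
for i in range(5):
    print(f"request {i}: {time_to_first_token(LONG_PROMPT):.2f}s to first token")
```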

[Charts: GPT-5 (OpenAI) and Sonnet 4.5 (Anthropic)]

Now that I've hooked you in with fancy gradient text and pretty charts, have you ever asked yourself...

What on earth is a cached token?

What's going on in those vast oceans of GPUs that enables providers to give you a 10x discount on input tokens? What are they saving between requests? It's not a case of saving the response and re-using it if the same prompt is sent again; that's easy to verify through the API: write a prompt, send it a dozen times, and notice that you get a different response each time, even when the usage section shows cached input tokens.
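Here's a sketch of that experiment against the OpenAI API, which caches long prompts automatically and reports the cached portion in the response's usage field. The model name is a placeholder; a similar check works with Anthropic's cache_control blocks and its cache_read_input_tokens counter.

```python
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()

# A prompt long enough to be eligible for automatic prompt caching.
prompt = "Summarise the following document.\n\n" + ("Some long document text. " * 1000)

for i in range(3):
    response = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens if usage.prompt_tokens_details else 0
    # The cached token count climbs after the first request, yet the
    # generated text still differs from run to run.
    print(f"request {i}: cached_tokens={cached}")
    print(response.choices[0].message.content[:80])
```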

Not satisfied with the answers in the vendor documentation, which does a good job of explaining how to use prompt caching but sidesteps the question of what is actually being cached, I decided to go deeper. I went down the rabbit hole of how LLMs work until I understood precisely what data providers cache, what it's used for, and how it makes everything faster and cheaper for everyone.

By the end of this post you will...

Understand, at a deeper level, how LLMs work

Have built some new intuition for why LLMs work the way they do
