Pop quiz: at what context length do a coding agent's cache reads cost as much as everything else in the next API call combined? By 50,000 tokens, your conversation's costs are probably dominated by cache reads.
Let’s take a step back. We’ve previously written about how coding agents work: they post the conversation thus far to the LLM, and continue doing that in a loop as long as the LLM is requesting tool calls. When there are no more tools to run, the loop waits for user input, and the whole cycle starts over. Visually:
The agentic loop
Or, in code form:
def loop(llm):
    msg = user_input()
    while True:
        output, tool_calls = llm(msg)
        print("Agent: ", output)
        if tool_calls:
            msg = [handle_tool_call(tc) for tc in tool_calls]
        else:
            msg = user_input()
LLM providers charge for input tokens, cache writes, output tokens, and cache reads. It's a little tricky: you mark in your request how far to cache the prompt (usually to the end), and those tokens are billed as "cache write" rather than as input. The previous turn's output then becomes the next turn's cache write. Visually:
Token costs across LLM calls
Here, the colors and numbers indicate the costs making up the nth call to the LLM. Every subsequent call reads the story so far from the cache, writes the previous call's output (plus any new input) to the cache, and produces an output. Area represents cost, though this diagram isn't drawn quite to scale. Add up all the rectangles, and that's the total cost.
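To make the rectangles concrete, here's a rough sketch of the nth call's cost under that model. The prices are illustrative (loosely in line with published rates for a mid-tier model; check your provider's pricing page), the per-turn token counts are assumptions rather than measurements, and plain uncached input is omitted since in this model every new token gets written to the cache:

NEW_INPUT_PER_TURN = 2_000   # assumption: tool results / user input added per call
OUTPUT_PER_TURN = 300        # assumption: tokens the model writes per call

PRICE_PER_MTOK = {           # illustrative prices, in dollars per million tokens
    "cache_write": 3.75,
    "cache_read": 0.30,
    "output": 15.00,
}

def call_cost(n):
    """Approximate cost breakdown of the nth call, treating every turn the same."""
    # The story so far: everything earlier calls have written to the cache.
    cached = (n - 1) * (NEW_INPUT_PER_TURN + OUTPUT_PER_TURN)
    return {
        # Re-read the whole conversation from the cache.
        "cache_read": cached * PRICE_PER_MTOK["cache_read"] / 1e6,
        # Write the previous output plus the new input into the cache.
        "cache_write": (NEW_INPUT_PER_TURN + OUTPUT_PER_TURN) * PRICE_PER_MTOK["cache_write"] / 1e6,
        # Pay for the new output.
        "output": OUTPUT_PER_TURN * PRICE_PER_MTOK["output"] / 1e6,
    }

print(call_cost(25))   # by mid-conversation, cache_read is the biggest rectangle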
That triangle emerging for cache reads? That's the scary quadratic!
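A quick way to see why it's quadratic: if each call adds roughly t tokens to the context, the nth call re-reads about (n - 1) * t cached tokens, and summing that over N calls gives the area of the triangle. A tiny sanity check, with t and N picked arbitrarily:

t = 2_300   # assumed tokens added to the context per call (input + output)
N = 100     # number of LLM calls in the conversation

total_cache_read_tokens = sum((n - 1) * t for n in range(1, N + 1))
closed_form = t * N * (N - 1) // 2   # triangle area: grows with N squared
assert total_cache_read_tokens == closed_form
print(f"{total_cache_read_tokens:,} cached tokens re-read across {N} calls")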
How scary is the quadratic? Pretty squarey! I took a rather ho-hum feature implementation conversation, and visualized it like the diagram above. The area corresponds to cost: the width of every rectangle is the number of tokens and the height is the cost per token. As the conversation evolves, more and more of the cost is the long thin lines across the bottom that correspond to cache reads.
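And circling back to the pop quiz: with the same illustrative prices as in the sketch above and a hypothetical turn of about 2,000 new input tokens and 300 output tokens, you can solve for the context size at which the cache read alone costs as much as the rest of the call:

CACHE_READ, CACHE_WRITE, OUTPUT = 0.30, 3.75, 15.00   # illustrative $ per million tokens
new_input, new_output = 2_000, 300                    # assumed per-turn token counts

rest_of_call = (new_input * CACHE_WRITE + new_output * OUTPUT) / 1e6
crossover = rest_of_call / (CACHE_READ / 1e6)
print(f"cache reads reach half the call's cost at ~{crossover:,.0f} cached tokens")

With these assumptions that lands around 40,000 tokens, in the same ballpark as the 50,000-token figure above; longer turns push the crossover later, shorter ones pull it earlier.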