In a recent post, we introduced the Tau² benchmark, a framework for benchmarking LLMs. Today we're sharing a surprising discovery we made while using it: a simple prompt rewrite boosted a small model's success rate by over 20%. This post is a deep dive into how we found and fixed this performance bottleneck by making subtle changes to agent policies.
Benchmarking LLMs with Tau²
At the recent OpenAI Summer Update, we saw that GPT-5 has made significant strides in agentic tasks. To validate these claims, OpenAI turned to the Tau² benchmark, which simulates real-world agent interactions across domains such as telecom, retail, and airlines.
Before moving any further, we should note that GPT-5 showed significant improvement in only one benchmark domain: Telecom. The others were largely overlooked during the model presentation, so we won't dwell on them either (😉).
In agentic interactions, accuracy is non-negotiable, but model speed is equally vital for user experience. Therefore, it makes sense to consider alternatives to flagship models, such as the recently introduced GPT-5-mini.
GPT-5-mini offers significant advantages: roughly half the latency of the full GPT-5 and noticeably higher throughput. It delivers 85–95% of GPT-5's performance while being five times cheaper.
Therefore, we ran an experiment to explore two things:
How well GPT-5-mini performs on this benchmark.
Whether we can improve its results by making subtle changes to the domain, such as modifying agent policies or task descriptions.
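As a rough sketch of how such a run might be kicked off, the snippet below invokes the tau2-bench CLI for the Telecom domain with gpt-5-mini as the agent model. The exact command name and flags here are assumptions for illustration; check your installed version of the benchmark for the precise interface.

```python
# Sketch: launch a Tau² Telecom run with gpt-5-mini via the tau2-bench CLI.
# The command name and flags below are assumptions; consult your installed
# tau2-bench version for the exact interface.
import subprocess

cmd = [
    "tau2", "run",
    "--domain", "telecom",        # the domain where GPT-5 showed its gains
    "--agent-llm", "gpt-5-mini",  # the agent model under test
    "--user-llm", "gpt-5-mini",   # the simulated user model
    "--num-trials", "5",          # repeat each task to get stable success rates
]
subprocess.run(cmd, check=True)
```

Running the same command with the agent model swapped to gpt-5 gives a like-for-like baseline for the comparison that follows.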