In a recent post, we introduced the Tau² benchmark, a framework for benchmarking LLMs. Today we're sharing a surprising discovery we made while using it: a simple prompt rewrite boosted a small model's success rate by over 20%. This post is a deep dive into how we found and fixed this performance bottleneck by making subtle changes to agent policies.

## Benchmarking LLMs with Tau²

At the recent OpenAI Summer Update, we saw that GPT-5 has made significant strides in agentic tasks. To validate these claims, OpenAI turned to the Tau² benchmark, which simulates real-world agent interactions across domains like telecom, retail, and airlines.

Before going any further, we should note that GPT-5 showed significant improvement in only one benchmark domain: Telecom. The other domains were largely glossed over during the presentation, so we won't bother with them either (😉).

In agentic interactions, accuracy is non-negotiable, but model speed is equally vital for user experience. It therefore makes sense to consider alternatives to flagship models, such as the recently introduced GPT-5-mini. GPT-5-mini offers significant advantages: roughly half the latency, noticeably better throughput, and a price tag five times lower, while delivering 85–95% of the full GPT-5's performance.

So we ran an experiment to explore two things:

1. How well GPT-5-mini performs on this benchmark.
2. Whether we can improve its results by making subtle changes to the domain, such as modifying agent policies or task descriptions.

## Baseline: Expect GPT-5-mini to Fail 45% of the Time

First, let's establish a baseline for the GPT-5-mini model. The telecom benchmark contains over 100 tests, so we used a subset: the `telecom_small` task set comes in handy with just 20 test scenarios.

We ran the benchmark with:

```bash
tau2 run \
  --domain telecom \
  --agent-llm gpt-5-mini \
  --user-llm gpt-5-mini \
  --num-trials 2 \
  --task-set-name telecom_small
```

This gave us 40 simulations in total, and the initial success rate was low: just 55%. GPT-5-mini, with its limited reasoning capabilities, doesn't even get close to flagship GPT-5.

There's an additional interesting metric this benchmark introduces: pass^k. It measures how well an agent performs when challenged with the same task k times; I like to think of it as the reliability of the AI agent. Another intriguing aspect of the benchmark is the set of tasks that fail in every trial, which could imply the AI agent is simply not capable of handling them at all. This can happen for several reasons: the required reasoning might be too difficult, the user's request might not be specific enough, and so on.

## The Hack: Using Claude to Rewrite Prompts for GPT-5-mini

When hacking the AI agent for the GPT-5-mini model, we aimed to answer three questions:

1. Can we improve the overall success rate when using this limited model?
2. Are we able to "unlock" more tasks that the agent is capable of handling?
3. Does agent reliability improve?

The beauty of generative AI is that we could offload a lot of the work and allow ourselves a little laziness. So we asked Claude to analyze the AI agent policies in the telecom domain, which are the building blocks of the agent prompt. Specifically, we told it to assume that these policies would feed the prompt of an AI agent running on a faster, non-flagship LLM, and to rephrase the output documents (`main_policy.md` and `tech_support_manual.md`) so that they are easier for the agent to follow and improve its behaviour.
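Below is a minimal sketch of what that rewriting step can look like, assuming the Anthropic Python SDK. The model name, the `_optimized` output suffix, and the instruction text are illustrative placeholders rather than the exact prompt we used; the point is simply to hand each policy document to a frontier model along with the constraint that the output must be easy for a smaller agent LLM to follow.

```python
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

# Illustrative instructions, not the exact prompt we used.
REWRITE_INSTRUCTIONS = """\
The document below contains policies for a telecom customer-support AI agent.
Assume it will be placed in the system prompt of an agent running on a faster,
non-flagship LLM. Rewrite it so that model can follow it reliably: prefer
decision trees, numbered steps, binary yes/no conditions, exact tool names,
and short imperative sentences over narrative explanations.
"""

def rewrite_policy(path: str) -> str:
    """Ask a frontier model to rewrite one policy document for a smaller agent LLM."""
    with open(path, encoding="utf-8") as f:
        policy = f.read()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: any recent Claude model works
        max_tokens=8000,
        messages=[{
            "role": "user",
            "content": f"{REWRITE_INSTRUCTIONS}\n\n<document>\n{policy}\n</document>",
        }],
    )
    return response.content[0].text

for doc in ("main_policy.md", "tech_support_manual.md"):
    rewritten = rewrite_policy(doc)
    with open(doc.replace(".md", "_optimized.md"), "w", encoding="utf-8") as f:
        f.write(rewritten)
```

Scripting the step like this makes it repeatable: whenever the domain policies change, the optimized versions can be regenerated and re-benchmarked.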
Here's the "executive summary" of the improvements it made:

> I've created two AI-optimized versions of the telecom domain documentation.
>
> **Key Improvements Made:**
>
> **1. Structure & Flow**
> - Decision Trees: Clear branching logic with ├── and └── notation
> - Sequential Steps: Numbered, ordered procedures instead of scattered explanations
> - Prerequisites: Explicit dependency checks before proceeding
>
> **2. AI Agent Optimizations**
> - Tool Call Clarity: Exact function names and parameters
> - Binary Decisions: Clear yes/no conditions instead of ambiguous language
> - Error Handling: Specific failure conditions and next steps
> - Verification Steps: "Recheck" instructions after each fix
>
> **3. Cognitive Load Reduction**
> - Reference Tables: Quick lookup for tools and purposes
> - Pattern Recognition: Common issue combinations and their solutions
> - Critical Reminders: Common AI mistakes section to prevent errors
>
> **4. Actionable Language**
> - Removed verbose explanations mixed with instructions
> - Consolidated multiple documents' logic into single workflows
> - Used imperative commands: "Check X", "If Y then Z"
> - Added immediate verification steps

In short, we moved from lengthy policy declarations to clear, directive instructions. Instead of wordy statements, the new style uses step-by-step logic, conditions, and patterns. For prompts, this works more like a checklist ("Check X → If Y, then Z") than vague, descriptive policies.

## The Result: 22% Accuracy Boost and 50% Fewer Unsolvable Tasks

Let's review what our improved AI agent's results look like. The new prompts led to a significant performance boost, and the pass^k metrics surged:

- k=1 went from 0.55 to 0.675 (a 22.7% relative improvement). In plain terms, GPT-5-mini now succeeds on 67.5% of tasks instead of 55%.
- k=2 went from 0.4 to 0.5 (a 25% relative improvement), meaning retries became more effective too.

For context, flagship GPT-5 scores ~97% on this benchmark, o3 comes in at 58%, and GPT-4.1 at 34%. With our optimized prompts, GPT-5-mini not only jumped well above its own baseline but also outperformed o3, landing much closer to GPT-5 than before.

The side-by-side comparison shows exactly where the gains came from: on the left are the "stock" AI agent results, on the right our AI agent tuned for GPT-5-mini. With the updated prompts and policies, we managed to "unlock" some of the tests that previously failed in every trial due to GPT-5-mini's limited capabilities. Only 3 tasks remained unsolved across both trials, compared to 6 before.

## Key Takeaways for Your Own Models

This experiment shows that thoughtful prompt design can meaningfully boost the performance of smaller models like GPT-5-mini. By restructuring policies into clear, step-by-step instructions, we not only improved success rates but also "unlocked" tasks that previously seemed unsolvable for the model.

The key was simplifying language, reducing ambiguity, and breaking reasoning down into explicit, actionable steps. Smaller models struggle with long-winded or fuzzy policies, but thrive when given structured flows, binary decisions, and lightweight verification steps.

The takeaway is clear: using a frontier model to automatically optimize prompts can unlock major improvements for smaller LLMs.
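If you want to track the same reliability metric on your own agent, here is a minimal sketch of the pass^k computation, assuming the τ-bench-style estimator: for a task attempted n times with c successes, pass^k is estimated as C(c, k) / C(n, k), then averaged over all tasks. The task IDs and results below are made-up examples.

```python
from math import comb

def pass_hat_k(trials_per_task: dict[str, list[bool]], k: int) -> float:
    """Average over tasks of the estimated probability that the agent
    succeeds on a task in *all* of k independent attempts."""
    scores = []
    for task_id, results in trials_per_task.items():
        n, c = len(results), sum(results)
        if n < k:
            continue  # not enough trials to estimate pass^k for this task
        scores.append(comb(c, k) / comb(n, k))
    return sum(scores) / len(scores)

# Toy example with 2 trials per task, mirroring our telecom_small setup:
results = {
    "task_01": [True, True],
    "task_02": [True, False],
    "task_03": [False, False],
}
print(pass_hat_k(results, k=1))  # 0.5   -> plain success rate
print(pass_hat_k(results, k=2))  # ≈0.33 -> only task_01 succeeds twice in a row
```

The gap between pass^1 and pass^2 is exactly the reliability story from the results above: an agent can look decent on single attempts while still being inconsistent when asked to do the same thing twice.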
With strategic optimization, lightweight models can deliver decent results at a fraction of the cost, making them a compelling alternative when efficiency and affordability matter as much as accuracy.

If you found this helpful, let us know! Prompt engineering is still an open playground, and we're excited to see what creative approaches others are exploring in this space.