The Lie
We have all been there. You build an agent. It works perfectly in the demo. You deploy it. And then, on a Tuesday at 3 PM, it decides that the URL for the API documentation is api.stripe.com/v1/users. That URL is a 404, but it looks so plausible that you waste 20 minutes debugging network errors.
Worse, it says this with 100% confidence.
When we try to fix this today, the industry tells us to use “LLM-as-a-Judge.” We are told to ask GPT-4o to grade GPT-3.5. We are told to fix the “vibes.”
But this creates a dangerous circular dependency. The Judge is itself an LLM. If the underlying models suffer from sycophancy (agreeing with the user) or hallucination, the Judge suffers from them too, and it often hallucinates a passing grade.
We are trying to fix probability with more probability. That is a losing game.
Code > Vibes
I believe we need to stop treating Agents like magic boxes and start treating them like software. Software has assertions. Software has unit tests. Software has return False.
We need to re-introduce Determinism into the stack.
Don’t ask an LLM if a URL is valid. It will hallucinate a 200 OK. Run requests.get().
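Here is a minimal sketch of what that check could look like. The names http_status and url_is_live are hypothetical, and treating "valid" as "the final response is a 2xx" is my assumption, not a universal rule. The fetcher is passed in as a parameter so the pass/fail logic can be unit tested without touching the network.

```python
from typing import Callable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the final HTTP status code for url, or None if the request fails."""
    # Imported lazily so the check below can be tested offline with a fake fetcher.
    import requests

    try:
        # requests.get follows redirects by default, so this is the final status.
        return requests.get(url, timeout=timeout).status_code
    except requests.RequestException:
        # DNS failure, timeout, refused connection, malformed URL, and so on.
        return None


def url_is_live(url: str, fetch: Callable[[str], Optional[int]] = http_status) -> bool:
    """Deterministic check: True only if the URL answers with a 2xx status.

    Assumption: "valid" means a 2xx response. Adjust the range if your
    agent should also accept other codes.
    """
    status = fetch(url)
    return status is not None and 200 <= status < 300
```

In an agent loop, you would call url_is_live(candidate_url) on any URL the model produces and reject the output on False, instead of asking a second model whether the link "looks right."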