
We decreased our LLM costs with Opus

Why This Matters

This article highlights how optimizing the architecture of large language model (LLM) workflows can significantly reduce costs in the tech industry. By implementing a layered approach with specialized agents and semantic search, companies can avoid unnecessary model usage, leading to more efficient and affordable AI operations for consumers and enterprises alike.


Last week we wrote about feeding terabytes of CI logs to an LLM. Most of the questions on Hacker News weren't about the logs. They were about the agent: which models, how they coordinate, and how much it all costs.

Today we run Opus 4.6 and pay less than when we ran everything on Sonnet 4.0.

The reason is mostly what Opus doesn't do: 80% of failures never reach it, and when they do, it never reads a log line.

The architecture, in short: a cheap triager agent screens every failure, and only genuinely new ones ever reach Opus.

Let a cheap agent decide if the expensive one is needed

Last week we analyzed around 4,000 CI failures. 818 were new problems. The other 3,187 were known issues surfacing again: a flaky test, an infrastructure hiccup, a network blip we'd already detected.

It makes no sense to wake up an expensive model when 80% of the time the answer is "it's a duplicate". Unfortunately, we can't deterministically detect duplicates: the same job can fail multiple times for completely different reasons, so you need to actually look at the logs to know if you've seen this before.

We initially used Sonnet for this to balance cost and performance. It worked, but it was the worst of both worlds: still expensive, and the results weren't as good as a frontier model.

We switched to the "triager" pattern: a Haiku agent with a single narrow job. Is this issue already tracked or not? If it is, stop right there. If not, escalate to Opus.
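In code, the pattern is just a gate in front of the expensive call. A minimal sketch, assuming hypothetical `classify_with_haiku` and `analyze_with_opus` callables standing in for the real model calls, which the article doesn't show:

```python
def triage(failure_log, classify_with_haiku, analyze_with_opus):
    """Triager pattern: a cheap model answers one narrow question,
    and only genuinely new failures escalate to the frontier model."""
    # Cheap step, runs on every failure. Expected to return something
    # like {"duplicate": True, "issue_id": 42} (assumed shape).
    verdict = classify_with_haiku(failure_log)
    if verdict.get("duplicate"):
        # Known issue: link the failure to it and stop. Opus never runs.
        return {"action": "link_existing", "issue_id": verdict["issue_id"]}
    # New issue: the expensive model is invoked only on this ~20% path.
    return {"action": "investigate", "report": analyze_with_opus(failure_log)}
```

The point of the gate is economic, not architectural: the cheap call runs 4,000 times, the expensive one roughly 800.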

Detecting duplicates with Haiku proved a bit challenging. We needed to make the job as easy as possible, so we attached error messages to previous failures and gave Haiku two search tools: exact matching for known error snippets, and semantic search (pgvector) for similar-but-not-identical errors. RAG is dead, but semantic search is pretty neat: "operator does not exist: bigint = character varying" and "migration type mismatch on installation_id" are different strings but the same root cause, and semantic search surfaces that.
