Tech News

OTelBench: AI struggles with simple SRE tasks (Opus 4.5 scores only 29%)


Frontier AI models have become excellent at writing functions, but can they actually debug production systems?

To fix outages, you first need to see what’s happening. In a microservices world, this means producing structured events that track a single request as it hops from service to service.

We asked 14 models to add distributed traces to existing codebases, using the standard method: OpenTelemetry instrumentation. We picked tasks that would be easy for a Site Reliability Engineer (SRE).

See the OTelBench website for complete charts. All models struggle with OpenTelemetry. Even the best model, Claude Opus 4.5, succeeded only 29% of the time, and GPT 5.2 was close behind at 26%. Surprisingly, Gemini 3 Pro has no edge over Gemini 3 Flash, which scored 19%.

We are releasing OTelBench as an open-source benchmark, with all tasks in QuesmaOrg/otel-bench. We use the Harbor framework (by the creators of TerminalBench), so you can easily run it yourself to reproduce results, test new models, or create benchmarks for your own use cases (we welcome contributions!).

Background: What is distributed tracing?

When an app runs on a single machine, you can often trace an error by scrolling through a log file. But when it runs across 50 microservices, that single request gets scattered into a chaotic firehose of disconnected events. Distributed tracing solves this by linking them back together, allowing you to follow a user action, like clicking Login, as it jumps from the API Gateway, to the Auth Service, to the Database, and back.

[Figure: trace waterfall for a Login request (TraceID abc123def456) — Login 187ms, API Gateway 173ms, Auth Service 83ms, Database 48ms, User Service 68ms.] Distributed tracing links a user action, like a Login button click, to every underlying microservice call.

To make this work, you need instrumentation. This is code that you add to your app to:

1. Start a trace when a request comes in.
2. Pass the TraceID (context) when your app calls another service.
3. Send the data to a backend so you can see the graph.
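The three steps above can be sketched in plain Python. This is not the real OpenTelemetry API, just a minimal model of the mechanics: every function name here is hypothetical, and real context propagation uses the W3C `traceparent` header format rather than a bare ID.

```python
import time
import uuid

def start_trace():
    # Step 1: start a trace when a request comes in.
    return {"trace_id": uuid.uuid4().hex, "spans": []}

def inject_context(trace_ctx, headers):
    # Step 2: pass the TraceID to the next service, e.g. in an HTTP
    # header, so its spans join the same trace.
    headers["traceparent"] = trace_ctx["trace_id"]
    return headers

def record_span(trace_ctx, name, start, end):
    # Step 3: record a span; a real exporter would batch these and
    # send them to a tracing backend.
    trace_ctx["spans"].append(
        {"name": name, "duration_ms": (end - start) * 1000}
    )

# Simulate a login request hopping across two services.
ctx = start_trace()
headers = inject_context(ctx, {})          # outgoing call carries the ID
t0 = time.monotonic()
record_span(ctx, "auth-service", t0, t0 + 0.083)
record_span(ctx, "db-query", t0, t0 + 0.048)
```

Because every service stamps its spans with the same `trace_id` it received in the headers, a backend can reassemble the scattered events into the single waterfall shown above.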
