We ran a benchmark comparing two ways of letting an AI agent operate the same admin panel, with the goal of putting a price tag on vision agents (browser-use, computer-use).
Here is what we measured, what we had to change to make the vision agent work at all, and what changes when generating an API surface stops being a separate engineering project.
Vision agents are the default way to let AI agents operate web apps that don't expose APIs. The alternative, writing an MCP or REST surface per app, is its own engineering project, multiplied across the 20+ internal tools most teams run. Teams default to vision agents not because they are better but because the alternative is too expensive to build, and the cost of the vision approach gets treated as a fixed price.
We wanted to measure the price.
The test app is an admin panel for managing customers, orders, and reviews, modeled on the react-admin Posters Galore demo. Two agents target the same running app: one drives the UI via screenshots and clicks, the other calls the app's HTTP endpoints directly. Same Claude Sonnet, same pinned dataset, same task. The interface is the only variable.
The task: find the customer named "Smith" with the most orders, locate their most recent pending order, accept all of their pending reviews, and mark the order as delivered. This touches three resources and requires filtering, pagination, cross-entity lookups, and both reads and writes. It is the shape of work a typical internal tool sees daily.
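Stripped of either interface, the task reduces to a short chain of data operations. The sketch below is illustrative only: the records, field names, and statuses are invented stand-ins, not the actual pinned dataset.

```python
# Hypothetical mini-dataset standing in for the admin panel's fixtures.
customers = [
    {"id": 1, "last_name": "Smith", "orders": [101, 102]},
    {"id": 2, "last_name": "Smith", "orders": [103]},
    {"id": 3, "last_name": "Jones", "orders": [104]},
]
orders = {
    101: {"status": "pending", "date": "2024-05-01"},
    102: {"status": "delivered", "date": "2024-04-01"},
    103: {"status": "pending", "date": "2024-03-01"},
}
reviews = [
    {"id": 9, "customer_id": 1, "status": "pending"},
    {"id": 10, "customer_id": 1, "status": "accepted"},
]

# 1. The customer named "Smith" with the most orders.
smiths = [c for c in customers if c["last_name"] == "Smith"]
top = max(smiths, key=lambda c: len(c["orders"]))

# 2. Their most recent pending order.
pending = [o for o in top["orders"] if orders[o]["status"] == "pending"]
latest = max(pending, key=lambda o: orders[o]["date"])

# 3. Accept all of their pending reviews.
for r in reviews:
    if r["customer_id"] == top["id"] and r["status"] == "pending":
        r["status"] = "accepted"

# 4. Mark the order as delivered.
orders[latest]["status"] = "delivered"
```

Either agent has to discover and execute this same chain; the benchmark varies only whether each step is a screenshot-and-click or a structured call.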
Path A: Vision agent. Claude Sonnet driving the UI via browser-use 0.12. Vision mode, taking screenshots and executing clicks.
Path B: API agent. Claude Sonnet with tool-use, calling the handlers the UI calls. Each tool maps to one or more event handlers on the app's State, the same functions a button click would trigger. The agent gets the structured response back instead of a rendered page.
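The tool-to-handler mapping can be sketched as follows. This is a hedged illustration, not the repo's code: the handler `update_order_status`, the tool schema, and the `dispatch` helper are all hypothetical, assuming an Anthropic-style tool-use loop where the model emits a tool name plus JSON input.

```python
# Hypothetical stand-in for a State handler the UI's button click would call.
def update_order_status(order_id: int, status: str) -> dict:
    # The real handler would mutate app State; here we just echo the result.
    return {"order_id": order_id, "status": status}

# Tool schema advertised to the model (Anthropic tool-use format).
TOOLS = [
    {
        "name": "update_order_status",
        "description": "Set the status of an order, e.g. 'delivered'.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "integer"},
                "status": {"type": "string"},
            },
            "required": ["order_id", "status"],
        },
    }
]

# One tool name -> one or more handlers; here a 1:1 mapping.
HANDLERS = {"update_order_status": update_order_status}

def dispatch(tool_name: str, tool_input: dict) -> dict:
    """Route a model tool call to its handler and return the structured
    result the agent sees instead of a rendered page."""
    return HANDLERS[tool_name](**tool_input)

result = dispatch("update_order_status", {"order_id": 101, "status": "delivered"})
```

The point of the design is that the agent's tool results are the handlers' return values directly, so no HTML ever needs to be parsed or screenshotted on this path.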
All code is open source.
We started by giving both agents the same six-sentence task above and seeing what happened.