Web Bench: a new way to compare AI browser agents
Published on: 2025-06-14 10:57:25
TL;DR: Web Bench is a new dataset for evaluating web browsing agents. It consists of 5,750 tasks across 452 websites, of which 2,454 tasks are open-sourced. Anthropic Sonnet 3.7 CUA is the current SOTA; the detailed results are here.
Over the past few months, web browsing agents such as Skyvern, Browser-use, and OpenAI's Operator (CUA) have taken the world by storm. These agents have been used in production for a variety of tasks, from helping people apply to jobs and downloading invoices to filing SS-4 forms for newly incorporated companies.
Skyvern attempting to purchase a product
Skyvern attempting to fill out the IRS form
Most agents report state-of-the-art performance, but we find that browser agents still struggle with a wide variety of tasks, particularly those involving authentication, form filling, and file downloading.
This is because today's standard benchmark (WebVoyager) focuses on read-heavy tasks and consists of only 643 tasks across just 15 websites (out of 1
...