We evaluate three leading AI models—Claude Sonnet 4.5 (Anthropic), Gemini 2.5 Pro (Google), and GPT-5 (OpenAI)—on their ability to solve Google reCAPTCHA v2 challenges. Compared to Sonnet and Gemini, GPT-5's long and slow reasoning traces led to repeated challenge timeouts and significantly lower performance.
Many sites use CAPTCHAs to distinguish humans from automated traffic. How well do these CAPTCHAs hold up against modern AI agents? We tested three leading models—Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT-5—on their ability to solve Google reCAPTCHA v2 challenges and found significant differences in performance. Claude Sonnet 4.5 performed best with a 60% success rate, slightly outperforming Gemini 2.5 Pro at 56%. GPT-5 performed markedly worse, solving CAPTCHAs on only 28% of trials.
Figure 1: Overall success rates for each AI model. Claude Sonnet 4.5 achieved the highest success rate at 60%, followed by Gemini 2.5 Pro at 56% and GPT-5 at 28%.
Each reCAPTCHA challenge falls into one of three types: Static, Reload, and Cross-tile (see Figure 2). The models' success depended heavily on challenge type. In general, all models performed best on Static challenges and worst on Cross-tile challenges.
| Model | Static | Reload | Cross-tile |
| --- | --- | --- | --- |
| Claude Sonnet 4.5 | 47.1% | 21.2% | 0.0% |
| Gemini 2.5 Pro | 56.3% | 13.3% | 1.9% |
| GPT-5 | 22.7% | 2.1% | 1.1% |

Figure 2: The three types of reCAPTCHA v2 challenges. Static presents a fixed 3x3 grid; Reload dynamically replaces clicked images; and Cross-tile uses a 4x4 grid with objects potentially spanning multiple squares. The table shows model performance by CAPTCHA type. Success rates are lower than in Figure 1 because these rates are measured at the challenge level rather than the trial level. Note that reCAPTCHA determines which challenge type is shown; this is not configurable by the user.
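Because reCAPTCHA chooses the challenge type itself, an analysis pipeline has to recognize which type it was shown. The snippet below is a minimal sketch of how that could be inferred from the grid size and from whether clicked tiles are refreshed, as described in the Figure 2 caption; the ChallengeObservation record is a hypothetical stand-in, not the actual harness code.

```python
from dataclasses import dataclass

# Hypothetical observation record; the real experiment logs may look different.
@dataclass
class ChallengeObservation:
    grid_size: int                  # 3 for a 3x3 grid, 4 for a 4x4 grid
    tiles_replaced_on_click: bool   # True if new images appear where a tile was clicked

def classify_challenge(obs: ChallengeObservation) -> str:
    """Infer the reCAPTCHA v2 challenge type from what the agent observes."""
    if obs.grid_size == 4:
        return "Cross-tile"   # 4x4 grid, objects may span multiple squares
    if obs.tiles_replaced_on_click:
        return "Reload"       # 3x3 grid that refreshes clicked tiles
    return "Static"           # plain 3x3 grid that never changes

# Example: a 3x3 challenge whose clicked tiles refresh is a Reload challenge.
print(classify_challenge(ChallengeObservation(grid_size=3, tiles_replaced_on_click=True)))
```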
Model analysis
Why did Claude and Gemini perform better than GPT-5? We found the difference was largely due to excessive, obsessive reasoning. Browser Use, the agent framework driving the browser in these experiments, executes tasks as a sequence of discrete steps: the agent generates "Thinking" tokens to reason about the next step, chooses a set of actions, observes the response, and repeats. Compared to Sonnet and Gemini, GPT-5 spent longer reasoning at each step and generated more Thinking output to articulate its plan (see Figure 3).
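As a rough illustration of that loop, the sketch below shows an observe-think-act cycle in Python. The Step record and the observe, plan, and execute callables are hypothetical placeholders standing in for the model call and browser actions; this is not the actual Browser Use API.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical structures; Browser Use's real types and interfaces differ.
@dataclass
class Step:
    thinking: str           # the model's "Thinking" text for this step
    actions: List[str]      # the actions it chose (e.g. "click tile 5", "press verify")
    done: bool = False      # whether the model considers the task finished
    success: bool = False   # whether it believes the CAPTCHA was solved

def run_agent(
    observe: Callable[[], str],              # returns a description of the current page
    plan: Callable[[str, List[str]], Step],  # model call: observation + history -> next Step
    execute: Callable[[str], None],          # performs one browser action
    max_steps: int = 20,
) -> bool:
    """Observe-think-act loop: reason about the next step, act, observe, repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        observation = observe()
        step = plan(observation, history)
        history.append(step.thinking)        # Thinking output accumulates step by step
        for action in step.actions:
            execute(action)
        if step.done:
            return step.success
    return False                              # step budget exhausted, a common path to timeouts
```

A model that spends many steps re-checking and revising its tile selections burns through both this step budget and the CAPTCHA's own time limit, which is the failure mode described next.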
These issues were compounded by poor planning and verification: GPT-5 obsessively made edits and corrections to its solutions, clicking and unclicking the same square repeatedly. Combined with its slow reasoning, this behavior significantly increased the rate of CAPTCHA timeout errors.
Figure 3: Average number of "Thinking" characters by model and grid size (Static and Reload CAPTCHAs use a 3x3 grid; Cross-tile CAPTCHAs use a 4x4 grid). On every agent step, the model outputs a "Thinking" tag along with its reasoning about the actions it will take.
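The averages in Figure 3 can be thought of as a simple aggregation over per-step logs. The sketch below illustrates the computation with a hypothetical list of step records; the field names and example records are assumptions, not the experiment's actual log format.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-step records; the actual experiment logs may be structured differently.
steps = [
    {"model": "GPT-5", "grid": "3x3", "thinking": "The crosswalk seems to span tiles 1, 2 and 4..."},
    {"model": "Claude Sonnet 4.5", "grid": "4x4", "thinking": "Select the tiles containing the bus."},
]

# Group Thinking lengths by (model, grid size), then average: the quantity plotted in Figure 3.
lengths = defaultdict(list)
for step in steps:
    lengths[(step["model"], step["grid"])].append(len(step["thinking"]))

for (model, grid), values in sorted(lengths.items()):
    print(f"{model} ({grid}): {mean(values):.0f} thinking characters per step")
```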
CAPTCHA type analysis