ChatGPT 4.1 early benchmarks compared against Google Gemini
Published on: 2025-04-26 22:20:15
ChatGPT 4.1 is now rolling out, and while it is a significant leap over GPT‑4o, it fails to beat the benchmark set by Google Gemini.
Yesterday, OpenAI confirmed that developers with API access can try three new models: GPT‑4.1, GPT‑4.1 mini, and GPT‑4.1 nano.
According to the benchmarks, these models are far better than the existing GPT‑4o and GPT‑4o mini, particularly in coding.
For example, GPT‑4.1 scores 54.6% on SWE-bench Verified, an improvement of 21.4 percentage points over GPT‑4o and 26.6 points over GPT‑4.5. OpenAI reports similar gains on its other benchmarks, but how does the model compete against Gemini?
ChatGPT 4.1 early benchmarks
Benchmarks comparing LLMs
According to benchmarks shared by Stagehand, a production-ready browser automation framework, Gemini 2.0 Flash has the lowest error rate (6.67%) and the highest exact‑match score (90%), and it is also cheap and fast.
On the other hand, GPT‑4.1 has a higher error rate (16.67%) and costs over 10 times more.