After more than a month of rumors and feverish speculation — including Polymarket wagering on the release date — Google today unveiled Gemini 3, its newest proprietary frontier model family and the company’s most comprehensive AI release since the Gemini line debuted in 2023. The models are closed source, available exclusively through Google products, developer platforms, and paid APIs, including Google AI Studio, Vertex AI, the Gemini CLI, and third-party integrations across the broader IDE ecosystem.

Gemini 3 arrives as a full portfolio, including:

- Gemini 3 Pro: the flagship frontier model
- Gemini 3 Deep Think: an enhanced reasoning mode
- Generative interface models powering Visual Layout and Dynamic View
- Gemini Agent for multi-step task execution
- The Gemini 3 engine embedded in Google Antigravity, the company’s new agent-first development environment

Already, the independent AI benchmarking and analysis organization Artificial Analysis has crowned Gemini 3 Pro the "new leader in AI" globally, with a top score of 73 on the organization's index. That vaults Google up from its previous ninth-place showing with the preceding Gemini 2.5 Pro model, which scored 60 and trailed models from OpenAI, Moonshot AI, xAI, Anthropic, and MiniMax. As Artificial Analysis wrote on X: "For the first time, Google has the most intelligent model."

The launch represents one of Google’s largest and most tightly coordinated model releases. Gemini 3 is shipping simultaneously across Google Search, the Gemini app, Google AI Studio, Vertex AI, and a range of developer tools. Executives emphasized that this integration reflects Google’s control of TPU hardware, data center infrastructure, and consumer products.
According to the company, the Gemini app now has more than 650 million monthly active users, more than 13 million developers build with Google’s AI tools, and more than 2 billion monthly users engage with Gemini-powered AI Overviews in Search.

At the center of the release is a shift toward agentic AI — systems that plan, act, navigate interfaces, and coordinate tools, rather than just generating text. Gemini 3 is designed to translate high-level instructions into multi-step workflows across devices and applications, with the ability to generate functional interfaces, run tools, and manage complex tasks.

Major Performance Gains Over Gemini 2.5 Pro

Gemini 3 Pro introduces large gains over Gemini 2.5 Pro across reasoning, mathematics, multimodality, tool use, coding, and long-horizon planning. Google’s benchmark disclosures show substantial improvements in many categories.

Gemini 3 Pro debuted at the top of the LMArena text-reasoning leaderboard, posting a preliminary Elo score of 1501 based on pre-release community voting. That places it above xAI’s newly announced Grok-4.1-thinking model (1484) and Grok-4.1 (1465), both of which were unveiled just hours earlier, as well as above Gemini 2.5 Pro (1451) and recent Claude Sonnet and Opus releases. While LMArena covers only text-reasoning performance and the results are labeled preliminary, this ranking positions Gemini 3 Pro as the strongest publicly evaluated model on that benchmark as of its launch day — though not necessarily the top performer in the world across all modalities, tasks, or evaluation suites.

In mathematical and scientific reasoning, Gemini 3 Pro scored 95 percent on AIME 2025 without tools and 100 percent with code execution, compared to 88 percent for its predecessor. On GPQA Diamond, it reached 91.9 percent, up from 86.4 percent.
The model also recorded a major jump on MathArena Apex, reaching 23.4 percent versus 0.5 percent for Gemini 2.5 Pro, and delivered 31.1 percent on ARC-AGI-2, compared to 4.9 percent previously.

Multimodal performance increased across the board. Gemini 3 Pro scored 81 percent on MMMU-Pro, up from 68 percent, and 87.6 percent on Video-MMMU, compared to 83.6 percent. Its result on ScreenSpot-Pro, a key benchmark for agentic computer use, rose from 11.4 percent to 72.7 percent. Document understanding and chart reasoning also improved.

Coding and tool-use performance showed equally significant gains. The model’s LiveCodeBench Pro score reached 2,439, up from 1,775. On Terminal-Bench 2.0 it achieved 54.2 percent versus 32.6 percent previously. SWE-Bench Verified, which measures agentic coding through structured fixes, increased from 59.6 percent to 76.2 percent. The model also posted 85.4 percent on t2-bench, up from 54.9 percent.

Long-context and planning benchmarks indicate more stable multi-step behavior. Gemini 3 achieved 77 percent on MRCR v2 at 128k context (versus 58 percent) and 26.3 percent at 1 million tokens (versus 16.4 percent). Its Vending-Bench 2 score reached $5,478.16, compared to $573.64 for Gemini 2.5 Pro, reflecting stronger consistency during long-running decision processes.

Language understanding scores improved on SimpleQA Verified (72.1 percent versus 54.5 percent), MMLU (91.8 percent versus 89.5 percent), and the FACTS Benchmark Suite (70.5 percent versus 63.4 percent), supporting more reliable fact-based work in regulated sectors.

Generative Interfaces Move Gemini Beyond Text

Gemini 3 introduces a new class of generative interface capabilities. Visual Layout produces structured, magazine-style pages with images, diagrams, and modules tailored to the query. Dynamic View generates functional interface components such as calculators, simulations, galleries, and interactive graphs.
These experiences now appear in Google Search’s AI Mode, enabling models to surface information in visual, interactive formats beyond static text. Google says the model analyzes user intent to construct the layout best suited to a task. In practice, this includes everything from automatically building diagrams for scientific concepts to generating custom UI components that respond to user input.

Gemini Agent Introduces Multi-Step Workflow Automation

Gemini Agent marks Google’s effort to move beyond conversational assistance toward operational AI. The system coordinates multi-step tasks across tools like Gmail, Calendar, Canvas, and live browsing. It reviews inboxes, drafts replies, prepares plans, triages information, and reasons through complex workflows, while requiring user approval before performing sensitive actions.

On a press call with journalists ahead of the release yesterday, Google said the agent is designed to handle multi-turn planning and tool-use sequences with a consistency that was not feasible in earlier generations. It is rolling out first to Google AI Ultra subscribers in the Gemini app.

Google Antigravity and Developer Toolchain Integration

Antigravity is Google’s new agent-first development environment designed around Gemini 3. Developers collaborate with agents across an editor, terminal, and browser. The system orchestrates full-stack tasks, including code generation, UI prototyping, debugging, live execution, and report generation.

Across the broader developer ecosystem, Google AI Studio now includes a Build mode that automatically wires up the right models and APIs to speed AI-native app creation. Annotations support allows developers to attach prompts to UI elements for faster iteration.
Spatial reasoning improvements enable agents to interpret mouse movements, screen annotations, and multi-window layouts to operate computer interfaces more effectively.

Developers also gain new reasoning controls through “thinking level” and “model resolution” parameters in the Gemini API, along with stricter validation of thought signatures for multi-turn consistency. A hosted server-side bash tool supports secure, multi-language code generation and prototyping. Grounding with Google Search and URL context can now be combined to extract structured information for downstream tasks.

Enterprise Impact and Adoption

Enterprise teams gain the multimodal understanding, agentic coding, and long-horizon planning needed for production use cases. The new model unifies analysis of documents, audio, video, workflows, and logs. Improvements in spatial and visual reasoning support robotics, autonomous systems, and scenarios requiring navigation of screens and applications. High-frame-rate video understanding helps developers detect events in fast-moving environments.

Gemini 3’s structured document understanding capabilities support legal review, complex form processing, and regulated workflows. Its ability to generate functional interfaces and prototypes with minimal prompting reduces engineering cycles. In addition, the gains in system reliability, tool-calling stability, and context retention make multi-step planning viable for operations like financial forecasting, customer support automation, supply chain modeling, and predictive maintenance.

Developer and API Pricing

Google has disclosed initial API pricing for Gemini 3 Pro. In preview, the model is priced at $2 per million input tokens and $12 per million output tokens for prompts up to 200,000 tokens in Google AI Studio and Vertex AI.
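At those rates, per-request cost is simple arithmetic. As a rough sketch (the rates are the preview figures quoted above for prompts under the 200,000-token threshold; the function name is illustrative, not part of any Google SDK):

```python
# Estimate Gemini 3 Pro preview API cost for a single request.
# Rates are quoted per million tokens, so divide token counts by 1e6.
INPUT_RATE_PER_M = 2.00    # USD per 1M input tokens (prompts <= 200K tokens)
OUTPUT_RATE_PER_M = 12.00  # USD per 1M output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one API call at the base tier."""
    return (input_tokens / 1_000_000) * INPUT_RATE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_RATE_PER_M

# Example: a 50K-token prompt producing a 2K-token answer
print(round(estimate_cost(50_000, 2_000), 4))  # 0.124
```

A million tokens in and a million out at this tier totals $14, the figure that appears in the comparison table below.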
For prompts longer than 200,000 tokens, the input pricing doubles to $4 per million tokens, while the output rises to $18 per million tokens.

Compared with the API pricing for other frontier AI models from rival labs, Gemini 3 is priced in the mid-to-high range, which may impact adoption as cheaper and permissively licensed open-source Chinese models have increasingly come to be adopted by U.S. startups. Here's how it stacks up:

| Model | Input (/1M tokens) | Output (/1M tokens) | Total Cost | Source |
|---|---|---|---|---|
| ERNIE 4.5 Turbo | $0.11 | $0.45 | $0.56 | Qianfan |
| ERNIE 5.0 | $0.85 | $3.40 | $4.25 | Qianfan |
| Qwen3 (Coder ex.) | $0.85 | $3.40 | $4.25 | Qianfan |
| GPT-5.1 | $1.25 | $10.00 | $11.25 | OpenAI |
| Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | $11.25 | Google |
| Gemini 3 Pro (≤200K) | $2.00 | $12.00 | $14.00 | Google |
| Gemini 2.5 Pro (>200K) | $2.50 | $15.00 | $17.50 | Google |
| Gemini 3 Pro (>200K) | $4.00 | $18.00 | $22.00 | Google |
| Grok 4 (0709) | $3.00 | $15.00 | $18.00 | xAI API |
| Claude Opus 4.1 | $15.00 | $75.00 | $90.00 | Anthropic |

Gemini 3 Pro is also available at no charge, with rate limits, in Google AI Studio for experimentation.

The company has not yet announced pricing for Gemini 3 Deep Think, extended context windows, generative interfaces, or tool invocation. Enterprises planning deployment at scale will need these details to estimate operational costs.

Multimodal, Visual, and Spatial Reasoning Enhancements

Gemini 3’s improvements in embodied and spatial reasoning support pointing and trajectory prediction, task progression, and complex screen parsing. These capabilities extend to desktop and mobile environments, enabling agents to interpret screen elements, respond to on-screen context, and unlock new forms of computer-use automation.

The model also delivers improved video reasoning, with high-frame-rate understanding for analyzing fast-moving scenes, along with long-context video recall for synthesizing narratives across hours of footage.
Google’s examples show the model generating full interactive demo apps directly from prompts, illustrating the depth of multimodal and agentic integration.

Vibe Coding and Agentic Code Generation

Gemini 3 advances Google’s concept of “vibe coding,” where natural language acts as the primary syntax. The model can translate high-level ideas into full applications with a single prompt, handling multi-step planning, code generation, and visual design. Enterprise partners like Figma, JetBrains, Cursor, Replit, and Cline report stronger instruction following, more stable agentic operation, and better long-context code manipulation compared to prior models.

Rumors and Rumblings

In the weeks leading up to the announcement, X became a hub of speculation about Gemini 3. Well-known accounts such as @slow_developer suggested internal builds were significantly ahead of Gemini 2.5 Pro and likely exceeded competitor performance in reasoning and tool use. Others, including @synthwavedd and @VraserX, noted mixed behavior in early checkpoints but acknowledged Google’s advantage in TPU hardware and training data. Viral clips from users like @lepadphone and @StijnSmits showed the model generating websites, animations, and UI layouts from single prompts, adding to the momentum.

Prediction markets on Polymarket amplified the speculation. Whale accounts drove the odds of a mid-November release sharply upward, prompting widespread debate about insider activity. A temporary dip during a global Cloudflare outage became a moment of humor and conspiracy before odds surged again.

The key moment came when users including @cheatyyyy shared what appeared to be an internal model-card benchmark table for Gemini 3 Pro. The image circulated rapidly, with commentary from figures like @deedydas and @kimmonismus arguing the numbers suggested a significant lead.
When Google published the official benchmarks, they matched the leaked table exactly, confirming the document’s authenticity.

By launch day, enthusiasm reached a peak. A brief “Geminiii” post from Sundar Pichai triggered widespread attention, and early testers quickly shared real examples of Gemini 3 generating interfaces, full apps, and complex visual designs. While some concerns about pricing and efficiency appeared, the dominant sentiment framed the launch as a turning point for Google and a display of its full-stack AI capabilities.

Safety and Evaluation

Google says Gemini 3 is its most secure model yet, with reduced sycophancy, stronger prompt-injection resistance, and better protection against misuse. The company partnered with external groups, including Apollo and Vaultis, and conducted evaluations using its Frontier Safety Framework.

Deployment Across Google Products

Gemini 3 is available across Google Search’s AI Mode, the Gemini app, Google AI Studio, Vertex AI, the Gemini CLI, and Google’s new agentic development platform, Antigravity. Google says additional Gemini 3 variants will arrive later.

Conclusion

Gemini 3 represents Google’s largest step forward in reasoning, multimodality, enterprise reliability, and agentic capabilities. The model’s performance gains over Gemini 2.5 Pro are substantial across mathematical reasoning, vision, coding, and planning. Generative interfaces, Gemini Agent, and Antigravity demonstrate a shift toward systems that not only respond to prompts but plan tasks, construct interfaces, and coordinate tools. Combined with an unusually intense hype and leak cycle, the launch marks a significant moment in the AI landscape as Google moves aggressively to expand its presence across both consumer-facing and enterprise-facing AI workflows.