
Google’s Gemini 2.5 Pro model tops LMArena by close to 40 points, outperforms competitors in…

Gemini 2.5

[Image courtesy of Google]

Large language models are getting better at STEM. And the latest model from Google, Gemini 2.5 Pro Experimental 03-25, which emerged on the heels of its Gemini 2.0 series, is no exception. Announced March 25, this experimental update is touted as Google’s most intelligent AI yet, demonstrating significant gains in complex problem-solving and coding.

The model debuted at the top of the LMArena leaderboard “by a significant margin.” This benchmark measures human preferences, indicating the model is not only capable but also produces high-quality outputs that people prefer. In addition, the model achieved a state-of-the-art score of 18.8% on Humanity’s Last Exam, a challenging dataset designed to capture the “human frontier of knowledge and reasoning,” without using external tools. It also leads in math and science benchmarks (GPQA, AIME 2025) without test-time techniques like majority voting (which OpenAI’s o1-Pro is rumored to use), and scored 63.8% on SWE-Bench Verified, an industry standard for agentic code evaluations, using a custom agent setup.

There were a few areas where competitors fared better (see the table at the very bottom of this article):

GPQA Diamond (multiple attempts): Claude 3.7 Sonnet scored 84.8% compared to Gemini 2.5 Pro’s 84.0% (single attempt).

SWE-bench Verified (agentic coding): Claude 3.7 Sonnet achieved 70.3% while Gemini 2.5 Pro scored 63.8%.

AIME 2025 and AIME 2024 (multiple attempts): Grok 3 Beta scored 93.3% on both, compared to Gemini’s single-attempt scores of 86.7% and 92.0%, respectively.

LiveCodeBench v5: OpenAI’s o3-mini outperformed at 74.1% versus Gemini’s 70.4%.

SimpleQA (factuality): GPT-4.5 dominated with 62.5% compared to Gemini’s 52.9%.
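For readers tallying the head-to-head results, the scores cited above can be collected programmatically. This is a minimal Python sketch (the dictionary structure is ours, not Google’s, and it ignores the single- versus multiple-attempt caveats noted above) that prints the leader on each benchmark:

```python
# Scores cited in this article (percent). Note: attempt settings differ
# by model (e.g., Grok 3 Beta's AIME scores used multiple attempts).
scores = {
    "GPQA Diamond": {"Gemini 2.5 Pro": 84.0, "Claude 3.7 Sonnet": 84.8},
    "SWE-bench Verified": {"Gemini 2.5 Pro": 63.8, "Claude 3.7 Sonnet": 70.3},
    "AIME 2025": {"Gemini 2.5 Pro": 86.7, "Grok 3 Beta": 93.3},
    "AIME 2024": {"Gemini 2.5 Pro": 92.0, "Grok 3 Beta": 93.3},
    "LiveCodeBench v5": {"Gemini 2.5 Pro": 70.4, "o3-mini": 74.1},
    "SimpleQA": {"Gemini 2.5 Pro": 52.9, "GPT-4.5": 62.5},
}

# For each benchmark, find the model with the highest cited score.
for benchmark, results in scores.items():
    leader, best = max(results.items(), key=lambda kv: kv[1])
    print(f"{benchmark}: {leader} leads at {best}%")
```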

A screenshot of the LMArena dashboard (top 10) as of March 25

Gemini 2.5 Pro’s 84.0% on the GPQA Diamond benchmark represents a big step up in scientific reasoning capabilities. The benchmark is challenging: it contains 198 graduate-level questions across biology, physics, and chemistry designed to be “Google-proof,” requiring deep domain knowledge rather than simple information retrieval; even top Ph.D.-level experts find these questions extremely difficult. Models like OpenAI’s o1 and Claude 3.5 Sonnet have struggled to exceed 60% accuracy on this exam.

Gemini 2.5 is also a “thinking model,” in the same vein as OpenAI’s o1, DeepSeek’s R1, and the extended thinking mode of Claude 3.7 Sonnet. Unlike traditional models that primarily focus on classification and prediction, Gemini 2.5 is designed to reason through its thoughts before generating a response. This approach enables the model to analyze information more thoroughly, draw logical conclusions, incorporate context and nuance, and make more informed decisions. Building on Google DeepMind’s previous work with reinforcement learning and chain-of-thought prompting, Gemini 2.5 achieves its enhanced performance by combining a significantly improved base model with improved post-training. Google notes in a blog post that, going forward, the company is “building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.”

Gemini 2.5 ships with a 1 million token context window (with 2 million coming soon). It also features native multimodality — the ability to comprehend vast datasets and handle complex problems from different information sources, including text, audio, images, video, and even entire code repositories. This builds on the multimodal strengths of previous Gemini models, though specific performance metrics on multimodal benchmarks like MMMU or MathVista weren’t provided in the announcement.

Google specifically notes that Gemini 2.5 Pro’s benchmark results were achieved “Without test-time techniques that increase cost, like majority voting,” which they contrast with other models’ approaches. The SWE-Bench Verified score of 63.8% was achieved “with a custom agent setup,” though details of this setup weren’t specified in the announcement.

While Google’s announcement highlights Gemini 2.5 Pro’s impressive benchmark performance, several technical details remain undisclosed. The blog post doesn’t mention the model’s parameter count, training dataset composition, or computational requirements for training and inference. Google also doesn’t explicitly discuss known limitations or failure modes beyond labeling the release as “Experimental” and mentioning they “welcome feedback so we can continue to improve Gemini’s impressive new abilities at a rapid pace.”

Gemini 2.5 Pro is available via Google AI Studio and to Gemini Advanced subscribers.

Here is the benchmark comparison from Google:
