
Google's Most Powerful Model, Gemini 3, at a Glance: The Biggest Surprise of the Second Half of the Year and the Return of the Google Dynasty

Friends of 36Kr · 2025-11-19 11:03
The king returns.

After minor upgrades to GPT-5, Grok 4, and Claude Sonnet, the AI field entered a period of relative calm in the second half of 2025.

It wasn't until today that the release of Gemini 3 completely shattered this calm.

The leap-forward score improvements, powerful multimodal understanding, more versatile UI, and striking front-end capabilities have all moved AI a significant step closer to the form we envisioned.

These visible improvements are far more tangible than benchmark scores or differences that can only be detected deep inside coding systems.

Gemini 3 is not just a version-number bump. It is a forceful reaffirmation of faith in the Scaling Law, and the first Google model that truly outshines OpenAI.

At this launch, the Google we are familiar with has made a comeback. It is not content to lead in a single dimension; it pushes on all fronts, including model capabilities, developer tools, user experience, search integration, and multilingual coverage.

This is a platform-level ambition to redefine every touchpoint of the entire Google ecosystem with AI.

01 Benchmark Leap

Benchmark tests have always been controversial in AI circles and are jokingly referred to as "exam-style competitions." Over the past few months, the score differences between top-tier models were only a few percentage points, with each model closely chasing the others.

However, the emergence of Gemini 3 has turned the previously close competition into a one - sided victory.

Let's first look at basic reasoning ability. Humanity's Last Exam (HLE) is the ultimate test of whether an AI can solve humanity's most challenging problems. Before Gemini 3, Gemini 2.5 Pro scored 21.6%, and Claude Sonnet 4.5 only 13.7%. Gemini 3 Pro achieved 37.5% without tools and 45.8% with tools.

The ARC-AGI-2 test, often described as the Turing test of the AI field, measures a model's ability to handle novel reasoning tasks it has never seen before, rather than rote memorization.

Gemini 3 Pro scored 31.1%, while GPT-5.1 managed only 17.6% and Gemini 2.5 Pro just 4.9%. This means it is beginning to exhibit a fluid intelligence similar to that of humans, capable of abstract reasoning in areas not covered by large amounts of training data.

Even François Chollet, the founder of the ARC Prize, tweeted after seeing the results: "We've just verified that Gemini 3 Pro and Deep Think outperform the SOTA by more than 2x on ARC v2! This is really impressive, and to be honest, a bit unexpected."

Moreover, Gemini 3 Pro's fastest v2 solve took only 772 tokens and 188 seconds, approaching the human review panel's average of 147 seconds.

In mathematics, Google highlighted the new competition-level MathArena Apex benchmark. On this test, Gemini 2.5 Pro scored only 0.5%, Claude Sonnet 4.5 scored 1.6%, and GPT-5.1 scored 1.0%. Gemini 3 Pro achieved 23.4%.

In the multimodal field, Google's strong suit, Gemini 3's performance is even more astonishing.

It scored 81.0% on MMMU-Pro and 81.4% on CharXiv Reasoning, surpassing its competitors. On ScreenSpot-Pro, which tests screenshot understanding, it scored 72.7%: twice Claude Sonnet 4.5's score and roughly twenty times GPT-5.1's. This is crucial for building AI agents that can truly understand and operate graphical interfaces.

02 Coding Ability: Google's Weakness Makes a Complete Turnaround

On SWE-Bench Verified, which measures real-world software engineering ability, Gemini 3's 76.2% still trails Claude's 77.2%.

In other core third-party tests, however, Google far outperforms its competitors. On LiveCodeBench, Gemini 3's rating is more than 200 points higher than that of the second-ranked Grok 4.1.

On τ²-bench, which measures an agent's tool-using ability, Gemini 3 Pro scored 85.4%, far exceeding Gemini 2.5 Pro's 54.9%. On Terminal-Bench 2.0, which is closer to a real terminal environment, Gemini 3 scored 54.2%, 11 percentage points higher than the second-ranked model.

This is largely a demonstration of comprehensive abilities.

With stronger screen understanding and the front-end aesthetic sense derived from its multimodal capabilities, Gemini 3 easily outperforms its competitors in real-world programming environments.

For example, in the Design Arena, a practical coding arena operated by the developer community, Gemini 3 Pro ranked first overall and topped four out of five coding categories: websites, game development, 3D design, and UI components. This is the largest performance gap since the launch of the Design Arena.

Memory has always been a major bottleneck for models, so the improvement in Gemini 3's long-context ability also deserves attention.

It scored an average of 77.0% on the 128k-context portion of the MRCR v2 benchmark, far exceeding its competitors, and 26.3% on the pointwise 1M-context score.

This shows that Gemini 3 doesn't simply "stuff" more tokens but truly understands and utilizes the information in long documents.

According to the analysis by Artificial Analysis, Gemini 3 also performs strongly in factual recall.

Finally, let's look at comprehensive ability. Vending-Bench 2 is a benchmark that measures an AI model's ability to operate a business over the long term: the model runs a simulated vending machine business for a year, and the bank account balance at year's end is the score.

This test has become quite popular this year. As benchmarks saturate and agents remain hard to deploy in practice, companies care more about whether a model can sustain performance on complex tasks that require long-horizon, multi-step operation and continuous state tracking. Gemini 3 achieved an average net worth of $5,478.16, a significant leap over GPT-5.1's $1,473.43 and Gemini 2.5 Pro's $573.64.
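To make the setup concrete, here is a toy sketch of a Vending-Bench-style simulation loop. The prices, demand model, and the simple decide_restock rule are all invented for illustration; the real benchmark drives an LLM agent through tool calls rather than a hard-coded policy.

```python
# Toy sketch of the Vending-Bench idea: run a simulated vending machine
# for a year and score the agent by its final bank balance. All numbers
# and the restocking rule below are invented for illustration only.
import random

def decide_restock(balance: float, stock: int) -> int:
    """Stand-in for the model's decision: restock when inventory runs low."""
    if stock < 20 and balance >= 50.0:
        return 40  # order 40 units
    return 0

def run_year(seed: int = 0) -> float:
    rng = random.Random(seed)
    balance, stock = 500.0, 50            # starting cash and inventory
    unit_cost, unit_price = 1.25, 2.50
    for _day in range(365):
        order = decide_restock(balance, stock)
        balance -= order * unit_cost      # pay for the restock order
        stock += order
        demand = rng.randint(5, 15)       # customers per day
        sold = min(stock, demand)
        stock -= sold
        balance += sold * unit_price      # revenue from sales
        balance -= 2.0                    # daily operating fee
    return balance                        # the score: end-of-year net worth

print(f"Final balance: ${run_year():,.2f}")
```

The point of the benchmark is that the decision function is the model itself, and small mistakes in ordering, pricing, or bookkeeping compound over 365 simulated days.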

In addition to the Pro version, Gemini 3 also introduces a Deep Think mode, Google's answer to the heavy-reasoning modes introduced by rivals such as OpenAI. Its benchmark scores are higher than the Pro version's, but its token consumption is also roughly an order of magnitude higher.

The final Artificial Analysis ranking holds no suspense: Gemini 3 Pro takes first place by a clear margin, 3 points ahead of GPT-5.1.

This is the first time a Google language model has taken the top spot by a decisive margin, ending OpenAI's long-standing dominance.

However, beyond the numbers, the actual user experience is more important.

A developer named Tailen wrote after early testing: "This model far surpasses GPT-5 Pro, Gemini 2.5 Deep Think, and all other models in my most difficult problems." He listed the areas where Gemini 3 has established a new SOTA: debugging complex compiler errors, refactoring files without logical errors, solving difficult λ-calculus problems, and even "almost doing well" at ASCII art.

03 The Dusk of Front-End Development

Gemini 3's dominant performance in the Design Arena is no accident. Developers have found that Gemini 3 can not only write functionally correct code but, more importantly, understand aesthetics. In many of its designs, the responsive layout is natural and smooth, the color scheme matches modern tastes, the animation effects are just right, and accessibility is well considered.

Part of the source of this aesthetic intelligence is the training data. According to Gemini 3's Model Card, its training data includes large amounts of image, video, and web data. The model has learned not only how to code but also what kinds of interfaces are beautiful and well laid out.

Building on this front-end strength, Google launched "Generative UI". Traditional conversational AI responds with text; more advanced systems return structured data or charts. Generative UI, by contrast, means the AI dynamically generates a completely customized user interface for each request.

This has completely changed the paradigm of human - machine interaction and has become the most obvious point of improvement that users can intuitively feel.

At the press conference, Google's example was "How does RNA polymerase work?" Gemini 3 generated an intuitive, clickable interactive tool.

The reason it's called "customized" is that the model adapts its design to the user's intent, usage scenario, and target audience. When explaining microorganisms to a 5-year-old versus an adult, Gemini 3 knows the two require completely different interface designs, interaction modes, and content depths. It can infer that children need large buttons, bright colors, simple language, and gamification elements, while adults need higher information density, professional terminology, and in-depth explanations.
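As a rough illustration of how a developer might approximate this outside the Gemini app, here is a minimal sketch using the google-genai Python SDK to request a self-contained interactive explainer. The model id "gemini-3-pro-preview" and the prompt wording are assumptions; Google's actual Generative UI pipeline is internal to the Gemini app and is not exposed as a public API.

```python
# Minimal sketch: ask the Gemini API for a self-contained interactive
# explainer page. This only approximates the Generative UI idea of
# "prompt in, custom interface out"; it is not Google's own pipeline.
from google import genai

client = genai.Client()  # reads the API key from the environment

prompt = (
    "How does RNA polymerase work? "
    "Respond with a single self-contained HTML file: an interactive, "
    "clickable diagram with large buttons, bright colors, and simple "
    "language suitable for a 5-year-old."
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed model id; check current docs
    contents=prompt,
)

# Save the generated interface and open it in a browser to inspect it.
with open("rna_polymerase_explainer.html", "w", encoding="utf-8") as f:
    f.write(response.text)
```

Changing one line of the prompt (the target audience) is enough to see the interface shift between the child-oriented and adult-oriented designs the article describes.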

This is exactly the ability the new generation of AI should have: going beyond conversation to become a fusion of many kinds of information.

In multi-round conversations, Gemini 3 can understand your aesthetic preferences, coding style, and even the design principles you haven't explicitly stated. If you prefer