
Dissecting Gemini 3: The Ultimate Execution of Scaling Law and the Power of "Full Modality"

Silicon Valley 101 | 2025-11-24 11:52
Google's comeback

Undoubtedly, Google's newly launched Gemini 3 has once again shaken up the AI landscape in Silicon Valley. While OpenAI and Anthropic are locked in a fierce battle, Google, with its deep infrastructure foundation and its Native Multimodal approach, has transformed from a "chaser" into a "leader."

This time, Gemini 3 not only achieved a new leap in multimodal capabilities but is also regarded as Google's most extreme execution of the Scaling Law to date.

Silicon Valley 101 hosted a live stream on November 20, inviting four guests at the forefront of AI research and application:

  • Tian Yuandong, former research director at Meta FAIR and AI scientist
  • Chen Yubei, assistant professor at the University of California, Davis, and co-founder of Aizip
  • Gavin Wang, former Meta AI engineer, responsible for post-training of Llama 3 and multimodal reasoning
  • Nathan Wang, senior AI developer and special researcher at Silicon Valley 101

Through the release of Gemini 3, we attempted to answer several key questions about the future of AI: What exactly makes Gemini 3 so powerful? What did Google do right? How will the global competitive landscape of large models change? What is the future direction of LLMs, and what are the most cutting-edge AI labs focusing on beyond LLMs?

Below are the condensed views of our guests from the live stream. To watch the full conversation, you can find our replays on YouTube and Bilibili.

01 Experience Test: What Exactly Makes Gemini 3 So Powerful?

Within 48 hours of Gemini 3's release, the major leaderboards were quickly refreshed. Unlike previous models that improved along a single dimension (such as code or text), Gemini 3 is considered a truly "native multimodal" model. For users, how does this improvement in technical parameters translate into actual experience?

Source: LM Arena

Chen Qian: Everyone has been intensively testing Gemini 3 in the past two days. Is it really as dominant as shown in the rankings? Can you give examples of what makes it so good?

Nathan Wang: In the past two days, I mainly used three products: the main Gemini App, Google AntiGravity for developers, and Nano Banana Pro, which was just released today.

To be honest, AntiGravity feels very much like an IDE (Integrated Development Environment) in the Agentic era. The difference between it and Cursor or Claude Code is that it divides the interface into a "Manager View" and an "Editor View."

Previously, in Cursor, although the AI helped us write code, it still felt like "I" was writing. But in AntiGravity, the Manager View makes you feel like you are a manager sitting there, with 8 to 10 Agent subordinates working. You can watch them divide the work, with some writing programs and some running unit tests.

What's most amazing is its combination with the Browser Use function. For example, when I built a front-end web page, it used a feature called Screenshot Pro, which scores very highly: it can directly call the Chrome browser, open the web page, and "look" at the screen to test it. If you ask it to upload a file or click a button, it can operate like a human. This means testing and development are fully automated, becoming an integrated development experience.
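To make that workflow concrete, here is a minimal, purely illustrative Python sketch of the manager-and-subagents pattern Nathan describes. Every class and function name below is a hypothetical stand-in, not AntiGravity's actual API.

```python
# Hypothetical sketch of the "manager dispatching agents" pattern;
# none of these names come from the real AntiGravity product.
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    role: str  # e.g. "write code" or "run unit tests"
    log: list = field(default_factory=list)

    def run(self, task: str) -> str:
        # A real agent would call an LLM and tools here; we just record the step.
        result = f"{self.name} ({self.role}) finished: {task}"
        self.log.append(result)
        return result


def browser_check(url: str) -> bool:
    # Stand-in for the Browser Use step: a real system would drive Chrome,
    # take a screenshot, and let a vision model judge the rendered page.
    print(f"[browser] opening {url}, capturing screenshot, inspecting layout")
    return True  # pretend the visual check passed


def manager_view(feature: str) -> None:
    # The "Manager View": split one feature into subtasks, hand each to a
    # subordinate agent, then verify the result visually in a browser.
    agents = [
        Agent("agent-1", "write code"),
        Agent("agent-2", "run unit tests"),
    ]
    for agent in agents:
        print(agent.run(feature))
    if browser_check("http://localhost:3000"):
        print(f"[manager] '{feature}' passed the end-to-end visual test")


manager_view("add a file-upload button to the front-end page")
```

The design point is the inversion of roles: the human supervises a pool of agents and only the final browser-level check closes the loop, rather than the human driving each edit as in a traditional IDE.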

Additionally, Nano Banana Pro solved a major pain point for me in generating slides. Previously, when I asked AI to make a slide deck, such as "explain Gemini's development path from 1.0 to 3.0," the logical chain was often broken. But this time it not only sorted out the logic but also generated very complex charts. I think the slide-making software on the market may well be replaced by it.

Tian Yuandong: Former research director at Meta FAIR and AI scientist

Tian Yuandong: My general habit when a new model comes out is to see whether it can "continue writing a novel." This is my personal benchmark. Since few people in the world test models this way, they definitely haven't overfit to it, which makes it relatively objective.

One or two years ago, when models wrote novels, they basically had an "official document style." No matter what opening you gave them, what they wrote came out in an official tone, completely out of context. By Gemini 2.5, I found the writing style had improved. For example, when I gave it a scene set in ruins, it would describe it in detail: the way the walls collapsed, the desolate atmosphere of the environment, like what a liberal arts student would write. But the plot was straightforward and not very engaging.

But this time, Gemini 3 surprised me. Not only is the writing good, but it also begins to understand plot "reversals." The plot interactions it designs are very interesting, and I even thought, "Hey, this idea is good. Maybe I can save it for my own novel." This is the first time I felt that AI inspired me in plot conception rather than just piling up words. It seems to understand the author's deep-seated motives.

However, in scientific research brainstorming, it's still the same as before. How to describe it? It's like a newly enrolled, well-read doctoral student. You can ask it anything and it knows the answer, spitting out many new terms and new mathematical tools. You'll think, "Wow, I've never seen this before. It's great." But if you want to discuss the essence of a problem in depth or ask it to judge which direction is more promising, it can't do it. It lacks the intuition and deep thinking that only senior human researchers have. So it is still a top-notch "test-taker"; there has been no fundamental breakthrough in creative thinking yet.

Gavin Wang: First of all, I'm amazed at Google's strength as a big company; its ecosystem is so complete. Technically, I'm most interested in the ARC-AGI-2 benchmark. This test is very interesting: it doesn't test big-data memorization but few-shot learning, even meta-learning. Its creator believes that memorizing data is not intelligence. Real intelligence means being able to quickly extract patterns from one or two examples.

Previously, everyone's scores on this ranking were in single digits or in the teens. Gemini 3 suddenly reached more than 30%, which is a qualitative leap. I think this is due to its multimodal reasoning.

Previously, in the chain of thought, the model was talking to itself, a single-modal progression in the pure language dimension. But Gemini 3 is natively multimodal: it combines visual, code, and language data in pre-training. So when reasoning, it may be looking at the image on the screen while making logical deductions in the language dimension. This cross-modal interplay opens up many new opportunities.
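As a toy illustration of the ARC-style task format Gavin refers to, the Python sketch below infers a rule from a single input/output grid pair and applies it to a new input. The hand-coded "solver" only handles a trivial recoloring rule; it shows the shape of the task, not how Gemini 3 actually solves ARC-AGI-2.

```python
# ARC-style setup: the model sees one or two input->output grid pairs and
# must extract the rule rather than retrieve memorized data.

train_pair = (
    [[1, 0], [0, 1]],   # input grid
    [[2, 0], [0, 2]],   # output grid: every 1 was recolored to 2
)
test_input = [[1, 1], [0, 1]]

# "Few-shot learning": recover the cell-wise color mapping from the one example.
inp, out = train_pair
mapping = {}
for row_in, row_out in zip(inp, out):
    for a, b in zip(row_in, row_out):
        mapping[a] = b

prediction = [[mapping[c] for c in row] for row in test_input]
print(prediction)  # [[2, 2], [0, 2]]
```

Real ARC-AGI-2 tasks involve far richer rules (symmetry, object counting, spatial composition), which is exactly why scores had stayed in the single digits or teens.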

Chen Yubei: Assistant professor at the University of California, Davis, and co-founder of Aizip

Chen Yubei: I've been too busy in the past two days to test it myself, but I've collected first-hand feedback from different groups in our team, including some interesting negative feedback.

First, the feedback from the Vision group. When they ran some internal benchmark tests, they found that Gemini 3's performance on real-world visual understanding actually declined. Sounds counter-intuitive, right?

Specifically, when dealing with real - world scenarios such as security cameras and doorbells to analyze user behavior and potential risk events, its performance was worse than the previous generation. They checked the technical report of Gemini 3 and found that there was only one benchmark related to real - world visual understanding in the report, and it did not cover such complex scenarios.

This actually exposes a common problem in the industry: there is a huge gap between public benchmarks and real deployment scenarios. If everyone optimizes models just to climb the rankings, performance in actual products may fall short.

Additionally, students in the Coding group told me that for scientific writing and programming assistance, they actually found Gemini 2.5 more convenient. Although Gemini 3's reasoning length has increased 2 to 3 times, on extremely complex tasks such as multi-hop search and synthesizing twenty years of financial reports, it still seems less stable than OpenAI's GPT-5 Pro. Of course, this may be because it's an early version and we haven't fully figured out the prompts yet.

02 Google's Technological Secret: Is It "Deep Thinking" or "Superpower"?

Google has gone from lagging behind to catching up and even overtaking. The head of the Gemini project once revealed that the secret lies in "improving pre-training and post-training." Behind this seemingly official answer, what kind of technological roadmap does Google have? Is it a victory of the algorithm itself, or the brute-force aesthetics of piling up computing power?

Tweet on the X platform by Oriol Vinyals, chief scientist at Google DeepMind

Chen Qian: This time, Google not only caught up but also surpassed. The head of the Gemini project mentioned at the press conference that the new version "improved pre-training and post-training." Does this mean the Scaling Law hasn't "hit a wall"? What is Google's secret weapon?

Tian Yuandong: To be honest, the statement "improved pre-training and post-training" is basically a platitude (laughs). Building a model is a systems engineering effort: if the data is better, the architecture is fine-tuned, and training stability is enhanced, with each aspect getting a little better, the final result will definitely be stronger.

But what I'm more interested in is this: if pre-training is done well enough and the model itself becomes very "smart," then in the post-training stage it behaves like a genius student, mastering the material from just a few samples without needing much effort to teach. It seems that Gemini 3's base capabilities are indeed very strong.

Regarding whether it uses any secret weapons, I've heard some rumors that Google finally fixed some bugs in its previous training process. Of course, this is just a rumor and cannot be confirmed. However, for a company of Google's scale, as long as it doesn't make engineering mistakes and optimizes all the details, the Scaling Law will naturally take effect.

Gavin Wang: Former Meta AI engineer, responsible for post-training of Llama 3 and multimodal reasoning

Gavin Wang: Yesterday, I tried chatting with Gemini 3 and asked it, "Why are you so powerful?" (laughs). It analyzed itself for me and mentioned a concept called Tree of Thoughts.

Previously, when we did CoT (Chain of Thought), it was linear, like a linked list progressing step by step. But Gemini 3 seems to use a tree-like search inside the model, combined with a self-rewarding mechanism. That is, it runs multiple ideas simultaneously and has a scoring mechanism: it drops the ideas that don't make sense and continues to develop the promising ones.

This is actually a deep combination of engineering wrapper and model science. Previously, we had to write prompts outside to achieve this, but now Google has integrated it into the model's internal environment. This is not only piling up resources in the vertical direction according to the Scaling Law but also introducing the MoE and Search mechanisms in the horizontal direction. It reminds me of the GPT moment three years ago. Technically, it's very impressive.
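A minimal Python sketch of that tree-search-plus-scoring idea, assuming (as Gavin speculates; Google has not confirmed this mechanism) that candidate "thoughts" are expanded, scored, and pruned at each step. Here expand() and score() are toy stand-ins for model calls.

```python
# Toy Tree of Thoughts-style search: expand several candidate "thoughts"
# per step, score them, prune the weak branches, and keep extending the
# strong ones. expand() and score() stand in for LLM calls.
import heapq


def expand(thought: str) -> list[str]:
    # A real system would sample continuations from the model.
    return [f"{thought} -> idea{i}" for i in range(3)]


def score(thought: str) -> float:
    # Stand-in for a self-reward / verifier model rating partial reasoning.
    return sum(map(ord, thought)) % 7 - 0.01 * len(thought)  # arbitrary toy heuristic


def tree_of_thoughts(root: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [root]
    for _ in range(depth):
        candidates = [c for t in frontier for c in expand(t)]
        frontier = heapq.nlargest(beam, candidates, key=score)  # prune
    return max(frontier, key=score)


# A linear chain of thought is the degenerate case: beam=1, one expansion.
print(tree_of_thoughts("problem"))
```

The contrast with linear CoT is visible in the last comment: widening the beam and adding a scorer is exactly the "run multiple ideas, keep the promising ones" behavior described above.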

Nathan Wang: I'd like to add a detail. When I was looking through the Gemini developer API documentation, I found an Easter egg. In a line of comments, it said, "Context Engineering is a way to go."

This sentence made me think for a long time. Previously, we talked about prompt engineering. Now, Google is talking about context engineering. My personal experience is that, for example, if I want to write a tweet that can go "viral," I'll first ask the AI to search for "how to write a popular tweet" and let it summarize the methodology as context, and then fill in my content for generation.

Google seems to have automated this process. Before the model generates an answer, it may have automatically gathered a large amount of relevant context in the background, constructed an extremely rich chain-of-thought environment, and then generated the result. This may be why it seems to "understand you": it's not just answering but thinking within an engineered context.
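A rough Python sketch of the manual workflow Nathan describes (search, distill the findings into context, then generate); every function here is a hypothetical stand-in for a real search tool or LLM call.

```python
# Manual "context engineering": gather material, distill it into context,
# then generate. All functions are illustrative stand-ins.

def web_search(query: str) -> list[str]:
    # Stand-in for a real search tool.
    return [f"result snippet about '{query}' #{i}" for i in range(3)]


def summarize(snippets: list[str]) -> str:
    # Stand-in for an LLM call that distills snippets into a methodology.
    return "Methodology: hook in line 1, one concrete number, end with a question."


def generate(context: str, content: str) -> str:
    # Stand-in for the final generation call, now grounded in rich context.
    return f"[context: {context}]\nDraft tweet: {content}"


# search -> distill -> generate
snippets = web_search("how to write a popular tweet")
context = summarize(snippets)
print(generate(context, "Gemini 3 launch takeaways"))
```

What Nathan suggests is that Gemini 3 may run the first two steps of this pipeline automatically before answering, so the user only ever sees the final call.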

Chen Yubei: Beyond the algorithm level, I'd like to offer a more fundamental economic perspective. My friend Brian Cheng once made a point I find very on-point: Google can implement the Scaling Law so firmly and thoroughly because it has an unparalleled hardware advantage, the TPU.

Think about it: if other companies want to train large models, they have to buy NVIDIA's GPUs, and NVIDIA's hardware gross margin is over 70%. Google is different. It is fully vertically integrated across software and hardware: it uses its own TPUs, with no middleman taking a cut. This makes its unit economics excellent. With the same budget, Google can train larger models, process more data, and run more expensive multimodal experiments.
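As a back-of-envelope illustration of that unit-economics point (the numbers are purely illustrative, not Google's or NVIDIA's actual costs): if an accelerator sells at roughly a 70% gross margin, a buyer pays about 3.3 times production cost, while a vertically integrated builder pays roughly production cost.

```python
# Illustrative unit-economics arithmetic. With gross margin
# m = (price - cost) / price, the sale price is cost / (1 - m).
margin = 0.70
production_cost = 1.0                         # normalized cost to make one chip
buyer_price = production_cost / (1 - margin)  # ~3.33x production cost

budget = 100.0
chips_if_buying = budget / buyer_price          # ~30 chips
chips_if_vertical = budget / production_cost    # ~100 chips

print(f"compute advantage: {chips_if_vertical / chips_if_buying:.1f}x")  # ~3.3x
```

Under these toy assumptions, the same training budget buys roughly 3x the compute, which is the asymmetry the argument rests on.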

So, as long as the Scaling Law still requires piling up computing power, Google's asymmetric hardware advantage will put great pressure on OpenAI and Anthropic. Unless NVIDIA lowers its prices or other companies develop their own chips, this moat is very deep.

03 Developer Ecosystem: Is the Coding Battle Over?

With the release of Gemini 3 and AntiGravity, and their dominance on coding leaderboards such as SWE-bench, remarks have appeared on social media saying that "the coding battle is over." Is Google using its vast ecosystem (Chrome, Android, Cloud) to build an insurmountable moat against startups like Cursor?

Chen Qian: Many people say that the coding battle is over. Gemini 3 combined with Google's suite of products will sweep away all competitors. What does this mean for startups like