Two big names have left back to back. What exactly is Google's bug?
In less than a week, two heavyweight employees have left Google, one after the other.
First came Noam Shazeer, the former vice president of engineering at Google DeepMind, and then John Jumper, the lead behind AlphaFold.
To be honest, I can't help but suspect that Google is having some "glitches" now.
It's been more than half a year since the launch of Gemini 3, and Google has only reached Gemini 3.1, which isn't much different. Compare Anthropic: half a year ago it was still on Opus 4.5, and now Fable 5 has been out for over a week.
It's not just that Google is falling behind in terms of models; its products are also lagging. Nowadays, almost all AI companies are focusing on AI Agents. OpenAI has Codex, and Anthropic has Claude Code.
With Fable 5 behind it, Claude Code can now fix bugs autonomously, run tests in an automatic loop until they all pass, generate production code directly from design drafts, and finally package everything into a complete piece of software.
Google, by contrast, has only an unimpressive Antigravity 2.0. Its performance is poor and its user experience is a mess; online, the product draws almost nothing but complaints.
Speaking of which, one thing has to be mentioned. Berkshire started building its position in Google in 2025, and by the first quarter of 2026 it had increased that position by 224%.
On June 1, 2026, Berkshire invested an additional $10 billion in Google's parent company, Alphabet, through a private placement.
Has Buffett really made a wrong call this time?
01
Where has Google's full-stack advantage gone?
On November 18, 2025, Google released Gemini 3. Sundar Pichai personally endorsed it, calling it Google's "smartest model", with world-class reasoning, multimodal understanding, and code generation.
On the same day, Google released two other things: Google Antigravity, a development platform billed as "agent-first", and Nano Banana Pro, a more powerful version of Nano Banana, Google's already popular text-to-image model.
How intimidating was Google at that time? Two weeks after the launch, Sam Altman sent OpenAI staff a "Code Red" memo saying that Google was rapidly closing ChatGPT's lead in product experience and quality. The company therefore suspended all other projects and put every employee to work on ChatGPT.
What worried Altman was not just these three products, but Google's full-stack advantage.
In terms of hardware, Google has its self-developed TPU chips. Google started working on TPUs in 2015, and the line has now reached its seventh generation, Ironwood. One Ironwood chip delivers the computing power of four chips from the previous generation. With liquid cooling, a single pod holds 9,216 chips and provides 42.5 ExaFlops of computing power.
Unlike NVIDIA's general-purpose GPUs, TPUs are optimized specifically for AI workloads such as inference, so they run those workloads at lower cost and with better performance.
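As a quick sanity check, the pod figures quoted above can be turned into a per-chip number. This is a minimal sketch that assumes the pod total is simply the sum of identical per-chip peaks; the function name is made up for illustration.

```python
# Figures quoted in the article for one Ironwood pod.
POD_EXAFLOPS = 42.5
CHIPS_PER_POD = 9_216


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Average peak throughput per chip, in petaFLOPS (1 exaFLOPS = 1,000 petaFLOPS)."""
    return pod_exaflops * 1_000 / chips


print(f"{per_chip_petaflops(POD_EXAFLOPS, CHIPS_PER_POD):.2f} PFLOPS per chip")
# prints "4.61 PFLOPS per chip"
```

Under that assumption, each chip contributes roughly 4.6 petaFLOPS of peak compute.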
One level up is DeepMind.
In April 2023, Google merged Google Brain and DeepMind into one unit. Although both belonged to the same company, they had long operated as separate systems with separate cultures: Brain leaned toward products and commercialization, while DeepMind leaned toward long-term research.
After the merger, Demis Hassabis led the unified team, and Jeff Dean stepped back to become the chief scientist. That is to say, Google's "left and right brains" were combined.
Going further up, there is something that many people tend to overlook: the entry points. Google doesn't just have models. It has Chrome, Android, YouTube, Google Maps, Gmail, Google Workspace, and Google Search.
Combined, these have billions of daily active users. No AI company in the world has this scale of users. It can promote products through these entry points and then use these mature products to get user feedback, accelerating the development and iteration of the entire product.
That feedback shows, for example, at which step users drop off, which capabilities are called repeatedly, which generated results users edit or discard, which features drive retention, and which scenarios produce large numbers of errors and complaints.
Take Nano Banana as an example.
Small as this product is, Google's full stack gives it a complete flywheel of its own.
After Nano Banana took off in blind-test arenas such as LM Arena, Google immediately launched it in the Gemini App, AI Studio, and the Gemini API. It did not even leave out Vertex AI, its platform for enterprises.
Users can try Nano Banana through all of these products, and Google can collect feedback through them as well. That is why Nano Banana iterates so quickly and can outperform GPT-4o at image generation.
So why has Google's full-stack advantage disappeared now?
Text-to-image is a low-risk, short-chain product with immediate results.
The user types a sentence and gets an image within tens of seconds. If not satisfied, they can try again; if satisfied, they can share it. It requires no long-term memory and no tool permissions, and a mistake carries no real-world consequences.
But Agents are different. An Agent does not just "give the user a result". It has to be fully integrated into the user's work environment, continuously read the context, call tools, perform operations, and take responsibility for the final result.
The success of Nano Banana cannot be fully replicated in Agents.
When a product has to span models, permissions, execution environments, enterprise systems, and long-term responsibilities, Google's formidable full-stack capabilities start to show coordination problems.
02
Google's real problem is a chaotic organizational structure
If you look at Google's developer product line, you'll find a very strange phenomenon: Google has several tools at once that all help you write code with AI, and their functions overlap almost entirely.
Gemini CLI, a command-line tool, can search codebases, generate applications, and automatically execute complex workflows. It was launched along with Gemini 3 at the end of 2025. In June 2026, Google announced that Gemini CLI would be replaced by Antigravity CLI.
Jules, an asynchronous coding Agent, is from Google Labs. It's designed to automatically fix bugs, write tests, and submit Pull Requests for you. You don't need to keep an eye on it. You just give it a task, and it will clone the repository, write code, open a PR, and notify you when it's done.
Code Assist, an enterprise-grade programming assistant under Google Cloud, runs in VS Code and JetBrains IDEs and costs $22.80 to $54 per user per month. Firebase Studio, a full-stack development workbench in the browser, has Gemini built in and can also generate code for you.
Then there's the perennially underperforming Antigravity. As mentioned earlier, version 2.0 was released at the I/O conference in May 2026, split into five parts: a desktop app, a CLI, an SDK, Managed Agents, and an enterprise layer.
They are all doing the same thing, but they are developed by different teams, have different brand names, different entry points, different charging models, and some even replace each other.
This situation is not about having a rich product line; it's a waste of computing power.
The root cause of this actually lies in the organizational structure.
Google's AI Agent-related capabilities are split across several independent organizations, each with its own KPIs and its own reporting line.
Google DeepMind, for example, cares whether the model's benchmark scores can surpass those of GPT and Claude. Success, for DeepMind, means "we've created the most powerful model".
It doesn't care at all about the success rate of users completing a real project in Antigravity.
Google Labs, for its part, cares only about whether something is cool and can spark discussion on social media.
The products of Google Labs include CC (the AI assistant in Gmail), Project Genie (infinite world generation), Pomelli (an AI marketing tool), Opal (creating small applications with natural language), and Jules.
After the experiment is over and the hype fades, the team may move on to the next experiment and won't maintain the product in the long term.
Google Cloud and Vertex AI are concerned with whether the model can be called through the API, whether enterprises can purchase it, whether permissions and compliance are covered, and whether Agents can be deployed in the production environment.
Antigravity is even more pitiful. It originated from Google DeepMind and is now maintained by Google Labs. But it must also be integrated into Google Cloud's permission, deployment, and compliance systems.
So no one will take responsibility for it, and it just languishes.
You may ask, what about Pichai? How can he deal with this?
DeepMind says, "Our model has topped the charts again." Labs says, "Jules has had 100,000 reposts on social networks again." Then Google Cloud says, "The Agent Engine has signed so many enterprise customers." The Gemini App says, "This month's DAU has been stable." Search says, "The number of users of AI Overviews has exceeded 2 billion."
Everyone has kept their jobs, but in the end, Antigravity is left in a mess.
But no one can answer the simplest question: Which Google tool should a developer use today to complete their work? If they are currently using Codex or Claude Code, which Google product is going to win them over?
03
Winning the evaluation doesn't mean the task is really delivered
All of Google's current narratives revolve around scores, but people don't blindly trust benchmarks anymore. A good model is one that can deliver tasks.
A model may score highly on benchmarks: it answers reasoning questions correctly, generates code, understands images, and stays coherent across multi-turn conversations.
These tests are usually run in a controlled environment: single-turn or limited-turn, with clean inputs and outputs, no external tools to operate, no permissions to manage, and no need to run continuously for long periods.
What does failure look like there? The answer is wrong, and the worst case is that you try again.
But when it comes to task delivery, the value of the model changes.
When a user gives a real-world task to AI and finally gets a usable result, the process in between is actually very long.
What is a "real-world task"? It's something like "There's a bug in the payment module of this project. Please locate it, fix it, test it, and submit a PR." It involves multiple steps and may take tens of minutes or even hours. Along the way, it may need to call Git, the terminal, the browser, the file system, and APIs. Every step can fail.
What does failure look like in this case? It's not that the answer is wrong, but that the code is broken, permissions can't be controlled, the process gets stuck, the environment crashes, and the user doesn't know where to start to recover.
Let me give you an example.
Suppose a model gets each individual step right 95% of the time. That seems very powerful. But if a real development task requires 20 consecutive key steps, and the steps fail independently, the probability of getting through all of them without an error is only 0.95^20, or approximately 36%.
Even if the per-step success rate rises to 98%, the probability of completing all 20 steps is still only about 67%.
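The arithmetic above is easy to reproduce. The sketch below assumes, as the example does, that the steps succeed or fail independently; the function names are made up for illustration. It also inverts the calculation to show how reliable each step must be to reach a target end-to-end success rate.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step success rate needed to hit `target` across `steps` steps."""
    return target ** (1 / steps)


for p in (0.95, 0.98):
    print(f"{p:.0%} per step over 20 steps: {chain_success(p, 20):.1%}")
# prints "95% per step over 20 steps: 35.8%"
# prints "98% per step over 20 steps: 66.8%"

print(f"needed per step for 90% overall: {required_per_step(0.90, 20):.1%}")
# prints "needed per step for 90% overall: 99.5%"
```

Read the other way around, finishing a 20-step task nine times out of ten requires each step to succeed about 99.5% of the time, which is why small gains in per-step accuracy alone do not close the gap.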
So the real moat for Agent products is not to raise the benchmark score by another two points, but to establish a reliable mechanism for error recovery, state saving, permission confirmation, manual takeover, rollback, and result verification.
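To make that list concrete, here is a minimal sketch of a step runner that combines the mechanisms named above: state saving, permission confirmation, retry on error, result verification, rollback, and manual takeover. Every name in it (`Step`, `run_task`, `approve`) is hypothetical and invented for illustration; it does not describe how Antigravity, Codex, or Claude Code actually work.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # hypothetical flat task state, e.g. {"patched": False}


@dataclass
class Step:
    name: str
    run: Callable[[State], State]     # works on a copy of the state; may raise
    verify: Callable[[State], bool]   # result verification
    needs_approval: bool = False      # permission confirmation before running


def run_task(steps, state, approve, max_retries=1):
    """Run steps in order and return (status, state, log)."""
    log = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append(f"{step.name}: not approved, handing over")
            return "manual_takeover", state, log
        checkpoint = dict(state)  # state saving (a shallow copy suffices for a flat dict)
        for attempt in range(1, max_retries + 2):
            try:
                candidate = step.run(dict(checkpoint))
                if step.verify(candidate):
                    state = candidate
                    log.append(f"{step.name}: ok on attempt {attempt}")
                    break
                log.append(f"{step.name}: verification failed on attempt {attempt}")
            except Exception as exc:  # error recovery: record the error and retry
                log.append(f"{step.name}: error on attempt {attempt}: {exc}")
        else:
            # Every attempt failed: discard partial work and stop.
            log.append(f"{step.name}: rolled back to checkpoint")
            return "rolled_back", checkpoint, log
    return "done", state, log


# Usage: the first step fails once and is retried; the second needs approval.
calls = {"n": 0}


def flaky_fix(s: State) -> State:
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("patch did not apply")
    s["patched"] = True
    return s


steps = [
    Step("fix bug", flaky_fix, lambda s: s.get("patched", False)),
    Step("open PR", lambda s: {**s, "pr": True}, lambda s: s["pr"], needs_approval=True),
]
status, final, log = run_task(steps, {"patched": False}, approve=lambda name: False)
print(status)  # prints "manual_takeover"
```

The point of the design is that a failed step never leaves the user guessing: either the state advances after verification, or it returns to the last checkpoint, or control passes back to a human with a log of what happened.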
But even with Antigravity 2.0, there is still no such complete mechanism.
Read the official Gemini 3 blog post: it opens with a note written by Pichai himself, and what follows is all benchmark comparison tables.
The official blog posts that OpenAI and Anthropic now publish for their new models, by contrast, are full of customer testimonials about the models.
It's not that benchmarks are useless. Benchmarks are certainly useful; they are a measuring tool. But if the entire narrative of an Agent product revolves around benchmarks, it means that this model really can't do the job.
Google can't give up on AI Agents because this sector is really profitable. Just look at its competitors, and you'll understand.
In February 2026, OpenAI's standalone Codex desktop app passed 1 million downloads within its first week. Just two months later, Codex reached 4 million weekly active users.
Claude Code goes without saying. In its fundraising materials in February, Anthropic hinted that the product's annualized revenue had exceeded $2 billion.
It's been more than a month since the release of Antigravity 2.0, and if you visit its official website now, you'll find that there is still no pricing for the enterprise version.
Claude Code can be paid for per user through Claude Team, and Codex can be subscribed to through ChatGPT Business or ChatGPT Enterprise, also per user.
With Google, an enterprise that wants to use Antigravity 2.0 can only go through Gemini Enterprise Agent, which offers some free usage quota as a trial. It cannot be bought as a paid product the way OpenAI's and Anthropic's can.
So my guess is that Shazeer and Jumper left Google because they were disappointed with the company.
This article is from the WeChat official account "Zimu AI". Author: Miao Zheng, Editor: Wang Jing. Republished by 36Kr with permission.