
Google's AI roadmap has just been revealed: Is it really going to abandon the attention mechanism? The Transformer has a fatal flaw.

新智元 2025-06-17 10:51
Google AI roadmap: Gemini goes fully multimodal, and a new architecture is needed to break through to infinite context.

The future AI roadmap has been revealed! Google invented the Transformer, yet in this roadmap it admits that the existing attention mechanism cannot achieve "infinite context". That means the next-generation AI architecture will have to be "rewritten from scratch". Is the era of the Transformer really coming to an end? What exactly are Google's plans for the future?

Recently, Google's future AI roadmap came to light!

Logan Kilpatrick, product lead for Google AI Studio and the Gemini API, introduced the future of the Gemini model in a talk at the AI Engineer World's Fair.

Going forward, the focus will be on making Gemini fully multimodal. The model is gradually evolving into an agent, and its reasoning ability will continue to expand.

A quick overview of the key points:

· Full modality (r)

It already natively supports image + audio generation, and video is next.

· Early experiments with Diffusion (r)

Related to diffusion models.

· Agent capabilities by default (m)

It has first-class tool invocation and usage capabilities. More importantly, the model is gradually evolving into an agent.

· Continuous expansion of reasoning ability (s)

One research breakthrough after another is coming.

· More small models (s)

There will be more content to share soon.

· Infinite context (r)

With the current attention mechanism and context-handling methods, this is impossible to achieve. We need brand-new innovations at the core architecture level to achieve this goal.

· Large models

Scale is everything.

Note that (r), (s), and (m) indicate the progress of each item in Google's roadmap:

(s) = short: short-term/coming soon, i.e., projects that are in progress or about to be launched.

(m) = medium: medium-term, i.e., projects still in development that will launch in the next few quarters.

(r) = research: research/long-term, i.e., projects still in the experimental stage or requiring breakthroughs before release.

Silicon Valley tech giants in fierce competition: a mid-year review of AI achievements

Google is clearly on a roll: Gemini 2.5 Pro has helped it regain the upper hand and once again prove its leading position in the AI field.

The popular X account "Chubby" also put together a "mid-year review" of the Silicon Valley tech giants.

OpenAI

It still holds the leading position. With o3, o3-pro, and the upcoming GPT-5, its standing remains stable. It keeps shipping regular updates and frequently releases AI tools, and the growing user numbers speak for themselves.

DeepSeek

After the considerable success of R1, DeepSeek has shipped a series of major updates, but the world is still waiting for its successor, R2. There are currently few clues as to how DeepSeek will proceed.

Anthropic

It remains the leader in software engineering (SWE). If its CEO is to be believed, agents and further progress will automate all of these workflows within the next few years, with general-purpose agents taking over. For now, Anthropic is focusing on the enterprise market (as its lower rate limits suggest) and continues to hold a strong position.

Google

Google, however, may be the biggest winner this year. It has all but leaped from up-and-comer to leader, and Gemini has been a remarkable success. Regular product updates and a stream of announcements, including excellent TPU positioning, make Google's future look bright.

Meta

Undoubtedly, Meta has fallen behind. Llama 4 flopped, and Behemoth has still not been released. Zuckerberg has assembled a new superintelligence team to try to catch up. Whether Alexandr Wang's move from Scale AI to Meta will be a turning point remains to be seen.

Grok

Grok 3.5 is also about to launch, and it is hard to evaluate at present. With the Colossus cluster, Grok is clearly in a favorable position on compute. However, whether it can train a better model remains to be seen.

So what major moves will Google, the most highly rated of the group, make next?

Let's take a close look at Logan Kilpatrick's speech to find the key clues.

Within Google, Gemini 2.5 Pro is widely regarded as a major turning point.

At the conference, Logan Kilpatrick, a former OpenAI employee and now product lead for Google AI Studio, gave a speech packed with information, revealing many details about Gemini 2.5 Pro and Google's plans for the future of Gemini.

There is an amusing story about Logan Kilpatrick: Gemini's joke-telling ability was supposedly trained entirely on his tweets, which is why the jokes aren't funny. 🤣

Currently, Logan Kilpatrick is responsible for Gemini API development and AGI research.

In his speech, Logan Kilpatrick quickly covered three parts:

Some interesting release content about Gemini 2.5 Pro;

A review of the progress of Gemini in the past year;

A look into the future: the model itself, the Gemini App, and follow-up plans for the developer platform.

Regarding Gemini 2.5 Pro, he sees it as a "turning point" both within Google and in the external developer ecosystem:

It has achieved outstanding results in mathematics, programming, and reasoning, firmly ranking first across the leaderboards.

It has laid a solid foundation for the future of Gemini.

Gemini's vision: "Unified Assistant"

Logan Kilpatrick posed a question: What was the connection between Google's products in the past?

Most people would think of a Google account. However, a Google account itself does not "retain state". Its function is just to allow you to log in to each independent product.

Now, Gemini is becoming the "unified thread": the thread that connects all of Google's services.

The Gemini App is very interesting and cool, reflecting how Google thinks about the future of AI products.

He believes that Google's future will look like this:

Gemini will become the unified interface, connecting all Google products to form a real "omnipresent assistant".

Currently, most AI products still require "active user operation": you have to ask questions and request features yourself.

But the most exciting part is the next stage of AI:

"Proactive AI" —— AI actively discovers problems for you, provides suggestions, and automatically handles tasks.

Now, Google is fully betting on the new paradigm shift:

Multimodal capabilities: Native audio processing is already supported in Astra and Gemini Live. The Veo technology leads the industry, and video integration will be the focus of the next stage.

Model evolution: The model is evolving from a simple token processor into an agent with systematic reasoning ability. The scaling of reasoning is particularly worth watching.

Architectural innovation: This includes the small-model ecosystem, solutions for infinite context (which require breaking through the limitations of the existing attention mechanism), and the remarkable token-processing ability demonstrated in early diffusion experiments.

Progress towards a "unified full-modality model"

From the model perspective, Gemini was initially conceived as a unified multimodal model: capable of handling audio, images, and video.

In this regard, Google has made great progress:

At the Google I/O conference, Google announced Gemini's native voice capabilities (text-to-speech (TTS), voice synthesis, and voice interaction);

It already supports conversations that sound very natural;

These capabilities have been integrated into Astra and Gemini Live.

Astra is Google's research prototype, exploring ways to bring breakthrough capabilities into its products.

Currently, Astra integrates a number of these capabilities.

Google is also advancing its Veo video-generation capabilities. Veo has reached SOTA on multiple metrics and will be incorporated into the mainline Gemini model in the future.

In addition, Google is researching "diffusion-based reasoning" with Gemini Diffusion. The project is still at the research frontier and has not entered the mainline, but its prospects are promising.

Gemini Diffusion has very high throughput, sampling more than 1,000 tokens per second.

Agents become mainstream

Recently, Logan Kilpatrick has been thinking: As the system's reasoning ability becomes stronger, what will be the form of future AI products?

In the past, developers always regarded the model as a black-box tool:

Input tokens, output tokens;

Then build various scaffolding externally to enhance functionality.

But now, the situation has changed:

The model itself is becoming more systematic and more capable of doing things autonomously, no longer just a "passive calculator".

He believes that the "reasoning process" will be a core point of change: how to scale the model's reasoning ability.

The question he is very eager to see answered is:

Will much of the scaffolding that was previously done externally be integrated into the model's internal reasoning process in the future? This will completely change the way developers build products.
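
To make the distinction concrete, here is a minimal, purely illustrative sketch of that external scaffolding, written in Python with made-up names (call_model, agent_loop, TOOLS); it is not any real Gemini or Google API, just the shape of the loop developers typically build: the model maps tokens in to tokens out, and an outer loop decides when to run tools and feed the results back.

```python
import json

# Hypothetical tool registry; a stand-in for search, code execution, etc.
TOOLS = {
    "search": lambda query: f"(pretend search results for {query!r})",
}

def call_model(messages):
    """Stand-in for a real model API: input tokens, output tokens, nothing more."""
    if not any(m["role"] == "tool" for m in messages):
        # Pretend the model decided it needs a tool first.
        return json.dumps({"tool": "search", "args": {"query": "infinite context"}})
    return "Final answer, written using the tool result."

def agent_loop(user_input, max_steps=5):
    """The external scaffolding: parse tool requests, run them, feed results back."""
    messages = [{"role": "user", "content": user_input}]
    output = call_model(messages)
    for _ in range(max_steps):
        try:
            call = json.loads(output)          # did the model ask for a tool?
        except json.JSONDecodeError:
            return output                      # plain text: treat it as the final answer
        result = TOOLS[call["tool"]](**call["args"])
        messages.append({"role": "tool", "content": result})
        output = call_model(messages)
    return output

print(agent_loop("Summarize Google's roadmap for me."))
```

If reasoning moves inside the model, much of this loop (tool selection, retries, intermediate state) disappears from application code, which is exactly the shift Kilpatrick is pointing at.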

More roadmap items: small models, large models, infinite context

In addition, Google will also focus on the following new products and research.

More "small models" —— lightweight, suitable for mobile devices and low - power devices;

Larger models: to meet users' expectations for ultimate capabilities;

More importantly: Research breakthroughs in "infinite context".

One of the major flaws of current AI model architectures (such as the Transformer) is that they cannot support infinite context well.

Google believes that since the attention mechanism cannot be scaled indefinitely (its compute and memory cost grows quadratically with context length), a new architecture is necessary.

They are actively exploring how to enable the model to ingest, understand, and efficiently process ultra-large-scale context.
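
As a rough, back-of-the-envelope illustration of why this is hard (plain textbook scaled dot-product attention, not Gemini's actual implementation), the NumPy sketch below shows that the attention score matrix has shape n × n, so compute and memory grow quadratically with context length n.

```python
import numpy as np

def attention(Q, K, V):
    """Plain scaled dot-product attention; the n x n score matrix is the bottleneck."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Works fine on a tiny sequence...
x = np.random.randn(8, 64)
print(attention(x, x, x).shape)                  # (8, 64)

# ...but one fp32 score matrix (single head, single layer) alone scales as n^2:
for n in (100_000, 1_000_000, 100_000_000):
    print(f"n = {n:>11,}: ~{n * n * 4 / 1e9:,.0f} GB for one attention map")
```

At "infinite" context that curve never flattens, which is why the roadmap treats this as a research item requiring architectural change rather than incremental tuning.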

The key points of the upcoming developer features are as follows.

Embedding models may seem like "early-stage AI tools", but they are still core components. Most RAG applications rely on embeddings. Google is about to release a state-of-the-art Gemini embedding model and make it available to more developers.
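
For readers wondering why embeddings are so central to RAG, here is a toy sketch. The embed() function is a deliberately crude stand-in (a hashed bag-of-words); a real application would instead call a hosted embedding model such as the Gemini one mentioned above, but the retrieval logic stays the same.

```python
import zlib
import numpy as np

def embed(text, dim=512):
    """Toy stand-in embedder: hashed bag-of-words, for illustration only."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Index a small "corpus" by embedding each document once.
docs = [
    "Gemini 2.5 Pro excels at math, coding and reasoning",
    "Veo is Google's video generation model",
    "Embeddings power retrieval-augmented generation",
]
doc_vecs = np.stack([embed(d) for d in docs])

# Retrieval is nearest-neighbour search in embedding space.
query = "which model can generate video"
scores = doc_vecs @ embed(query)                 # cosine similarity (vectors are unit-norm)
print(docs[int(np.argmax(scores))])              # print the best-matching document
```

Production systems swap the toy embedder for a real model and the brute-force dot product for a vector index, but the shape of the pipeline is the same, which is why embedding quality quietly determines RAG quality.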

Deep Research API

Users love the "deep research" function