Google has open-sourced two "pocket rocket" models. At just 270 million parameters, they outperform state-of-the-art models of comparable size.
Google's technical depth really does run deep!
Having just used Gemini 3 Pro and Flash to blunt OpenAI's momentum in the "large model" arena, Google is now quickly turning its attention to "small models" on the edge!
Last night, Google published two new technical blog posts in one go, both about on-device (edge-side) technology.
One is T5Gemma 2, an innovation at the level of the underlying architecture. It is the first open-source multimodal, long-context encoder-decoder model, with the smallest version at 270M–270M (a 270M encoder paired with a 270M decoder).
The other is FunctionGemma, a 270M (270 million parameter) model optimized specifically for function calling, small enough to run on phones, in browsers, and on other edge devices.
Both T5Gemma 2 and FunctionGemma belong to the Gemma 3 family; compared with "large models" like Gemini, Gemma is the "small model" line.
Although both are small models, they are like two apprentices from the same school who chose different specializations.
T5Gemma 2 focuses on architecture efficiency and multi-modality (a return to the Encoder-Decoder architecture).
FunctionGemma, on the other hand, focuses on agents and tool usage (Function Calling capabilities).
T5Gemma 2 uses a different architecture from today's popular LLMs and can be seen as "another path" in AI.
Paper link: https://arxiv.org/pdf/2512.14856
Google has open-sourced T5Gemma 2 pre-trained models in three sizes: 270M–270M, 1B–1B, and 4B–4B.
Open-source link: https://huggingface.co/collections/google/t5gemma-2
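For readers who want to try it, here is a minimal loading sketch. It assumes the released checkpoints are exposed through Hugging Face's standard encoder-decoder (seq2seq) Auto classes, and the repo id used below is a guess based on the size naming rather than a confirmed path, so check the collection page above for the real one.

```python
# A minimal sketch, assuming the checkpoint is available via the standard
# encoder-decoder (seq2seq) Auto classes; the repo id below is a guess based
# on the size naming and may differ from the actual one in the collection.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-2-270m-270m"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Encoder-decoder inference: the encoder reads the whole prompt,
# then the decoder writes the answer token by token.
inputs = tokenizer("Summarize: Google released two small open models last night.",
                   return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```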
FunctionGemma is a "skills" variant: a model trained specifically around one skill.
It is a bit like stripping a large model of its general knowledge and keeping only a sharply focused function-calling capability.
Open-source link: https://blog.google/technology/developers/functiongemma/
In-depth Technical Analysis of the T5Gemma Series
Let's first look at what the "new structure" of T5Gemma 2 brings:
Powerful multi-modal performance: It outperforms Google's own Gemma 3 in multiple benchmark tests.
Comprehensively improved general capabilities: In tasks such as coding, reasoning, and multi-language processing, T5Gemma 2 generally performs better than the corresponding-sized Gemma 3 models.
Excellent long-context capabilities: Compared to Gemma 3 and the first-generation T5Gemma, it delivers significantly better generation quality on long contexts.
Like the first-generation T5Gemma, T5Gemma 2 outperforms corresponding-sized Gemma 3 models at the pre-training stage, and the gap widens further after post-training.
To understand why Google built T5Gemma, we need to step back and look at how the main technical routes for large models have evolved.
T5Gemma can be regarded as a "classical revival" in the field of large models.
In an era dominated by Decoder-only architectures such as GPT and Llama, T5Gemma 2 represents a return and modernization of the Encoder-Decoder route in the classic Transformer architecture.
The well-known GPT, Gemini, and DeepSeek all adopt the Decoder-only architecture.
GPT series (OpenAI): From GPT-1 to the current GPT-4o, all are Decoder-only.
DeepSeek: Whether it's DeepSeek-V2 or the latest V3, the core is Decoder-only (combined with MoE mixture-of-experts technology).
Llama (Meta): It is currently the benchmark for Decoder-only models in the open-source community.
Gemini (Google): The flagship models (Pro/Flash) are primarily Decoder-only.
Currently, almost all the well-known super models used for "chatting" are Decoder-only.
Why is T5Gemma 2 considered a "return"?
This brings us to the history of how the Transformer architecture split.
To understand the "return", we first need to see how the architectures "split" in the first place.
In 2017, when Google published the paper "Attention Is All You Need" and proposed the Transformer, the original architecture was a complete Encoder-Decoder set.
However, it later split into three schools (a short code sketch after the list shows all three in action):
School A: Encoder-only (using only the encoder)
Representative: BERT.
Specialty: It can only "read" but not "write". It is extremely good at multiple-choice questions, classification, and sentiment analysis, but it struggles when asked to write an essay.
School B: Decoder-only (using only the decoder)
Representative: GPT.
Specialty: It can only "guess the next word". It does not see the context as completely as an Encoder (it looks only to the left, never to the right), but it is a natural text generator, and people found that once such a model is made large enough, intelligence unexpectedly emerges.
That is, it "accidentally" opened up our current AI era (laughs).
School C: Encoder-Decoder (retaining the full set)
Representatives: T5 (Google), BART.
Specialty: It can both read and write. This is the school that T5Gemma 2 belongs to.
The full name of T5 is Text-to-Text Transfer Transformer; all five words start with a T, hence "T5".
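To make the three schools concrete, here is a rough sketch using well-known public checkpoints (BERT, GPT-2, T5) and the matching Hugging Face Auto classes; these specific checkpoints are just familiar stand-ins for each school.

```python
# A rough illustration of the three schools using well-known public checkpoints
# (BERT, GPT-2, T5) and the matching Hugging Face Auto classes.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,    # School A: Encoder-only (BERT): reads, fills in blanks
    AutoModelForCausalLM,    # School B: Decoder-only (GPT): writes, predicts the next word
    AutoModelForSeq2SeqLM,   # School C: Encoder-Decoder (T5): reads the input, then writes
)

# School A: BERT predicts a masked word using context from both sides.
tok_a = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
ids = tok_a("The capital of France is [MASK].", return_tensors="pt")
mask_pos = (ids.input_ids[0] == tok_a.mask_token_id).nonzero().item()
pred_id = bert(**ids).logits[0, mask_pos].argmax().item()
print(tok_a.decode([pred_id]))  # most likely "paris"

# School B: GPT-2 sees only the left context and keeps guessing the next word.
tok_b = AutoTokenizer.from_pretrained("gpt2")
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")
ids = tok_b("The Transformer architecture was proposed in", return_tensors="pt")
print(tok_b.decode(gpt2.generate(**ids, max_new_tokens=10)[0]))

# School C: T5 encodes the full input first, then its decoder writes the answer.
tok_c = AutoTokenizer.from_pretrained("t5-small")
t5 = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = tok_c("translate English to German: The house is small.", return_tensors="pt")
print(tok_c.decode(t5.generate(**ids, max_new_tokens=20)[0], skip_special_tokens=True))
```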
So why did the Decoder-only (GPT school) later dominate the field?
- Simple and brute-force training
You just need to feed a large amount of text from the Internet into the model and let it continuously predict the next word (self-supervised learning).
- Extremely high upper limit
That is the Scaling Law. People found that Decoder-only models gain the most intelligence as they grow, and scaling up compute for them is easier from an engineering standpoint.
- The Encoder-Decoder was neglected
Its structure is more complex (two stacks of parameters), training is a bit more troublesome than for Decoder-only models, and at very large scale (hundreds of billions of parameters) its cost-effectiveness does not look as good as a pure Decoder's.
So only a deep-pocketed company like Google has the spare capacity to come back to this classic route and keep investing in its research and development.
When the whole world was crazy about Decoder-only models, Google suddenly made a comeback.
Since Decoder-only models are so powerful, why return to the Encoder-Decoder architecture?
Because Google has discovered several weaknesses of Decoder-only models, which are exactly the strengths of Encoder-Decoder models:
The "hallucination" problem (making things up):
Decoder-only (GPT)
It thinks while it writes; sometimes it gets carried away and cannot stop, confidently making things up with a straight face.
Encoder-Decoder (T5)
It follows the principle of "understand first (Encoder), then write (Decoder)".
The Encoder forces the model to fully digest your input first, producing a complete "central idea" vector, and only then lets the Decoder write the output.
This mechanism is inherently more disciplined and produces fewer hallucinations (see the sketch below for what the two steps look like in code).
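Here is a minimal sketch of that two-step flow, using the small public t5-small checkpoint purely as a stand-in for any encoder-decoder model; a T5Gemma 2 checkpoint would be driven with the same calls.

```python
# A minimal sketch of "understand first, then write", using t5-small as a stand-in
# for any encoder-decoder model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(
    "summarize: The encoder reads the entire input before a single word is written.",
    return_tensors="pt",
)

# Step 1 ("understand"): one full bidirectional pass over the input produces a
# sequence of hidden states, the "central idea" the article describes.
encoder_states = model.get_encoder()(**inputs).last_hidden_state
print(encoder_states.shape)  # (batch, input_length, hidden_size)

# Step 2 ("write"): the decoder generates token by token, cross-attending to the
# encoded input at every step. (generate() recomputes the same encoder pass internally.)
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```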
Natural advantages in multi-modality
If you want the model to understand images, the Encoder is the best "eye".
T5Gemma 2 can directly feed image signals to the Encoder, which is much smoother than forcing a Decoder-only model to handle them.
Edge-side efficiency (running on mobile phones)
On compute-constrained devices like phones, for tasks such as translation, summarization, or executing instructions, an Encoder-Decoder model can often match a much larger Decoder-only model with fewer parameters (and less memory).
The emergence of T5Gemma 2 is not to overthrow GPT but to revive the Encoder-Decoder architecture in specific fields (such as mobile devices, translation, tool calls, and rigorous reasoning).
Google did not train T5Gemma from scratch; instead it adopted an efficient technique called "model adaptation".
The core idea is to take a Gemma 2 or Gemma 3 decoder model that has already been trained on trillions of tokens as the seed, and map its weights onto the new encoder-decoder structure.
This approach greatly reduces computational costs while allowing the model to inherit the original language understanding capabilities.
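As a rough mental model of what "model adaptation" means here, the sketch below copies the shared pieces of a trained decoder-only network into both halves of a fresh encoder-decoder network; the module names and the mapping rule are illustrative assumptions, not Google's actual recipe.

```python
# A conceptual sketch only: seed an encoder-decoder model with weights from a trained
# decoder-only model. The naming convention (e.g. "encoder.layers.3.mlp.weight" mapping
# back to "layers.3.mlp.weight") is an assumption for illustration, not Google's code.
import torch.nn as nn

def adapt_decoder_to_encdec(decoder_lm: nn.Module, encdec_lm: nn.Module) -> nn.Module:
    src = decoder_lm.state_dict()
    dst = encdec_lm.state_dict()

    for name, tensor in dst.items():
        # Strip the "encoder."/"decoder." prefix so both halves look up the same
        # source parameter from the pretrained decoder.
        src_name = name.replace("encoder.", "", 1).replace("decoder.", "", 1)
        if src_name in src and src[src_name].shape == tensor.shape:
            dst[name] = src[src_name].clone()
        # Parameters with no counterpart (e.g. newly added cross-attention blocks)
        # keep their fresh initialization and are learned during adaptation training.

    encdec_lm.load_state_dict(dst)
    return encdec_lm
```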
FunctionGemma: The Dedicated Brain for Agents
If T5Gemma represents an innovation in the underlying architecture, then FunctionGemma represents an innovation in function implementation.
FunctionGemma is designed to address the most painful point in putting large models to work: a model should not only be able to chat, it should be able to actually carry out tasks.
Function calling: When an ordinary model is asked to "set an alarm" or "check the weather", it often just makes something up. FunctionGemma, after specialized fine-tuning, can reliably output the structured data (such as a JSON function call) that an app or API can execute directly.
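To make that concrete, here is a hedged sketch of the kind of flow FunctionGemma targets. The repo id, tool schema, and prompt format below are illustrative assumptions, not the documented FunctionGemma interface; the official blog post linked above describes the real format.

```python
# Illustrative sketch only: the repo id, prompt format, and tool schema are assumptions.
import json
from transformers import AutoTokenizer, AutoModelForCausalLM

TOOLS = [{
    "name": "set_alarm",
    "description": "Set an alarm on the device.",
    "parameters": {"time": "HH:MM, 24-hour clock", "label": "short description"},
}]

model_id = "google/functiongemma-270m"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = (
    f"Available tools: {json.dumps(TOOLS)}\n"
    "User: Wake me up at 7:30 tomorrow for the flight.\n"
    "Respond with a single JSON function call."
)
inputs = tokenizer(prompt, return_tensors="pt")
out_ids = model.generate(**inputs, max_new_tokens=64)[0]

# Decode only the newly generated tokens, which a function-calling model is trained
# to emit as structured data, e.g.:
#   {"name": "set_alarm", "arguments": {"time": "07:30", "label": "flight"}}
text = tokenizer.decode(out_ids[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
call = json.loads(text)  # assumes the model emits pure JSON, per its fine-tuning objective
print(call["name"], call["arguments"])
```

In a real app, the parsed call would then be dispatched to the corresponding device API (the alarm clock, the weather service), with the result optionally fed back to the model.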