
Running the full Gemma 3n with 2GB of RAM: the first model under 10 billion parameters to break 1300 on LMArena, setting a new record.

AI前线 (AI Frontline), 2025-06-27 21:06

On June 26th local time, after its first preview at Google I/O last month, Google has now officially released the full version of Gemma 3n, which can run directly on local hardware.

“Can't wait to see the performance of these Android models!” a developer said after the official release.

The Gemma series is Google's family of open-source large models. Unlike Gemini, Gemma is aimed at developers and can be downloaded and modified, whereas Gemini is Google's closed, proprietary model line, focused more on performance and commercialization.

The officially released Gemma 3n accepts image, audio, and video input, produces text output, and can run on devices with as little as 2GB of memory. It is also reported to perform better on tasks such as coding and reasoning. The main highlights of the update include:

Multimodal by design: natively supports image, audio, video, and text input, with text output.

Optimized for on-device use: Gemma 3n is built for operating efficiency and comes in two sizes defined by "effective parameters": E2B and E4B. Although their raw parameter counts are 5B and 8B respectively, architectural innovations bring their runtime memory footprint down to that of traditional 2B and 4B models, so they can run with as little as 2GB (E2B) and 3GB (E4B) of memory.

On benchmarks, Gemma 3n's E4B model is the first model under 10B parameters to score above 1300 on LMArena, outperforming Llama 4 Maverick 17B, GPT-4.1 nano, and Phi-4.

How well does it work?

"Gemma 3n is also the most comprehensive day-one launch I've seen for any model: Google has partnered with 'AMD, Axolotl, Docker, Hugging Face, llama.cpp, LMStudio, MLX, NVIDIA, Ollama, RedHat, SGLang, Unsloth, and vLLM', so there are dozens of ways to try it out right now," said Simon Willison, co-creator of the Django web framework.

Willison ran two versions on a Mac laptop. On Ollama, the 7.5GB build of the 4B model drew this picture:

He then used the 15.74GB bfloat16 version of the model and got this picture:

"There's a pretty significant visual difference between the 7.5GB and 15GB quantizations," Willison said. He also noted that the Ollama version does not appear to support image or audio input yet, but the mlx-vlm version does.
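For anyone who wants to reproduce a quick local test like this, the sketch below uses Ollama's Python client with a text-only prompt, matching the text-only support noted above. The "gemma3n:e4b" tag and the num_ctx value are assumptions; check the tag your Ollama installation actually pulled.

```python
# Minimal sketch: text-only chat with Gemma 3n through Ollama's Python client.
# The "gemma3n:e4b" tag and num_ctx value are assumptions; adjust them to
# whatever `ollama list` shows on your machine.
import ollama

response = ollama.chat(
    model="gemma3n:e4b",
    messages=[{"role": "user", "content": "Explain what an 'effective parameter' is in two sentences."}],
    options={"num_ctx": 8192},  # raise the context window beyond the default
)
print(response["message"]["content"])
```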

However, when asked to describe the picture above, the model misidentified it as a chemistry diagram: "The image is a cartoon-style illustration depicting a molecular structure against a light blue background. The structure consists of several differently colored and shaped elements connected by curved black lines."

In addition, netizen pilooch praised the model for being fully compatible with everything previously built on Gemma 3: "After I dropped it into my vision-language model fine-tuning script, the program started up without a hitch (using Hugging Face Transformers code). Running LoRA fine-tuning on a single GPU, the E4B model uses only 18GB of VRAM at a batch size of 1, while Gemma-4B needs 21GB. The Gemma 3 series from DeepMind is really good and ranks first among open-source vision-language models."
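As a rough illustration of the kind of single-GPU LoRA run described above, here is a hedged sketch using Hugging Face Transformers plus PEFT. The "google/gemma-3n-E4B-it" model id, the auto class, and the target module names are assumptions; verify them against the actual checkpoint and a recent Transformers release.

```python
# Hedged sketch of a single-GPU LoRA setup for Gemma 3n E4B with Transformers + PEFT.
# The model id, auto class, and target_modules below are assumptions to verify.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E4B-it"              # assumed Hugging Face checkpoint name
processor = AutoProcessor.from_pretrained(model_id)  # would prepare image+text batches
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # common attention projections; check the real module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
# Training would then proceed with a batch size of 1, as in the report above.
```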

Another developer said, "I've been trying out E4B in AI Studio and the results are very good, much better than I expected from an 8B model. I'm considering installing it on a VPS so I have an alternative to those expensive APIs."

In developer RedditPolluter's testing, E2B-it was able to use the Hugging Face MCP, but he had to raise the context length limit above the default of roughly 4,000 tokens to keep the model from falling into an infinite search loop. With that change, it could use the search tool to retrieve information about newer models.

Of course, doubts remain about how useful small models are in practice. "I've run a lot of experiments, and any model smaller than 27B is basically useless, except as a toy. The best I can say for small models is that they sometimes give good answers, but that's not enough."

In response, some netizens said, "I've found that the best use case for micro-models (<5B parameters) is as a reference tool when there's no Wi-Fi. When writing code on a plane, I've been using Qwen on my MacBook Air instead of Google Search, and it's very effective for asking basic questions about syntax and documentation."

What are the core technical capabilities?

At the core: the MatFormer architecture

Google points out that the key to this efficiency is the new MatFormer (Matryoshka Transformer) architecture, a nested Transformer designed for elastic inference. Like a Russian nesting doll, a larger model contains a smaller but complete sub-model inside it. This lets one model run at different "sizes" for different tasks, dynamically balancing performance against resource usage.

This design extends the idea of "Matryoshka Representation Learning" from the embedding layer to every component of the Transformer architecture, greatly improving the model's flexibility and adaptability across different resource environments.

When the 4B-effective-parameter (E4B) model is trained with the MatFormer architecture, the system simultaneously optimizes a 2B-effective-parameter (E2B) sub-model nested inside it.

This architectural design brings two key capabilities to developers:

Pre-extracted models, ready to use. Depending on the application, developers can choose the full E4B main model for maximum performance or use the pre-extracted E2B sub-model directly. While maintaining accuracy, E2B delivers up to 2x faster inference, making it especially suitable for edge devices or compute-constrained scenarios.

Mix-n-Match custom models. To fit different hardware constraints, developers can use the Mix-n-Match method to customize model sizes anywhere between E2B and E4B. It builds models of various sizes by adjusting the hidden dimension of each layer's feed-forward network (e.g., from 8192 to 16384) and selectively skipping layers, as sketched below.
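To make the nesting idea concrete, here is a toy PyTorch sketch of a feed-forward block whose hidden width can be sliced at inference time, so the smaller configuration lives entirely inside the larger one's weights. The class, dimensions, and slicing scheme are illustrative assumptions, not Gemma 3n's actual implementation.

```python
import torch
import torch.nn as nn

class NestedFFN(nn.Module):
    """Toy feed-forward block with a sliceable hidden width, illustrating the
    Matryoshka-style nesting described above. Sizes are illustrative only."""
    def __init__(self, d_model=512, d_hidden_full=16384):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden_full)
        self.down = nn.Linear(d_hidden_full, d_model)

    def forward(self, x, d_hidden_active=16384):
        # Use only the first `d_hidden_active` hidden units: the smaller
        # sub-model is a slice of the larger one, so no extra weights are needed.
        h = torch.relu(x @ self.up.weight[:d_hidden_active].T + self.up.bias[:d_hidden_active])
        return h @ self.down.weight[:, :d_hidden_active].T + self.down.bias

x = torch.randn(1, 8, 512)
ffn = NestedFFN()
y_full = ffn(x, d_hidden_active=16384)   # full-width path ("E4B-like")
y_small = ffn(x, d_hidden_active=8192)   # sliced path ("E2B-like")
print(y_full.shape, y_small.shape)
```

Skipping layers, the other Mix-n-Match lever, would simply mean not calling some of these blocks at all.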

Meanwhile, Google has also released an auxiliary tool, MatFormer Lab, to help developers quickly select and extract the best-performing model configurations based on a range of benchmark results (such as MMLU).

Google says the MatFormer architecture also lays the groundwork for "elastic inference". Although this capability is not part of this release, the design is already taking shape: a single deployed E4B model will eventually be able to switch dynamically between the E4B and E2B inference paths at runtime, optimizing performance and memory usage in real time based on the current task and device load.

The key to significantly improving memory efficiency

In the latest Gemma 3n models, Google introduces a mechanism called Per-Layer Embeddings (PLE). Designed and optimized for on-device deployment, it significantly improves model quality without increasing the amount of high-speed memory required on device accelerators (such as GPUs/TPUs).

As a result, although the total parameter counts of E2B and E4B are 5B and 8B respectively, PLE allows a large share of those parameters (the embedding parameters distributed across the layers) to be loaded and computed efficiently on the CPU. Only the core Transformer weights (about 2B for E2B and about 4B for E4B) need to sit in the typically limited accelerator memory (VRAM).
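Conceptually, this is the kind of split PLE enables: the bulky per-layer embedding tables stay in ordinary host RAM, and only the handful of looked-up vectors cross over to the accelerator. The sketch below is a toy illustration under that assumption, not Gemma 3n's real code.

```python
import torch
import torch.nn as nn

# Toy sketch of the idea behind Per-Layer Embeddings (PLE): the large embedding
# table stays on the CPU, and only the small looked-up vectors move to the GPU.
# Sizes and structure are illustrative, not Gemma 3n's actual implementation.
device = "cuda" if torch.cuda.is_available() else "cpu"

vocab, d_ple, d_model = 32_000, 256, 512
ple_table = nn.Embedding(vocab, d_ple)            # stays in host RAM (CPU)
layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True).to(device)
proj = nn.Linear(d_ple, d_model).to(device)       # folds the PLE vector into the layer input

token_ids = torch.randint(0, vocab, (1, 16))      # CPU tensor
ple_vecs = ple_table(token_ids)                   # lookup happens on the CPU
hidden = torch.randn(1, 16, d_model, device=device)
hidden = layer(hidden + proj(ple_vecs.to(device)))  # only 16 small vectors cross into VRAM
print(hidden.shape)
```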

Significantly faster long-context processing

In many advanced on-device multimodal applications, handling long input sequences (such as content produced by audio and video streams) has become a core requirement. To address this, Gemma 3n introduces a KV Cache Sharing mechanism, which speeds up generation of the first token in long-context inference and is especially well suited to streaming response scenarios.

Specifically, KV Cache Sharing optimizes the model's prefill stage: the keys and values produced by the local and global attention in the middle layers are shared directly with all the layers above them. Compared with Gemma 3 4B, this improves prefill performance by up to 2x.
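The toy sketch below illustrates the general idea of reusing mid-layer keys and values in the layers above during prefill. It is a deliberate simplification (single head, no real caching machinery) rather than Gemma 3n's actual attention code.

```python
import torch
import torch.nn.functional as F

# Toy illustration of KV cache sharing during prefill: a "middle" layer computes
# keys/values once, and the "upper" layers reuse those tensors instead of
# computing their own K/V projections. Purely conceptual.
d = 64
x = torch.randn(1, 128, d)                      # prefill over 128 prompt tokens

def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)

# Middle layer: compute K/V once and keep them as the shared cache.
Wq_mid, Wk_mid, Wv_mid = (torch.randn(d, d) for _ in range(3))
shared_k, shared_v = x @ Wk_mid, x @ Wv_mid
x = attn(x @ Wq_mid, shared_k, shared_v)

# Upper layers: only a query projection is needed; K/V come from the shared
# cache, which cuts prefill compute and memory traffic.
for _ in range(2):
    Wq = torch.randn(d, d)
    x = attn(x @ Wq, shared_k, shared_v)
print(x.shape)
```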

A new vision encoder for better multimodal performance

Gemma 3n introduces a new, efficient vision encoder, MobileNet-V5-300M, to improve multimodal performance on edge devices.

MobileNet-V5 supports multiple input resolutions (256×256, 512×512, and 768×768), letting developers balance performance against image quality as needed. Trained on large-scale multimodal data, it handles a wide range of image and video understanding tasks and can process up to 60 frames per second in real time on Google Pixel devices.

This performance comes from several architectural innovations, including advanced building blocks based on MobileNet-V4, a deep pyramid architecture scaled up by as much as 10x, and a multi-scale fusion vision-language adapter. Compared with the undistilled SoViT in Gemma 3, MobileNet-V5-300M is up to 13x faster (with quantization) on the Google Pixel Edge TPU, uses 46% fewer parameters and a quarter of the memory, and delivers significantly higher accuracy.

Support for speech recognition and speech translation

For audio, Gemma 3n includes an advanced audio encoder based on the Universal Speech Model (USM). It generates one token for every 160 milliseconds of speech (roughly 6 tokens per second) and feeds them into the language model as input, giving the model a finer-grained representation of the audio context. This unlocks speech recognition and speech translation for on-device applications.
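A quick sanity check of that token rate, assuming the stated 160 ms granularity and a hypothetical 30-second clip:

```python
# One audio token per 160 ms of speech -> 6.25 tokens per second, so a
# hypothetical 30-second clip occupies roughly 187 tokens of context.
ms_per_token = 160
tokens_per_second = 1000 / ms_per_token      # 6.25
clip_seconds = 30                            # illustrative clip length
print(tokens_per_second, int(clip_seconds * tokens_per_second))  # 6.25 187
```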

Gemma 3n is reported to perform particularly well on translation between English and Spanish, French, Italian, and Portuguese. For speech translation tasks, combining it with chain-of-thought prompting can further improve translation quality and stability.

Reference links:

https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/

https://simonwillison.net/2025/Jun/26/gemma-3n/

This article is from the WeChat official account “AI Frontline”, compiled by Chu Xingjuan, and published by 36Kr with authorization.