
Requiring as little as 2GB of video memory, Google's open-source on-device model has set a new arena record and natively supports images and videos.

Quantum Bit (量子位) | 2025-06-27 19:27
The first model under 10B parameters to break 1300 points.

Google's open-source model has a new update.

Early this morning, Google officially announced Gemma 3n, which natively supports multiple modalities such as text, images, audio, and video.

In the large-model arena (LMArena), Gemma 3n scored 1303 points, becoming the first model under 10B parameters to exceed 1300.

Gemma 3n comes in two sizes, 5B (E2B) and 8B (E4B). Thanks to architectural innovations, however, their VRAM usage is comparable to that of 2B and 4B models, with a minimum requirement of only 2GB.

Some netizens commented that delivering this level of performance at such low memory usage is a big deal for edge devices.

Gemma 3n is currently available in Google AI Studio and through third-party tools such as Ollama and llama.cpp, and the model weights can be downloaded from Hugging Face.
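For reference, a minimal sketch of loading the weights from Hugging Face with the transformers library might look like the following; the repository id "google/gemma-3n-E4B-it" and the pipeline task are assumptions, so check the official collection (linked at the end) for the exact names.

```python
# Minimal sketch: loading a Gemma 3n checkpoint from Hugging Face with transformers.
# The model id below is an assumption -- verify it against the official collection.
# Gemma weights are gated, so you may need to accept the license and log in first.
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-3n-E4B-it",  # assumed repo id
    device_map="auto",               # place weights on GPU if one is available
)

out = pipe("Explain the MatFormer architecture in one sentence.", max_new_tokens=64)
print(out[0]["generated_text"])
```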

Google also disclosed some of the technical details behind Gemma 3n. Let's take a look.

Matryoshka-style Transformer Architecture

For the two Gemma 3n models, E2B and E4B, Google introduced the concept of "effective parameters": the "E" stands for effective.

The core of Gemma 3n is the MatFormer (Matryoshka Transformer) architecture, which is a nested Transformer structure specifically built for elastic inference.

As the name suggests, the structure resembles a matryoshka doll: a larger model contains smaller, fully functional versions of itself.

MatFormer extends the idea of Matryoshka Representation Learning from embeddings alone to all Transformer components.

Under this structure, the E2B sub-model is optimized simultaneously while the E4B model is being trained.

To allow finer-grained control for specific hardware constraints, Google also proposed the Mix-n-Match method. By adjusting each layer's feed-forward hidden dimension (from 8192 to 16384) and selectively skipping certain layers, it slices the parameters of the E4B model, creating a series of custom-sized models between E2B and E4B.
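To make the slicing idea concrete, here is a toy sketch that keeps only the first 8192 units of a feed-forward layer trained at the full 16384 width, yielding a nested sub-model. The feed-forward widths come from the article; everything else (model dimension, weight layout) is invented for illustration, and this is not Google's implementation.

```python
# Conceptual Mix-n-Match-style slicing: a nested FFN is obtained by keeping the
# first d_ff_small units of a layer trained at the full width. Illustrative only.
import torch
import torch.nn.functional as F

d_model, d_ff_full, d_ff_small = 2048, 16384, 8192  # d_model is made up

# Full-width FFN weights (as trained in the larger E4B model).
w_in_full = torch.randn(d_ff_full, d_model)
w_out_full = torch.randn(d_model, d_ff_full)

# "Slice out" the nested, smaller FFN by keeping the first d_ff_small units.
w_in_small = w_in_full[:d_ff_small, :]
w_out_small = w_out_full[:, :d_ff_small]

def ffn(x, w_in, w_out):
    # Simplified feed-forward block: up-projection, GELU, down-projection.
    return F.gelu(x @ w_in.T) @ w_out.T

x = torch.randn(1, d_model)
print(ffn(x, w_in_full, w_out_full).shape)    # full-width path
print(ffn(x, w_in_small, w_out_small).shape)  # sliced, narrower path
```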

To support this, Google will also release MatFormer Lab, a tool for retrieving the best model configurations.

Designed Specifically for Edge Devices

The raw parameter counts of Gemma 3n's E2B and E4B models are 5B and 8B respectively, but their memory consumption is comparable to that of 2B and 4B models. This low-memory design is meant to better fit edge devices.

To this end, Gemma 3n uses Per-Layer Embedding (PLE), a technique that significantly improves model quality without increasing the memory footprint on the accelerator.

PLE allows a large portion of the parameters (the embeddings associated with each layer) to be loaded and computed efficiently on the CPU, so only the core Transformer weights need to sit in accelerator memory (VRAM).
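Conceptually, the split can be pictured as in the toy sketch below: the per-layer embedding tables stay in CPU RAM, only the core layers live in VRAM, and just the small per-token activations cross over. The shapes and the projection-and-add mixing step are assumptions for illustration, not how the Gemma 3n runtime actually wires PLE in.

```python
# Toy sketch of PLE-style device placement: bulky per-layer embedding tables are
# kept and computed on the CPU; only core transformer weights occupy VRAM.
import torch

vocab, d_ple, d_model, n_layers = 32000, 256, 2048, 4  # illustrative sizes
device = "cuda" if torch.cuda.is_available() else "cpu"

# Per-layer embedding tables and their projections: looked up / applied on CPU.
ple_tables = [torch.randn(vocab, d_ple) for _ in range(n_layers)]
ple_proj = [torch.nn.Linear(d_ple, d_model) for _ in range(n_layers)]

# Core transformer weights: the only large tensors placed in accelerator memory.
core_layers = [torch.nn.Linear(d_model, d_model).to(device) for _ in range(n_layers)]

token_ids = torch.tensor([[12, 345, 6789]])
h = torch.randn(1, token_ids.shape[1], d_model, device=device)

for layer, table, proj in zip(core_layers, ple_tables, ple_proj):
    ple = proj(table[token_ids])   # CPU: embedding lookup + projection
    h = layer(h) + ple.to(device)  # only the small activation moves to VRAM
```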

In addition, to shorten time-to-first-token and better handle long-sequence inputs, Gemma 3n introduces KV Cache Sharing.

Specifically, Gemma 3n optimizes prefill: the keys and values from the intermediate layers of the local and global attention mechanisms are shared directly with all the top layers. Compared with Gemma 3 4B, prefill performance improves by 2x.
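The idea can be sketched roughly as follows: during prefill, the keys and values produced at an intermediate layer are reused by the layers above it instead of being recomputed per layer. The layer indices, shapes, and cache layout below are invented for illustration and do not reflect Gemma 3n's actual implementation.

```python
# Toy sketch of KV cache sharing during prefill: top layers reuse the K/V
# tensors produced by an intermediate layer rather than computing their own.
import torch

n_layers, share_from, d = 8, 4, 64        # made-up sizes
x = torch.randn(1, 128, d)                # prompt activations during prefill

kv_cache = {}
shared_kv = None
for i in range(n_layers):
    if i < share_from:
        # Stand-ins for each lower layer's own key/value projections.
        k, v = torch.randn_like(x), torch.randn_like(x)
        kv_cache[i] = (k, v)
        if i == share_from - 1:
            shared_kv = (k, v)            # K/V from the last intermediate layer
    else:
        kv_cache[i] = shared_kv           # top layers reuse the shared K/V

# All top layers point at the same tensors, saving prefill compute and memory.
print(all(kv_cache[i] is shared_kv for i in range(share_from, n_layers)))  # True
```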

Native Support for Multiple Modalities

Gemma 3n natively supports multiple input modalities such as images, audio, and video.

On the audio side, Gemma 3n uses an advanced audio encoder based on USM (Universal Speech Model). The encoder turns every 160 milliseconds of audio into a token, which is then fed into the language model as input.

It supports automatic speech recognition (ASR) and automatic speech translation (AST), enabling high-quality speech-to-text transcription directly on device, as well as translation of spoken language into text in another language.

At launch, Gemma 3n's audio encoder handles audio segments of up to 30 seconds. The underlying encoder is a streaming encoder, however, and can process audio of arbitrary length with additional long-audio training.
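At one token per 160 milliseconds, a 30-second clip therefore maps to roughly 188 audio tokens, as the quick calculation below shows.

```python
# Back-of-the-envelope audio token count: one token per 160 ms of audio.
import math

clip_seconds = 30
ms_per_token = 160
tokens = math.ceil(clip_seconds * 1000 / ms_per_token)
print(tokens)  # 188 tokens for a 30-second clip (30000 ms / 160 ms = 187.5)
```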

On the vision side, Gemma 3n uses a new, efficient vision encoder, MobileNet-V5-300M.

On device, it handles resolutions of 256x256, 512x512, and 768x768 pixels, reaches 60 frames per second on Google Pixel, and performs well across a range of image and video understanding tasks.

MobileNet-V5 builds on MobileNet-V4 but significantly scales up the architecture: it uses a hybrid-depth pyramid model 10 times larger than the largest MobileNet-V4 variant, and it introduces a novel multi-scale fusion VLM adapter.

Google plans to release a technical report on MobileNet-V5 later, covering the model architecture, data scaling strategy, and the underlying data distillation techniques.

Reference Links:

https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide/

Hugging Face:

https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4

This article is from the WeChat official account “Quantum Bit”. Author: Cressey. Republished by 36Kr with permission.