
Google's new open-source 0.3B model can run even when the phone is offline, and just 0.2GB of memory is enough.

ZDONGXI, 2025-09-05 15:13
Small model, big gains! Google's 0.3B model performs nearly on par with 0.6B models, making multilingual RAG accessible anytime.

According to a report from ZDONGXI on September 5th, Google has today open-sourced a new open embedding model called EmbeddingGemma. The model punches above its weight with 308 million parameters and is designed specifically for edge-side AI, supporting the deployment of applications such as Retrieval-Augmented Generation (RAG) and semantic search on devices like laptops and mobile phones.

One of EmbeddingGemma's key features is its ability to generate high-quality embedding vectors while preserving privacy. It operates normally even without an internet connection, and its performance is comparable to that of Qwen-Embedding-0.6B, a model twice its size.

▲ Screenshot of the Hugging Face open-source page

Hugging Face link: https://huggingface.co/collections/google/embeddinggemma-68b9ae3a72a82f0562a80dc4

According to Google, EmbeddingGemma has the following highlights:

1. Best in its class: EmbeddingGemma ranks highest among open multilingual text embedding models under 500M parameters on the Massive Text Embedding Benchmark (MTEB). Built on the Gemma 3 architecture, it has been trained on over 100 languages, and it is compact enough to run in less than 200MB of memory after quantization.

▲ MTEB score: EmbeddingGemma's performance is comparable to that of top models twice its size

2. Designed for flexible offline work: It is small, fast, and efficient, offering customizable output sizes and a 2K token context window. It can run on everyday devices such as mobile phones, laptops, and desktops. It is designed to work in conjunction with Gemma 3n to unlock new use cases for mobile RAG pipelines, semantic search, etc.

3. Integrated with popular tools: To make it easy to get started, EmbeddingGemma already works with popular tools such as sentence-transformers, llama.cpp, MLX, Ollama, LiteRT, transformers.js, LMStudio, Weaviate, Cloudflare, LlamaIndex, and LangChain.

01. Capable of generating high-quality embedding vectors for more accurate edge-side RAG answers

EmbeddingGemma generates embedding vectors: in the context of this article, it converts text into numerical vectors that represent the text's semantic meaning in a high-dimensional space. The higher the quality of the embedding vectors, the better they capture linguistic nuance and complexity.

▲ EmbeddingGemma generates embedding vectors
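As a concrete illustration, here is a minimal sketch of generating embeddings with sentence-transformers, one of the tools in the article's list. The model id "google/embeddinggemma-300m" is an assumption based on the Hugging Face collection linked above; check the collection page for the exact name.

```python
from sentence_transformers import SentenceTransformer

# Model id assumed from the Hugging Face collection; verify before use.
model = SentenceTransformer("google/embeddinggemma-300m")

sentences = [
    "The carpenter fixed the damaged floor.",
    "I need someone to repair my hardwood flooring.",
]
embeddings = model.encode(sentences)  # numpy array of shape (2, 768)
print(embeddings.shape)
```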

There are two key stages in building a RAG process: first, retrieving relevant context based on user input, and second, generating well-founded answers based on that context.

To implement the retrieval function, users can first generate an embedding vector for the prompt and then calculate the similarity between this vector and all document embedding vectors in the system. In this way, the text fragments most relevant to the user's query can be obtained.
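In code, this retrieval step is just a similarity search over embeddings. A minimal sketch, reusing `model` from the previous snippet and cosine similarity over a toy in-memory corpus:

```python
import numpy as np

docs = [
    "Joe's Carpentry - hardwood floor repair, call 555-0102.",
    "Weekly grocery list: milk, eggs, bread.",
    "Plumbing services available on weekends.",
]
doc_vecs = model.encode(docs)                        # (3, 768)
query_vec = model.encode(["Who can fix my damaged floor?"])[0]

def normalize(v):
    # Cosine similarity is the dot product of L2-normalized vectors.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

scores = normalize(doc_vecs) @ normalize(query_vec)  # one score per document
print(docs[int(np.argmax(scores))])                  # expected: the carpentry listing
```

A real application would precompute and store the document embeddings rather than encoding them on every query.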

Subsequently, users can input these text fragments along with the original query into a generative model (such as Gemma 3) to generate relevant answers that fit the context. For example, the model can understand that you need the phone number of a carpenter to fix a damaged floor.
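Continuing the sketch, the generation step is essentially prompt assembly: the retrieved fragment and the original query are packed into a grounded prompt. The final call below is a placeholder, since how you invoke Gemma 3 (Ollama, transformers, llama.cpp, ...) depends on your stack:

```python
query = "Who can fix my damaged floor?"
context = docs[int(np.argmax(scores))]  # top-scoring fragment from the retrieval step

prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\n"
)
# answer = gemma3.generate(prompt)  # placeholder, not a real API call
```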

For this RAG process to be effective, the quality of the initial retrieval step is crucial. Poor-quality embedding vectors can lead to the retrieval of irrelevant documents, resulting in inaccurate or meaningless answers.

This is where EmbeddingGemma's performance advantage lies: it provides high-quality text representations, offering core support for precise and reliable edge-side applications.

02. Punching above its weight, with performance approaching that of Qwen-Embedding-0.6B, which is twice its size

EmbeddingGemma offers state-of-the-art text understanding for its size, with particularly strong performance in multilingual embedding generation.

Compared with other popular embedding models, EmbeddingGemma performs excellently in tasks such as retrieval, classification, and clustering.

EmbeddingGemma outperforms gte-multilingual-base, a model of the same size, on Mean (Task), Retrieval, Classification, and Clustering. Its results also approach those of Qwen-Embedding-0.6B, which is twice its size.

▲ Evaluation results of EmbeddingGemma

The EmbeddingGemma model has 308M parameters, mainly consisting of approximately 100M model parameters and 200M embedding parameters.

To achieve greater flexibility, EmbeddingGemma uses Matryoshka Representation Learning (MRL) to offer multiple embedding sizes within a single model. Developers can use the full 768-dimensional vector for the best quality or truncate it to smaller dimensions (128, 256, or 512) to improve speed and reduce storage costs.
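In practice, MRL truncation means taking a prefix of the full vector and re-normalizing it; for Matryoshka-trained models the leading dimensions carry the most information, so the prefix remains a usable embedding. A minimal sketch, reusing `model` from earlier (recent sentence-transformers releases also expose a `truncate_dim` option that handles this for you):

```python
import numpy as np

full = model.encode(["offline semantic search"])[0]  # (768,)
for dim in (512, 256, 128):
    small = full[:dim]
    small = small / np.linalg.norm(small)  # re-normalize after truncation
    print(dim, small.shape)
```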

Google has reduced embedding inference time (for 256 input tokens) to under 15ms on EdgeTPU, breaking the speed barrier. This means AI features can respond in real time, enabling smooth, instant interaction.

Using Quantization-Aware Training (QAT), Google has significantly reduced the RAM usage to below 200MB while maintaining the model's quality.

03. Usable offline and can run on less than 200MB of memory

EmbeddingGemma supports developers in building flexible and privacy-focused device-side applications. It generates document embeddings directly on the device hardware, helping to ensure the security of sensitive user data.

It uses the same tokenizer as Gemma 3n for text processing, thereby reducing the memory footprint of RAG applications. Users can unlock new functions with EmbeddingGemma, such as:

Searching personal files, texts, emails, and notifications simultaneously without an internet connection.

Implementing personalized, industry-specific, and offline-supported chatbots through RAG with Gemma 3n.

Classifying user queries into relevant function calls, helping mobile agents understand user needs (a minimal sketch follows this list).
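As a rough illustration of that last point, query-to-function routing can itself be framed as embedding similarity: embed a short description of each available function once, then match each incoming query against them. The function names and descriptions below are hypothetical:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical function catalog for a mobile agent.
functions = {
    "set_alarm":    "Set or change an alarm for a given time.",
    "send_text":    "Send a text message to a contact.",
    "find_contact": "Look up a phone number in the user's saved contacts.",
}
func_vecs = model.encode(list(functions.values()))  # `model` from the first snippet

query = "What's the carpenter's number again?"
scores = normalize(func_vecs) @ normalize(model.encode([query])[0])
print(list(functions)[int(np.argmax(scores))])      # expected: "find_contact"
```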

The following is an interactive demonstration of EmbeddingGemma, which visualizes text embeddings in a three-dimensional space. The model runs entirely on the device.

▲ Interactive demonstration of EmbeddingGemma (Image source: Joshua from the Hugging Face team)

Demo experience link: https://huggingface.co/spaces/webml-community/semantic-galaxy

04. Conclusion: Small size, big capabilities, accelerating the development of edge-side intelligence

The launch of EmbeddingGemma marks a new breakthrough for Google in compact, multilingual, edge-side AI. It not only approaches the performance of larger models but also strikes a balance between speed, memory, and privacy.

In the future, as applications such as RAG and semantic search continue to penetrate personal devices, EmbeddingGemma may become an important cornerstone for promoting the popularization of edge-side intelligence.

This article is from the WeChat official account “ZDONGXI” (ID: zhidxcom). Author: Li Shuiqing. Republished by 36Kr with permission.