Hands-on with Google Gemma 3n: performance is clearly uneven, but this is the answer for on-device large models
To be honest, China's large-model scene has been rather quiet lately.
Take the much-anticipated DeepSeek-R2: apart from some half-true rumors, there has been no news at all, and at this rate it feels like it may not launch for another half year.
The so-called "Four AI Dragons" that were competing so fiercely last year seem to have run out of steam this year. Word is they are all quietly grinding away at their own projects, but nothing has surfaced yet.
As for the big tech companies, their iteration pace has slowed and they are putting more effort into applications: Doubao has launched its 1.6 model, but its promotion focuses more on TRAE and Kouzi Space; iFlytek is betting on AI education and office agents; Baidu is pushing full-workflow AI photo editing and asset management. Each has its own strategy.
Generally speaking, these applications are quite practical, but none of them is truly jaw-dropping.
Online large models have made little progress, and local ones are even more stagnant. Mistral AI, once a regular updater, has been silent for nearly half a year, and there is no news at all about models built for mobile devices. As for the "AI phones" that have been hyped for two or three years, over 90% of their features still rely on the cloud.
(Image source: Google)
Google thought: This won't do. What about my Pixel series?
Last week, Google DeepMind officially announced on Twitter the release and open-sourcing of a new on-device multimodal model, Gemma 3n.
Google says Gemma 3n represents a major advance in mobile AI: it brings powerful multimodal capabilities to on-device platforms such as phones, tablets, and laptops, giving users the kind of efficient processing previously reserved for advanced cloud models.
Another case of a small model taking on the big ones? Quite interesting.
To see the true capabilities of this model, I (Xiaolei) downloaded the latest model released by Google for testing. Next, I'll share the highlights with you.
Google wants to "take on big models with a small one"
First, let's answer a few questions:
First, what is Gemma 3n?
Gemma 3n is a lightweight on-device model that Google built on the MatFormer architecture, whose nested "Matryoshka" structure is designed for low memory consumption. Google has released two variants so far: E2B (about 5B raw parameters) and E4B (about 8B). Thanks to the architectural innovation, however, their memory footprint is comparable to conventional 2B and 4B models, with a minimum requirement of only 2GB.
(Image source: Google)
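To make the "nested" idea concrete, here is a toy sketch of a MatFormer-style feed-forward block. This is my own illustration, not Google's implementation: the point is simply that a smaller sub-model can be a prefix slice of the full model's hidden dimension, so one set of weights serves several effective sizes.

```python
import torch
import torch.nn as nn

class MatFormerFFN(nn.Module):
    """Toy MatFormer-style nested feed-forward block (illustrative only).

    Smaller sub-models are prefix slices of the full hidden dimension,
    so the same weights can run at several compute/memory budgets.
    """

    def __init__(self, d_model: int = 512, d_hidden: int = 2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor, active_dim: int) -> torch.Tensor:
        # Activate only the first `active_dim` hidden units: the nested sub-model.
        h = torch.relu(x @ self.up.weight[:active_dim].T + self.up.bias[:active_dim])
        return h @ self.down.weight[:, :active_dim].T + self.down.bias

ffn = MatFormerFFN()
x = torch.randn(1, 512)
full = ffn(x, active_dim=2048)   # full-width pass ("E4B-like")
small = ffn(x, active_dim=1024)  # nested pass, less compute and memory ("E2B-like")
```

That, roughly, is why an 8B-parameter checkpoint can run in the memory of a 4B model: the device only ever activates the slice it can afford.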
Second, what can Gemma 3n do?
Unlike conventional text-only models, Gemma 3n natively supports image, audio, and video input. It can not only handle automatic speech recognition (ASR) and automatic speech translation (AST) but also complete a range of image and video understanding tasks.
(Image source: Google)
Its native multimodal and multilingual design makes it very suitable for mobile devices.
Finally, how can I use Gemma 3n?
Six months ago, deploying an on-device model on a phone was an extremely fiddly affair, often requiring a Linux virtual machine; Lei Technology even published a tutorial on it. So the question is a fair one.
But now, that's no longer necessary.
(Image source: Google)
Google quietly launched a new app last month called Google AI Edge Gallery. It lets users run open-source AI models from Hugging Face directly on their phones, and it is Google's first attempt to bring lightweight AI inference to local devices.
The app is currently available on Android; interested readers can grab it from GitHub. After loading a model, you can use the app for conversational AI, image understanding, and a prompt lab, and even import custom models in LiteRT format.
It's that simple. You can directly use the phone's local computing power to complete tasks without an internet connection.
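The app targets phones, but if you just want to poke at Gemma 3n from a desktop Python environment, Google also publishes the checkpoints on Hugging Face. A minimal sketch, assuming the google/gemma-3n-E4B-it checkpoint id and a transformers release recent enough to include Gemma 3n support:

```python
# pip install -U transformers accelerate
import torch
from transformers import pipeline

# Checkpoint id assumed from Google's Hugging Face listing; adjust if it differs.
# Gemma 3n is a multimodal model, hence the image-text-to-text pipeline task.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": [
    {"type": "text", "text": "Introduce yourself in one sentence."},
]}]
out = pipe(text=messages, max_new_tokens=64)
print(out[0]["generated_text"][-1]["content"])
```

On a phone, the same model runs through LiteRT inside the Gallery app instead; the Python route is only a convenient way to sanity-check the weights.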
Actual test: More suitable for mobile devices
Next, it's time for the highly anticipated testing session.
As shown in the picture, Google has prepared four models for the app by default: Google's own Gemma series and Alibaba's Qwen (Tongyi Qianwen) series. We selected the strongest of the bunch, Gemma 3n-4B, along with Qwen2.5-1.5B and an additionally deployed Qwen3-4B GGUF, for testing.
First, the classic strawberry question:
Q: How many letters "r" are there in the word "Strawberry"?
This question may seem simple, but it has actually stumped many large AI models.
In the actual test, Gemma 3n-4B and Qwen2.5-1.5B, which lack deep-thinking capability, still answered "2". Qwen3-4B GGUF, which supports deep thinking, did give the correct answer "3", but its repetitive, unnecessary deliberation meant the answer took two and a half minutes to generate.
(Image source: Lei Technology, from left to right: Qwen2.5, Gemma 3n, Qwen3)
The results make one thing clear: shrinking the parameter count sharply weakens a model's logical reasoning. Deep thinking can reduce hallucinations to some extent, but it also stretches out generation time.
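For reference, the ground truth is trivial to verify in two lines of Python:

```python
word = "Strawberry"
print(word.lower().count("r"))  # -> 3: st(r)awbe(r)(r)y
```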
Next, a relatively simple trick question:
Q: What is the previous line of the poem "I plant beans beneath the Southern Hill"?
In fact, this is the opening line of Tao Yuanming's poem "Returning to My Fields, Part III"; there is no preceding line. The question tests whether these small-parameter models will fabricate an answer rather than admit the premise is wrong.
Interestingly, only Qwen2.5-1.5B reproduced the original poem this time, though even it never stated outright that no preceding line exists. Qwen3-4B GGUF answered off-topic, and Gemma 3n-4B invented non-existent verses that did not even follow the metre of classical poetry.
(Image source: Lei Technology)
Then, a geographical common-sense question:
Q: A scholar set up a tent in the wild. Suddenly, he encountered a bear. Frightened, he started running. He first ran 10 kilometers south, then 10 kilometers east, and finally 10 kilometers north. To his surprise, he found himself back at the place where he set up the tent. Question: What color was the bear the scholar encountered?
This question mainly tests the model's grasp of a special geographic case. The only place that fits the scholar's path is the North Pole: heading 10 kilometers east there traces an arc around the pole, so 10 kilometers back north returns him to his tent. The bear is therefore a white polar bear.
As it turned out, Qwen2.5-1.5B produced a muddled analysis and the wrong answer, while Gemma 3n-4B and Qwen3-4B GGUF both got it right. Note, though, that Qwen3-4B GGUF never finished generating its answer because it burned too many tokens on thinking, a recurring problem throughout the test.
(Image source: Lei Technology)
Then, a simple text-processing task.
Specifically, I provided a roughly 600-word article introduction and asked each model to produce a summary.
Both Gemma 3n-4B and Qwen3-4B GGUF completed the task, but since Gemma 3n-4B is natively English, its summary came back in English, whereas Qwen3-4B GGUF could deliver a Chinese one.
(Image source: Lei Technology)
As for Qwen2.5-1.5B, the smallest of the three, it failed to produce any response at all.
Across these four rounds of testing, Gemma 3n-4B and Qwen3-4B GGUF are roughly on par in text processing and logical reasoning, but Gemma 3n-4B is clearly ahead in generation speed and response success rate. Deep thinking is evidently a poor fit for local models.
However, Gemma 3n is not just a text model; it is that rare small-parameter multimodal model.
Although Google AI Edge Gallery cannot yet invoke speech recognition, it does support image recognition: tap the "Ask Image" option to take or upload a photo and put questions to Gemma 3n.
(Image source: Lei Technology)
In the actual test, the current Gemma 3n knows nothing about anime characters, and tasks such as flower identification are not very accurate. It reliably recognizes only common subjects like food and hardware, and its identification of individual elements within a picture is imprecise.
But at the very least, Gemma 3n has genuinely delivered a multimodal design for mobile devices.
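To reproduce the "Ask Image" flow outside the app, the same assumed Hugging Face checkpoint accepts image-plus-text chat messages. A hedged sketch, where both the checkpoint id and the local file flower.jpg are assumptions for illustration:

```python
import torch
from transformers import pipeline

# Same assumed checkpoint as above; flower.jpg is a hypothetical local photo.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": [
    {"type": "image", "image": "flower.jpg"},  # local path, URL, or PIL image
    {"type": "text", "text": "What flower is this? Answer briefly."},
]}]
out = pipe(text=messages, max_new_tokens=48)
print(out[0]["generated_text"][-1]["content"])
```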
Obvious weaknesses, but a promising future
Well, after several days of testing, it's time to draw a conclusion about Google's Gemma 3n.
Generally speaking, it gives me the impression of "having obvious weaknesses but a promising future".
Its basic text answering and logic are merely average: in some logic tests it clearly trails Qwen3-4B, which supports deep thinking, though it remains markedly better than the Qwen2.5-1.5B commonly found on phones.
But its advantages are just as obvious: it is fast. Gemma 3n-4B responds significantly faster than Qwen3-4B, and without deep thinking it demands less of the hardware, runs more stably, and achieves essentially a 100% generation success rate.
(Image source: Google)
As for whether the answers are correct... that comes down to the model's raw capability.
As for its core selling point, offline image recognition: the capability is real but stops at "basic". It can recognize objects and extract text, yet it struggles to understand complex scenes. And because its native language is English, it can glitch when handling complex Chinese, which is worth noting.
Generally speaking, Gemma 3n does not deliver a revolutionary experience; it is more like a cautious compromise between performance and versatility.
This is probably the characteristic limitation of on-device small models at this stage: they can do a little of everything, but excel at nothing in particular.