Running Gemma 4 locally on the iPhone has become a hot topic. How far off is the era of zero token spend?
MachineHeart Editorial Department
Gemma 4, the model Google open-sourced a few days ago, gave the industry a big surprise.
It adopts the same technical architecture as Gemini 3, supports native multimodality, ranks third globally on the Arena AI leaderboard, and comes in multiple sizes. The smaller models, E2B (2.3B effective parameters) and E4B (4.5B effective parameters), can be deployed and run directly on a phone, with a context window of up to 128K. It is, in effect, a "pocket-sized alternative to Gemini".
As expected, the model quickly became a new toy for mobile phone users after its release.
Among them, a post by one X user drew hundreds of thousands of views. In the video he attached, he showed Gemma 4 running locally on his iPhone: processing images, handling audio, even toggling the flashlight. Gemma 4, he said, was incredibly fast; it felt like magic.
Someone quantified that speed on an iPhone 17 Pro: on an Apple chip, with the help of MLX (Apple's machine-learning framework, optimized for its own silicon), the model's inference speed can exceed 40 tokens per second.
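The "40 tokens per second" figure is decode throughput, which anyone can measure by timing a streaming generation loop. Here is a minimal, framework-free sketch: the `fake_stream` generator below is a stand-in for whatever streaming interface your runtime (MLX or otherwise) exposes, so the numbers it produces are illustrative only.

```python
import time

def tokens_per_second(token_stream):
    """Consume a stream of generated tokens and report decode throughput."""
    start = time.perf_counter()
    count = 0
    for _ in token_stream:   # each item is one decoded token
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

def fake_stream(n=200, delay=0.001):
    """Dummy generator standing in for a real model's token stream."""
    for i in range(n):
        time.sleep(delay)    # simulate per-token decode latency
        yield f"tok{i}"

rate = tokens_per_second(fake_stream())
```

Wrapping the same timer around a real model's stream is how the per-device comparisons in these posts are typically made.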
Someone else reached a similar speed on a Samsung Galaxy, even with thinking mode enabled, prompting exclamations that it was "unbelievably fast".
At that speed, running AI models directly on the phone becomes a genuinely acceptable option, and one that is especially useful in privacy-sensitive scenarios such as healthcare.
The 128K context window also makes these small models more attractive.
So how do you actually run it? It is quite simple, and by no means reserved for geeks, because Google has released an official app: Google AI Edge Gallery. Anyone who wants to try it can download the app, download the model variant they want to run, and open it; the model runs right away.
And since the app is officially released by Google, there is little need to worry about security.
Beyond these phone-sized models, some people have tried larger versions of Gemma 4 on more powerful hardware, for example running the Gemma 4 Mixture-of-Experts 26B on an M5 Pro MacBook Pro.
In plain conversation the model is still very fast, and text generation and code explanation are smooth.
Problems arise, however, when Gemma 4 is used as a coding agent. Running an agent demands a large context (the context window of Gemma 4 26B is 256K), complex prompts, and stable tool calls, and Gemma 4 clearly struggles here: it often gets stuck, throws errors, or produces malformed output structures.
The turning point came when he switched the model to Qwen3-Coder. In the same environment, file creation, command execution, and multi-step tasks all ran normally. His conclusion: the problem is not the agent framework but whether the model itself has been optimized for tool calls plus structured output. Gemma 4 may simply not have done enough here, or perhaps this developer has not found the right way to use it.
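The "tool calls + structured output" requirement boils down to this: the agent framework expects the model to emit machine-parseable calls, typically JSON with a fixed shape, and a model that drifts from that shape breaks the loop. A minimal illustrative validator is sketched below; the `{"tool": ..., "arguments": ...}` schema is made up for illustration, as real agent frameworks each define their own.

```python
import json

def parse_tool_call(model_output: str):
    """Return (tool_name, arguments) if the model emitted a well-formed
    call, otherwise None — the failure mode the developer above kept
    hitting when Gemma 4 produced free text or a broken structure."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None                       # not even valid JSON
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    args = call.get("arguments")
    if not isinstance(tool, str) or not isinstance(args, dict):
        return None                       # valid JSON, wrong structure
    return tool, args

# A well-formed call parses; a chatty free-text reply is rejected:
ok = parse_tool_call('{"tool": "create_file", "arguments": {"path": "a.txt"}}')
bad = parse_tool_call('Sure! I will create the file for you.')
```

When every second or third model reply lands in the `None` branch, the agent stalls or errors out, which matches the behavior described above; a model tuned for structured output keeps nearly all replies in the first branch.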
Others, meanwhile, say Gemma 4 falls somewhat short on raw intelligence.
Even so, the arrival of a high-performance workhorse like Gemma 4 should not be underestimated. If, in the future, a large share of everyday queries, chats, simple inference, code generation, and image-understanding tasks can run locally with no tokens to buy, won't the vendors whose business is selling tokens find themselves in an awkward position?
Of course, things are not that dire yet. A gap remains between today's open-source models and the cutting-edge closed-source flagships, and most of the competitive open-source models are still constrained by hardware, unable for now to reach a truly usable level on the edge.
The trend, however, is clear. In the short term, cloud-based closed-source models still lead on the most demanding complex reasoning and large-scale multi-agent collaboration. In the long term, as hardware improves and quantization techniques mature, edge-side models will gradually eat into the high-frequency, simple tasks now served from the cloud.
Vendors that rely solely on selling tokens and API subscriptions will be forced to compete ever harder on the genuinely difficult parts: super-capable agents, ultra-long yet reliable contexts, and proprietary capabilities that demand large volumes of real-time data.
Gemma 4 is just the beginning. The next surprise may be an edge-side model that leaves users entirely unable to tell "local" from "cloud" in daily use. When that day comes, the business model of the whole AI industry will face a real reshuffle.
This article is from the WeChat official account "MachineHeart" (ID: almosthuman2014), authored by MachineHeart, and published by 36Kr with authorization.