
Offline Intelligence: When Will the DeepSeek Moment Arrive?

GeekPark · 2025-07-27 08:52
As cloud-based models make rapid progress, how can we achieve truly offline intelligence?

In the past two years, stories about AI models have mostly revolved around two versions: the all-powerful cloud and the imaginative edge.

Once, a widely envisioned industry blueprint held that as lightweight models kept getting stronger, it was only a matter of time before AI would break free from the constraints of the cloud and deliver ever-present, offline personal intelligence on everyone's devices.

However, after the hype, an embarrassing reality lies before us: whether it's the recently popular AI toys or the highly anticipated AI glasses, their core interactions and intelligence still rely firmly on the cloud. Even among mobile phones and PCs with far greater computing power, only a handful have truly achieved offline AI capabilities.

In technology demonstrations, edge-side models seem all-powerful. So why, in the end, can't the promised offline intelligence work without the internet?

On one hand, there is users' strong desire for a better experience: instant responses without waiting, private data that never leaves the device, and no interruption during network outages. On the other hand, there is the unavoidable "physical ceiling" of edge devices—limited computing power, power budgets, and memory—which, like an invisible wall, cruelly blocks the deployment of most high-performance models.

A deeper contradiction lies in the pull of business. For tech giants with the most powerful models, the cloud is not only a benchmark to showcase technological leadership but also a toll station generating huge profits. When all the attention and resources are focused on the cloud, the edge, which is more difficult, more labor-intensive, and offers less clear commercial returns, naturally becomes an overlooked corner.

So what exactly are the few who are truly committed to promoting "offline intelligence" doing? At this year's World Artificial Intelligence Conference (WAIC), a company called RockAI gave its answer. It is walking a less-traveled path and has found a key to break the deadlock.

With the mission of "enabling every device to have its own exclusive intelligence," this team delved into underlying technologies. They even boldly abandoned the mainstream Transformer architecture and managed to tackle the "mission impossible" of edge-side deployment. In the early days, their model could run entirely on a Raspberry Pi with limited computing power. This credit-card-sized computer has always been a strict test for edge-side deployment: most comparable models get stuck after generating just a few sentences on it.

The Yan 2.0 Preview launched at this year's WAIC has only 3 billion parameters, yet it already achieves multi-modality and true "memory" locally: the model can dynamically adjust its weights, retaining and updating user preferences over the long term.

The results of this "mission impossible" are not limited to laboratory demonstrations. Mass-production orders have come in from both domestic and overseas markets, quickly converting technological strength into commercial value.

Their story may answer a fundamental question: when cloud-based models are advancing so rapidly, why do we still need true offline intelligence, and how can we achieve it?

GeekPark interviewed Zou Jiasi, co-founder of RockAI, to talk about the business story behind the company.

01

Why don't we have a never - offline personal AI yet?

Question: The entire industry seems to be working towards a future of offline intelligence, and tech giants like Apple regard it as a core strategy. So why can't we bridge the "last mile" from technology demonstrations to consumers' hands?

Zou Jiasi: Everyone is talking about offline intelligence and edge-device AI. However, between the ideal and the reality stand two almost insurmountable mountains: one is computing power, the other is power consumption.

Running large models on devices requires high-end computing configurations. Many AI companies in the industry, even with relatively small-parameter models, still need high-compute chips to run them.

For example, one of our customers wanted to put an offline large model on their mobile phones. However, the solutions proposed by other large-model vendors in the industry almost invariably required Qualcomm's latest flagship chips and more than 16 GB of memory. In reality, most smart devices simply do not have such high-end chips.

This is the cruel computing-power gap: no matter how advanced your AI technology is, if it can only run on a few top-of-the-line devices, it loses the meaning of inclusive AI.

The other mountain is power consumption.

This problem is most evident in mobile phones. In practice, whenever phone manufacturers try to deploy large models, the devices overheat severely. This is almost a universal problem for models based on the traditional Transformer architecture. Nearly all mainstream phone manufacturers have discussed this pain point with us. They all want to make breakthroughs in next-generation AI phones but are blocked by this power-consumption wall.

Why can't we bridge the last mile?

The fact is, the pace of hardware updates is objectively slow. Many devices were sold years ago, and their chips, storage, microphones, and cameras were never designed for today's large models. Deploying the Transformer architecture on these mid- and low-end devices either won't work at all or produces poor results.

Even when upstream manufacturers launch a new generation of high-end chips, it usually takes 6-12 months to integrate them into new product lines, and another 1-2 years for those products to become popular, ship at scale, and be widely adopted. This pace is an objective physical reality and cannot be skipped.

Question: You just mentioned that many problems, whether related to computing power or power consumption, ultimately point to the current mainstream Transformer architecture. The Transformer has proven itself the most powerful AI architecture in the cloud. Why does it fail to adapt when moved to edge devices?

Zou Jiasi: This question hits the core of the challenge of running large models on edge devices. The power of the Transformer lies in its revolutionary Attention mechanism. But that is also where the problem lies.

Traditional AI models are like assembly-line workers, processing information one piece at a time in sequence, with limited memory, often forgetting what they processed earlier. The Transformer, in contrast, is like a super-capable commander. Instead of processing information sequentially, it lays the information out in a matrix and requires each word to "shake hands" with every other word to calculate their correlations.

This "global hand-shaking" ability gives the Transformer extraordinary understanding capabilities. And in the cloud, you have effectively unlimited computing power to support such calculations.

However, mobile phone chips (CPU/NPU) are designed more like the aforementioned "assembly line," good at executing tasks at high speed and in sequence. Suddenly asking them to complete a task that requires "global hand-shaking"—where the total computational load grows quadratically with the number of words—leaves them at a loss.
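The quadratic cost of this "global hand-shaking" is easy to see with back-of-the-envelope arithmetic. The sketch below is a simplification (real implementations batch many heads and layers), counting only the multiplications in a single self-attention head:

```python
def attention_mults(seq_len: int, head_dim: int = 64) -> int:
    """Rough multiply count for one self-attention head:
    every token computes a score against every other token."""
    # scores = Q @ K^T           -> seq_len * seq_len * head_dim multiplies
    # output = softmax(scores)@V -> seq_len * seq_len * head_dim more
    return 2 * seq_len * seq_len * head_dim

# Doubling the input length quadruples the work:
print(attention_mults(1024) / attention_mults(512))  # 4.0
```

On an assembly-line-style chip, this quadratic blow-up is what separates a smooth demo on short prompts from an overheating device on long ones.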

We noticed this problem from the very beginning. There are improvement schemes in the industry, such as Flash Attention and linear attention. But our conclusion is that these are minor fixes inside the "command center" and do not fundamentally change the high-energy-consumption mode of "global hand-shaking."

We finally chose a more radical path: retain the Transformer's powerful feature-extraction ability but completely remove the energy-hungry Attention mechanism, replacing it with a new architecture better suited to running on the "assembly line." The Mamba architecture abroad saw a similar direction. Instead of patching up an F1 car that is unsuited to small roads, we redesigned an off-road vehicle that can run fast on them.

Question: This sounds very complicated. Just to run on smart hardware, you have to redesign an entire architecture. Is offline intelligence really that necessary?

Zou Jiasi: This is an interesting question. We believe it is very necessary, and we have indeed seen strong market demand.

Its necessity lies in several values that cannot be replaced by the cloud:

First, absolute privacy and security. This is the core reason why companies like Apple invest in edge-side technology. The most sensitive data, such as your photo albums, health information, and chat records, should never leave your device. This is a matter of principle.

Second, ultimate real-time interaction. Many scenarios require millisecond-level latency. For example, for a drone running the Yan architecture, when the user shouts "Take a photo when I jump," the model must respond instantly. In such scenarios, any network fluctuation can be fatal, and you cannot rely on the cloud. Another example is future robots, which need to make precise movements based on their unique arm lengths and sensor parameters. This real-time control, tightly bound to the hardware, must be handled by the local "brain."

Third, cost. The prices of cloud APIs seem to be constantly falling, even to free, but costs remain. Take cameras as an example: shipment volumes run into the hundreds of millions. At that scale, no matter how cheap the cloud is, multiplied by hundreds of millions of devices it becomes an astronomical figure. With offline intelligence, the hardware cost has already been paid, and subsequent use incurs almost no additional cost. From a business perspective, for a large fleet of devices, local deployment is by far the most cost-effective solution.

A local model is like a smart butler guarding the door. It respects privacy, ensures security, and understands you personally. Even if it may not be able to solve all the most complex problems, it should be able to handle 80% of daily chores—opening apps, setting reminders, simple translations, meeting minutes, etc.—quickly and securely. For most users, they don't need to handle complex tasks all the time.

Just as Huaqiangbei and brand-name products can coexist: brand-name products are very important, but Huaqiangbei also has its place. Cloud-based models can meet users' high-end needs, while edge-device models can meet most user needs faster, more securely, and more cheaply.

02

What should a model capable of offline intelligence look like?

Question: You mentioned that to achieve offline intelligence, you chose the most difficult path—redesigning an "off-road vehicle." So what is the "engine" of this new vehicle, that is, the core mechanism of your new architecture?

Zou Jiasi: Our core innovation is to abandon the Transformer's high-energy-consumption Attention mechanism with its "global hand-shaking," and return to a lighter "feature-suppression-activation" design combined with partition activation. This cuts the number of parameters actually computed each time to one-tenth or even less, reduces the computing-power requirement to under one-fifth of the original, and cuts power consumption to one-tenth. As mentioned before, in the standard Transformer architecture, all parameters must be fully activated to obtain a highly intelligent answer, no matter how small the task. The human brain, however, does not work this way.

The human brain has roughly 80-90 billion neurons. We can think of it as a model with 80-90 billion parameters. If the brain were fully activated, its power consumption might reach 3,000 or even 4,000 watts, yet its actual power consumption is less than 30 watts.

How does the human brain do this miraculously? It's through partition activation. Our model borrows this approach.

In addition to reducing power consumption, the new architecture also enables us to achieve multi-modality in a 3-billion-parameter model.

To use a not-entirely-precise analogy: when you see a bird, hear its chirping, and read the word "bird" at the same time, your whole brain doesn't light up. Instead, specific, small-scale groups of neurons are activated in different regions such as the visual, auditory, and language areas. It is the independent yet overlapping activation of these partitions that lets us efficiently align forms, sounds, and words.

Transformer models with fewer than 3 billion parameters, because of their global-calculation nature, have difficulty efficiently processing and aligning modal information from different sources. Our brain-like activation mechanism is closer to the brain's partitioned processing mode: different modal inputs naturally activate different partitions, making alignment easier and more precise. Therefore, even with a 3-billion-parameter model, we can still retain strong joint understanding of text, speech, and vision.
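RockAI has not published the internals of Yan's partition activation, but the general idea resembles sparse activation as used in mixture-of-experts systems. The sketch below is a generic illustration under that assumption (all names and sizes are invented): a small gate scores the partitions, and only the top-k actually run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_parts, k = 64, 100, 10               # activate 10 of 100 partitions

# Toy partitioned layer: one small weight matrix per partition.
partitions = [rng.standard_normal((d, d)) for _ in range(n_parts)]
gate = rng.standard_normal((d, n_parts))  # scores which partitions "light up"

def forward(x: np.ndarray):
    scores = x @ gate
    active = np.argsort(scores)[-k:]      # top-k partitions only
    # Only k / n_parts of the layer's parameters are touched this step;
    # the other 90% stay cold, which is where the power savings come from.
    y = sum(x @ partitions[i] for i in active) / k
    return y, active

y, active = forward(rng.standard_normal(d))
print(f"activated {len(active)}/{n_parts} partitions "
      f"= {k * 100 // n_parts}% of parameters")
```

Because the gate depends on the input, a spoken command and a camera frame would naturally light up different partitions, which is the intuition behind the modality-alignment claim above.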

Question: The idea of "partition activation" is really ingenious. But the human brain can afford to activate only a small part because it is a giant model with nearly a trillion parameters. Our current edge-side models have only a few billion parameters; we are already working with limited resources. Can we really expect a small model to achieve better intelligence by activating an even smaller part?

Zou Jiasi: Your question exactly touches on the core of the current development paradigm of large models—what we call the dilemma of compressed intelligence.

Today's pre-trained large models are essentially an exercise in compressed intelligence—like a huge sponge. Training compresses a vast amount of internet data (the water) into a container of hundreds of billions of parameters. The more parameters, the larger the sponge, and the more knowledge it can absorb and store.

This paradigm runs into trouble with multi-modality. Anyone who has compressed files knows that 1 GB of text shrinks far more under compression than 1 GB of video or images, which are already dense and compress poorly. This is why small-parameter Transformer models on the market struggle to add multi-modal capabilities.

So if the only rule of the game is who has the larger sponge, who has memorized more knowledge, then small-parameter models really have no future.
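The compression-ratio point is easy to verify directly. The sketch below uses random bytes as a stand-in for already-compressed image or video data (an illustrative proxy, not a model experiment):

```python
import os
import zlib

text = ("The quick brown fox jumps over the lazy dog. " * 2000).encode()
dense = os.urandom(len(text))  # stand-in for already-compressed media

def ratio(raw: bytes) -> float:
    """Compressed size as a fraction of the original."""
    return len(zlib.compress(raw)) / len(raw)

print(f"repetitive text: {ratio(text):.1%} of original size")
print(f"dense bytes:     {ratio(dense):.1%} of original size")
```

Text squeezes down to a tiny fraction of its size, while incompressible data barely shrinks at all; by the sponge analogy, a fixed parameter budget fills up far faster when it has to absorb images and audio.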

But we believe that true intelligence should not only be about compression but also about growth and learning. This is the fundamental difference in our approach: we don't stick to one path but pursue both compressed intelligence and autonomous learning in parallel.

The significance of the partition activation we mentioned earlier lies not only in energy conservation but also in providing the possibility for growth.

Our current model has only 3 billion parameters. But through fine-grained dynamic partitioning of the neural network, for example dividing it into 100 partitions, only 30 million parameters need to be activated at a time. This means that in the future, within the limits of mobile phone memory, we can make the total parameter count of the edge-side model very large, say tens of billions or more, and by activating only a very small fraction of them, keep the same low power consumption.
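The arithmetic behind this claim is straightforward. The helper below is an illustrative calculation (not RockAI's code) showing how the active-parameter budget can stay flat while the total grows:

```python
def active_params(total: int, n_partitions: int, k_active: int = 1) -> int:
    """Parameters touched per step when only k of n partitions activate."""
    return total * k_active // n_partitions

# Figures from the interview: 3B parameters split into 100 partitions.
print(active_params(3_000_000_000, 100))     # 30000000 -> 30M active

# Grow the model 10x but partition 10x more finely: same active budget.
print(active_params(30_000_000_000, 1000))   # 30000000 -> still 30M
```

Memory capacity still has to hold the full parameter set, but per-step compute and power track only the active slice.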

This overturns the rules of the game. We are no longer researching how to make large models smaller but how to make models grow from small to large on the edge.

So while others compete at compression, through the MCSD architecture, partition activation, and memory neural units, we have found a second and, in our view, more lifelike growth path for edge-side models—sustainable, low-cost autonomous learning. We are not just building a model that can run on edge devices; we are constructing a new, ever-growing brain foundation for the future of edge-side AI.

Question: You mentioned the term "autonomous learning." How should we understand the Yan model's autonomous learning? How is it different from the personalization of current cloud-based models?

Zou Jiasi: Autonomous learning is one of the most exciting technological breakthroughs we want to showcase at this year's WAIC.

Currently, all the cloud-based large models we know of need pre-training to update their intelligence. That is because a model's real learning process—understanding user feedback and reflecting it in changes to its neural network—depends on forward propagation (inference/guessing) and backward propagation (learning/correction). And backward