
When a hundred billion parameters meet a 5mm chip

Technology Can't Be Cold (科技不许冷) · 2025-12-10 11:20
A farewell to parameter worship, and a look at a breakthrough in computing power happening in the physical world.

Over the past two years, the global tech community has seemed gripped by a religious frenzy called the Scaling Law. Under the influence of OpenAI and NVIDIA, everyone's attention has been fixed on the exponential growth of parameter counts. From 175 billion parameters to trillions, from H100 to Blackwell, computing power has become justice and scale has become truth. Investors and the media love to debate when GPT-5 will pass the Turing test, as if stacking enough graphics cards would make silicon-based life emerge naturally in the cloud data centers.

However, behind the rapid development of cloud computing power, the engineering community in the physical world is facing a severe challenge.

You must have had this experience: You shout "Turn off the lights" to your smart speaker, but it takes two seconds to respond, or even replies with "Connecting to the network, please try again later" due to Wi-Fi fluctuations. In that awkward moment, the so-called artificial intelligence performs worse than a five-dollar physical switch.

Consumers may only complain a little about this "cloud dependency syndrome." But for "life-critical" terminals such as autonomous driving, industrial robots, and medical emergency equipment, relying entirely on the "super brain" in the cloud is neither realistic nor safe.

Imagine a self-driving car traveling at 100 kilometers per hour. When it detects an obstacle ahead, if it has to upload data to a cloud computing center thousands of miles away, wait for inference to finish, and then receive the braking instruction back, the physical latency of the data's round trip through the optical fiber alone is enough to cause an accident. Not to mention the risk of privacy leakage: Who wants to upload their home camera footage and personal medical records to the public cloud without reservation?

So, in 2025, the technological tide quietly turned. Compared with the remote "super brains" in the cloud, which burn through electricity at a staggering rate, the engineering community began to focus on a more attractive and more challenging proposition: edge AI.

This is not a simple "downsizing," but a brutally unforgiving engineering battle. We need to put models that consume the computing power of thousands of graphics cards on a violent diet and stuff them into an edge chip with an area of only a few square millimeters and a power budget of only a few watts, while still keeping them "intelligent."

Today, we will strip away the specific commercial packaging and review this "brain science" revolution in chips and algorithms from the perspective of the underlying architecture.

When 140GB Meets the Physical Limit of a Few Hundred Megabytes

Before discussing how to do it, we must first understand the physical limit faced by edge AI, which is a desperate computing power paradox.

The current general-purpose large language model (LLM) is a textbook sufferer of "rich man's disease," with an insatiable appetite for resources. Consider the numbers: For a 70-billion-parameter model, merely loading the weights requires about 140GB of GPU memory at FP16. And that is just the "static" footprint. The KV cache generated during inference is another memory-devouring beast, and it grows linearly with the length of the conversation.
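
A quick back-of-the-envelope check makes the gap concrete. The sketch below is illustrative only: the layer count, hidden size, and context length are assumed values, not figures from any specific model.

```python
# Rough memory math for a 70B-parameter model. Architectural numbers
# (layers, hidden size, context length) are illustrative assumptions.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Static memory needed just to hold the weights."""
    return num_params * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, hidden_size: int, context_len: int,
                bytes_per_value: float = 2) -> float:
    """KV cache for one sequence: a K and a V tensor per layer,
    each of shape [context_len, hidden_size]."""
    return 2 * num_layers * context_len * hidden_size * bytes_per_value / 1e9

print(weight_memory_gb(70e9, 2))      # FP16 weights: ~140 GB
print(weight_memory_gb(70e9, 0.5))    # INT4 weights: ~35 GB -- still far too big
print(kv_cache_gb(80, 8192, 4096))    # ~10.7 GB of KV cache at a 4K context
```

Even at INT4, a 70B model dwarfs the few gigabytes, let alone few hundred megabytes, available on edge chips, which is why the surgery described later goes far beyond plain compression of the weights.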

On the edge side, the reality is cruel. Currently, the dedicated memory for the NPU in mainstream automotive chips, smart home system-on-chips (SoCs), and even the latest flagship mobile phones in your hands is often only a few gigabytes. Some entry-level chips with limited resources may even have only a few hundred megabytes.

Stuffing a 140GB behemoth into a few hundred megabytes of space is not just "putting an elephant in a refrigerator"; it is more like cramming every book in a national library into a briefcase you carry with you. And users have added an even more demanding requirement: you must find page 32 of any book in that briefcase, accurately, within 0.1 seconds.

This is the impossible triangle faced by edge AI: high intelligence, low latency, and low power consumption - it's difficult to have all three at the same time.

To break this paradox, the industry has broadly converged on a consensus: The future AI architecture must have a "split personality" - that is, a three-tier hierarchy of "cloud-edge-device."

A cloud alone is not fast enough, and an edge alone is not powerful enough. The future intelligent system will divide up its tasks the way the human nervous system does. The cloud is the "cerebral cortex": it hosts a teacher model with hundreds of billions of parameters and handles extremely complex, non-urgent long-tail problems, such as writing a thesis or planning a long trip. The edge is the "spinal cord" and "cerebellum": it runs directly on the chips next to the sensors and handles high-frequency, real-time, privacy-sensitive tasks, such as voice wake-up and emergency obstacle avoidance.
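
To make that division of labor concrete, here is a minimal, hypothetical routing sketch. The task names, latency threshold, and helper types are invented for illustration; no real product routes requests exactly this way.

```python
# Hypothetical cloud-edge routing sketch. Task categories and thresholds
# are illustrative assumptions, not any vendor's actual logic.

from dataclasses import dataclass

@dataclass
class Request:
    task: str              # e.g. "wake_word", "obstacle_avoidance", "trip_planning"
    deadline_ms: float     # how quickly a response is needed
    privacy_sensitive: bool

def route(req: Request) -> str:
    """Send urgent or private work to the edge model; everything else to the cloud."""
    if req.privacy_sensitive or req.deadline_ms < 100:
        return "edge"      # small distilled model running on the local NPU
    return "cloud"         # large teacher model in the data center

print(route(Request("obstacle_avoidance", deadline_ms=20, privacy_sensitive=False)))   # edge
print(route(Request("trip_planning", deadline_ms=5000, privacy_sensitive=False)))      # cloud
```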

But here is the problem: even acting only as the "spinal cord," today's chips often cannot keep up. Preserving the emergent abilities of large models with only a tiny fraction of the parameters has become the top challenge for algorithm engineers.

The Violent Aesthetics Under Three Scalpels

To run large models on the edge side, algorithm engineers have to act like surgeons and perform a precise operation on the models. This is actually an art of "compromise," finding the delicate balance between accuracy and speed. Currently, the mainstream industry path mainly involves three "scalpels."

The first "scalpel" is knowledge distillation. This is the key for edge models to maintain high intelligence. We don't need the edge model to read all the original Internet data, which requires a huge amount of computing power. We only need it to learn "how to think." So, engineers let the large teacher model in the cloud learn first, extract the core logic, feature distribution, and inference path, and then "teach" them to the small student model on the edge. It's like a professor condensing a million-word academic masterpiece into a few-thousand-word "genius student's notes." Frontline industry practices show that in this way, a small model with 0.5 billion parameters can perform almost as well as a general model with tens of billions of parameters in specific vertical scenarios such as cockpit control and home appliance commands. It may not be able to write poems, but it can definitely understand "Turn up the air conditioner by two degrees."

The second "scalpel" is extreme quantization. This can be said to be the most "violent" aesthetics in the engineering community. General large models usually use FP16 or even FP32 for calculations, with extremely high precision, retaining more than a dozen decimal places. But on the edge side, every bit of storage and transmission consumes power. Engineers have found that large models are actually very "robust," and cutting off some precision doesn't affect the overall situation. So, they use post-training quantization (PTQ) or quantization-aware training (QAT) to directly compress the model weights from FP16 to INT8 or even INT4. This means that a highway that originally needed 16 lanes can now be run on only 4 lanes. The model size is instantly compressed by more than 4 times, and the inference speed is doubled. However, the difficulty lies in "calibration" - how to compress the precision without destroying the model's semantic understanding ability? This requires extremely delicate mathematical tuning to prevent some key outliers from being misjudged.

The third "scalpel" is structural pruning. There are a large number of "redundant" connections in neural networks, just like some neurons in the human brain are not very active. Through structural pruning, the parameters that have little impact on the output results can be directly removed, thereby reducing the amount of calculation at the physical level.

Breaking Down the Memory Wall That Blocks Data

The "slimming down" at the software level is just the first step. The real tough battle lies in the hardware, that is, the chip architecture.

If you ask chip designers what bothers them most about large models, they will probably not say "computation," but "memory access." In the traditional von Neumann architecture, the computing unit and the storage unit are separated. When a large model runs, the data shuttles back and forth between the dynamic random-access memory (DRAM) and the computing unit like vehicles in the morning rush hour.

It is like a chef who can chop vegetables very fast but has to run to the refrigerator in the next room for a scallion after every few cuts. The chef ends up spending most of his time running around instead of chopping. This is the famous "memory wall" crisis. In edge large-model inference, more than 80% of the power is spent not on computation but on moving data.
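
A rough arithmetic-intensity estimate shows why. The sketch below assumes batch-1 token generation, where essentially every weight must be streamed from memory for each new token; the hardware numbers are illustrative assumptions, not the specs of any particular chip.

```python
# Why LLM decoding is memory-bound: a rough arithmetic-intensity estimate
# for batch-1 generation. Hardware numbers are illustrative assumptions.

params = 0.5e9                        # a 0.5B-parameter edge model
bytes_per_param = 1                   # INT8 weights
ops_per_token = 2 * params            # ~2 ops per parameter (multiply + add)
bytes_per_token = params * bytes_per_param   # every weight read once per token

intensity = ops_per_token / bytes_per_token  # ops per byte moved
print(f"arithmetic intensity ~ {intensity:.1f} ops/byte")   # ~2

# An NPU with, say, 10 TOPS of peak compute but 20 GB/s of DRAM bandwidth
# can only be fed 2 ops/byte * 20e9 B/s = 40 GOPS from memory -- a tiny
# fraction of its peak. The bottleneck is data movement, which is exactly
# what in-memory computing and on-chip SRAM attack.
```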

This embarrassing situation has led to a new architectural idea: domain-specific architecture (DSA).

We have observed that hard-tech companies like iFlytek and Horizon Robotics, which have worked at the edge for many years and shipped chips by the hundreds of millions, share one core trait: they no longer put blind faith in general-purpose CPU or GPU architectures, and instead design their silicon specifically around the Transformer model, giving it "special treatment" in hardware.

First is the exploration of in-memory computing. Since it's too tiring for the chef to run around, just move the refrigerator into the kitchen, or even install the cutting board directly on the refrigerator door. By minimizing the physical distance between the storage unit and the computing unit, or even performing calculations directly in the static random-access memory (SRAM), the "toll" for data transfer is greatly reduced.

Second is heterogeneous computing scheduling. Inside the SoC, a fine division of labor is carried out: The CPU is responsible for process control, the digital signal processor (DSP) is responsible for signal processing such as noise reduction, and the most intensive matrix multiplication operations are assigned to the highly customized neural processing unit (NPU).
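
A toy dispatch table makes the idea tangible. The unit names and the operator-to-unit mapping below are illustrative assumptions, not a real SoC's scheduler:

```python
# Hypothetical heterogeneous-dispatch sketch: map operator types inside one
# SoC to the unit best suited for them. Mapping is illustrative only.

DISPATCH_TABLE = {
    "control_flow":   "CPU",   # branching, orchestration, OS interaction
    "audio_frontend": "DSP",   # beamforming, noise suppression, feature extraction
    "matmul":         "NPU",   # dense matrix multiplies in attention / MLP layers
    "softmax":        "NPU",   # kept on the NPU to avoid round-trips to DRAM
}

def dispatch(op_type: str) -> str:
    return DISPATCH_TABLE.get(op_type, "CPU")   # fall back to the CPU

for op in ("audio_frontend", "matmul", "control_flow"):
    print(f"{op:>14} -> {dispatch(op)}")
```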

The most crucial is operator hardening. For the large model's core attention mechanism, chip design teams "engrave" the acceleration circuit directly into the silicon. The approach sacrifices some generality, but it is extremely efficient for large-model inference. This "algorithm-defined chip" strategy lets edge solutions respond within milliseconds for voice wake-up and command recognition. And it is not the technical choice of a single company; it is the "optimal compromise" the entire edge AI chip industry has reached to push past the bottleneck of Moore's Law.
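
What gets "engraved into silicon" is, at its heart, scaled dot-product attention. A plain NumPy reference of that computation is below, for illustration only; hardened implementations fuse and tile these steps so the intermediate matrices stay in on-chip SRAM rather than spilling to DRAM.

```python
# Reference scaled dot-product attention -- the operator that attention
# acceleration circuits harden. NumPy version, for illustration only.

import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                       # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ V                                  # weighted sum of values

Q = np.random.randn(8, 64); K = np.random.randn(8, 64); V = np.random.randn(8, 64)
print(attention(Q, K, V).shape)   # (8, 64)
```

Freezing this specific dataflow into circuitry is exactly the trade of generality for the millisecond-level responses described above.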

From the All-Knowing God to the Skilled Craftsman

In addition to focusing on hardware, another more practical approach is: Acknowledge the limitations of AI and move from "general" to "specialized."

General large models tend to know a little about everything but master nothing. They are prone to "hallucinations," spouting nonsense with a straight face. That may count as creativity when writing science fiction, but it is a disaster in medical diagnosis or industrial control.

At this time, the "platform-based" strategy of manufacturers like SenseTime Medical seems very smart. Facing the pain points of complex data and limited computing power in the medical industry, they didn't try to create an all-knowing and all-powerful "AI doctor." Instead, they built an assembly line to produce various specialized "special forces."

By encapsulating the technology into a "model production platform," hospitals can train specialized models for specific diseases based on their own high-quality data. This idea essentially transforms AI from an "all-round doctor" into a "skilled craftsman."

This kind of "small and beautiful" vertical intelligent agent requires less computing power but provides more reliable diagnostic suggestions. Doctors don't need an AI that can write code and draw pictures. They need an assistant that can accurately read CT scans and quickly organize medical records.

The same logic applies to iFlytek's industrial path: Instead of burning money in the red ocean of general large models, they focus on edge technologies and chips in vertical fields such as healthcare and smart homes, gain data feedback, and then use it to support basic research.

Behind these different paths leading to the same goal is the collective awakening of the entire Chinese AI industry: Stop blindly pursuing the "size" of parameter scale and turn to pursuing the "practicality" of application implementation.

Finally

Under the media spotlight, people are keen to discuss how OpenAI's Sora has shocked the world, or argue about when GPT-5 will pass the Turing test, and always associate AGI with the grand narrative of "destroying humanity."

But in the corners the spotlight doesn't reach - in Shenzhen's Huaqiangbei, in Suzhou's industrial parks, in Shanghai's Zhangjiang - thousands of engineers are doing more boring but perhaps more subversive work: bringing down the price of AI and shrinking its size.

From the cloud to the edge, from general to vertical, this is not only the evolution of the technical architecture but also the return of AI values.

True "intelligence of all things" doesn't mean that everyone has to be constantly connected to an all-knowing and all-powerful cloud brain like God. Instead, all things - whether it's the air conditioner beside you, the dashboard in your car, or the CT machine in the hospital - should have a small but smart and independent "core."

When a chip that costs only a few dozen dollars can run a large model with logical reasoning ability without relying on a fragile network cable, the singularity of the intelligent era will truly arrive.

Technology shouldn't just be a ghost in the server. It should be embedded in every piece of glass and every chip in our lives in the most hardcore and silent way, like a deep and quiet stream.

This article is from the WeChat official account "Technology Can't Be Cold", written by Balang, and published by 36Kr with authorization.