Small models are also the future of embedded systems.
In recent days, a research conclusion from NVIDIA has drawn wide attention across the industry: small language models (SLMs) are the future of AI agents. Shortly afterwards, NVIDIA introduced a new small language model, Nemotron-Nano-9B-V2, which achieved the best performance among comparable models on several benchmarks.
In fact, the SLM trend has also reached the world of MCUs and MPUs.
Small models are just "compressed" large models
Many of us have likely encountered small language models (SLMs) already. SLMs range from millions to a few billion parameters, while LLMs run to hundreds of billions or even trillions.
SLMs are typically obtained by compressing LLMs: the goal is to shrink the model while preserving as much of its accuracy as possible. Common methods include the following (a code sketch follows the list):
Knowledge distillation: Train a smaller "student" model using knowledge transferred from a large "teacher" model;
Pruning: Remove redundant or less important parameters in the neural network architecture;
Quantization: Reduce the numerical precision used in calculations (for example, convert floating-point numbers to integers).
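To make the quantization idea concrete, here is a minimal, self-contained Python sketch of symmetric int8 weight quantization. It is illustrative only; real toolchains (per-channel scales, calibration data, quantization-aware training) are considerably more sophisticated.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0            # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# A toy weight matrix standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(256, 256)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(f"storage: {w.nbytes} B (fp32) -> {q.nbytes} B (int8)")  # 4x smaller
print(f"max abs error: {np.abs(w - w_hat).max():.6f}")
```

The same 4x storage saving is what lets a 1-billion-parameter model fit in roughly 1 GB instead of 4 GB, which matters enormously on memory-constrained embedded hardware.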
Small language models are more compact and efficient than large models. Therefore, SLMs require less memory and computing power, making them very suitable for resource-constrained edge or embedded devices.
Many small but powerful language models have emerged, proving that size isn't everything. Common SLMs in the 1-to-4-billion-parameter range include Llama 3.2-1B (a 1-billion-parameter variant from Meta), Qwen2.5-1.5B (a 1.5-billion-parameter model from Alibaba), DeepSeek-R1-1.5B (a 1.5-billion-parameter model from DeepSeek), SmolLM2-1.7B (a 1.7-billion-parameter model from HuggingFaceTB), Phi-3.5-Mini (a 3.8-billion-parameter model from Microsoft), and Gemma 3 4B (a 4-billion-parameter model from Google DeepMind).
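For readers who want to try one of these models on a desktop before targeting embedded hardware, the sketch below loads SmolLM2-1.7B through the Hugging Face transformers library. It assumes transformers and PyTorch are installed and the "HuggingFaceTB/SmolLM2-1.7B" checkpoint is available; deploying on an MCU or MPU would instead go through a vendor toolchain (quantization and compilation for the NPU), not this Python API.

```python
# Minimal desktop test drive of an SLM; not an embedded deployment path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM2-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Edge AI matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```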
Running SLMs isn't just about computing power
For MPUs, running SLMs is not especially difficult. But how can developers tell whether an MCU can support generative AI?
There is no single, straightforward answer to this question, but there is one hard requirement: the MCU's neural processing unit (NPU) must be able to accelerate Transformer operations.
In addition, running generative AI requires a high-bandwidth system bus and a large, fast, tightly coupled memory configuration.
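To make "accelerating Transformer operations" concrete: the core kernel is scaled dot-product attention, two large matrix multiplications wrapped around a softmax. The NumPy sketch below illustrates the math an NPU has to handle; it is not any vendor's kernel.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention, the core Transformer kernel:
    two large matmuls around a softmax. Weights and activations
    stream through memory, which is why bus and memory bandwidth
    matter as much as raw GOPS."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)   # (seq, seq) matmul
    return softmax(scores) @ v      # second matmul

seq_len, d_head = 128, 64
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(seq_len, d_head)) for _ in range(3))
print(attention(q, k, v).shape)     # (128, 64)
```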
Today, many people compare microcontrollers' raw throughput using only GOPS (billions of operations per second) or TOPS (trillions of operations per second). The best-performing MCUs currently deliver up to 250 GOPS, and MCUs aimed at generative AI will offer at least twice that. However, raw throughput is a poor indicator of actual system performance.
That is because a successful generative AI application must support Transformer operations and move large volumes of data around the system: between memory, the neural processing unit, the central processing unit, and peripherals such as the image signal processor. A system with high raw throughput may in theory process data quickly, but if it cannot feed data to the NPU fast enough, real-world performance will be slow and disappointing.
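A back-of-envelope calculation shows why. During autoregressive generation, essentially every weight must stream through the NPU once per token, so the token rate is capped by memory bandwidth divided by model size. The numbers below are assumptions chosen for illustration, not measurements of any chip.

```python
# Back-of-envelope: token rate is capped by memory bandwidth, not GOPS.
# All numbers below are assumptions chosen for illustration.
params = 1.0e9           # a 1B-parameter SLM
bytes_per_param = 1      # int8-quantized weights
model_bytes = params * bytes_per_param

bandwidth_bps = 8e9      # assume 8 GB/s of effective memory bandwidth

# Autoregressive decoding touches roughly every weight once per token,
# so the bandwidth-bound ceiling on generation speed is:
tokens_per_s = bandwidth_bps / model_bytes
print(f"ceiling: ~{tokens_per_s:.1f} tokens/s")   # ~8 tokens/s

# More compute does not raise this ceiling; a wider bus, tightly
# coupled memory, or a smaller / more aggressively quantized model does.
```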
The same applies to MPUs: high bandwidth and tight coupling between memory and the bus are just as crucial.
The SLM collaboration between Aizip and Renesas
As early as last August, Aizip and Renesas partnered to demonstrate ultra-efficient SLMs and compact AI agents for edge applications on MPUs. These small, efficient models have been integrated into Renesas RZ/G2L and RZ/G3S boards based on the Arm Cortex-A55.
Aizip created a family of ultra-efficient small language models and AI agents called Gizmo, ranging from 300 million to 2 billion parameters. The models support multiple platforms, including MPUs and application processors, across a wide range of applications.
SLMs let on-device AI agents at the edge deliver the same functions as large language models (LLMs), but with a far smaller footprint. On-device models offer stronger privacy protection, resilient operation, and cost savings. Although some companies have succeeded in shrinking language models for mobile, guaranteeing accurate tool calls for automation on low-cost edge devices remains a major challenge for these SLMs.
Reportedly, on an RZ/G2L with a single Cortex-A55 core running at 1.2 GHz, these SLMs achieve response times of under 3 seconds.
MCU vendors are also stepping up their investment in SLMs
Alif Semiconductor recently released its latest series of MCUs and fusion processors, the Ensemble E4, E6, and E8, aimed primarily at running generative AI models, including SLMs. Alif is also the first chip supplier to adopt the Arm Ethos-U85 NPU, which supports Transformer-based machine learning networks.
According to benchmark results, the series can perform energy-efficient object detection in under 2 milliseconds and image classification in under 8 milliseconds, and an SLM running on the E4 device consumes only 36 mW while generating text for a story from a user-provided prompt.
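To put 36 mW in perspective, here is a rough energy calculation. The battery is an assumption chosen for illustration, and the estimate ignores everything except the SLM itself (no duty cycling, no peripherals).

```python
# Rough energy arithmetic for the 36 mW figure above. The battery is an
# assumption for illustration (a CR2032 coin cell, ~225 mAh at 3 V).
cell_energy_j = 0.225 * 3.0 * 3600   # ~2430 J stored in a CR2032
slm_power_w = 0.036                  # 36 mW while generating text

hours = cell_energy_j / slm_power_w / 3600
print(f"~{hours:.0f} h of continuous generation")  # roughly 19 h
```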
The Ensemble E4 (an MCU) uses dual Arm Cortex-M55 cores, while the Ensemble E6 and E8 fusion processors combine Arm Cortex-A32 cores with dual Cortex-M55 cores. Notably, the E4, E6, and E8 are all equipped with dual NPUs, an Ethos-U55 plus an Ethos-U85, giving them substantial compute.
Alif believes it started earlier than other manufacturers: the first-generation Ensemble MCU series launched back in 2021, and E1, E3, E5, and E7 devices have been shipping in volume since. While other MCU makers are still on their first-generation AI MCUs, Alif has released its second generation, which it says includes the industry's first MCUs to support Transformer-based networks, the foundation of LLMs and other generative AI models.
SLMs will be the future of embedded systems
SLMs shrink model size dramatically while preserving as much accuracy as possible. That efficiency and compactness make them a natural fit for resource-constrained edge and embedded devices, bringing those devices unprecedented intelligence.
Indeed, the picture of edge AI is gradually coming into focus, and SLMs will be one of the key areas in which MCU and MPU manufacturers position themselves.
For example, STMicroelectronics' STM32N6, Infineon's latest-generation PSoC Edge MCUs, TI's AM62A and TMS320F28P55x, NXP's i.MX RT700 and i.MX 95, and ADI's MAX7800X all place significant emphasis on NPUs.
Embedded AI started out mainly as a feature of relatively expensive microprocessor-based products running Linux. But the market soon recognized that there is room for AI in edge and endpoint devices, many of which are MCU-based. By the second half of 2025, leading MCU manufacturers will have AI-capable products in their portfolios. Their NPUs fall into two camps: those built on Arm Ethos IP and those built on in-house designs. The latest Ethos-U85 now supports Transformers, and SLMs were demonstrated running on it half a year ago; other vendors continue to follow suit. Looking ahead, SLMs seem likely to thoroughly reshape the MCU and MPU landscape.
This article is from the WeChat official account "Electronic Engineering World", author: Fu Bin, published by 36Kr with permission.