
NVIDIA's latest research: Small models are the future of intelligent agents.

Facing AI, 2025-08-05 17:41
NVIDIA, a company that sells computing power, is reminding everyone that small models work better for agents.

Researchers from NVIDIA Research and the Georgia Institute of Technology recently published a paper presenting a non-consensus view:

Small Language Models (SLMs), not large and bulky Large Language Models, are the future of AI agents.

https://arxiv.org/abs/2506.02153v1

The authors' core reasoning comes down to three points:

First, SLMs are already powerful enough to handle most of the repetitive, narrowly scoped tasks in AI agents. Second, they are naturally better suited to the architecture of agent systems, being flexible and easy to integrate. Finally, from an economic perspective, SLMs are more cost-effective and efficient, significantly reducing the operating costs of AI.

Imagine an AI agent system as a virtual team that can automatically decompose tasks, call tools (such as a browser or a code editor), and ultimately solve problems. Currently, most AI agents rely on LLMs as their "brains," because LLMs are strong at conversation, have broad knowledge, and can handle all kinds of complex problems.
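To make that concrete, here is a minimal sketch of such an agent loop in Python. Everything in it (the `call_model` stub, the `TOOLS` table, `run_agent`) is an illustrative assumption, not code from the paper:

```python
# Minimal sketch of an agent loop: a model picks the next action, a controller
# executes it against a toolbox. All names here are illustrative assumptions.

def call_model(prompt: str) -> dict:
    # Placeholder: a real system would call an LLM/SLM endpoint here and parse
    # its reply into {"tool": ..., "args": ...}.
    return {"tool": "finish", "args": {"answer": "stub answer"}}

TOOLS = {
    "browser": lambda args: f"fetched {args['url']}",
    "code_editor": lambda args: f"edited {args['file']}",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = call_model("\n".join(history))          # the "brain" decides
        if action["tool"] == "finish":
            return action["args"]["answer"]
        result = TOOLS[action["tool"]](action["args"])   # the toolbox executes
        history.append(f"{action['tool']} -> {result}")
    return "step limit reached"
```

Nothing in this loop requires the deciding model to be large; it only has to emit well-formed actions.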

The paper points out that the AI agent market reached $5.2 billion in 2024 and is expected to soar to $200 billion by 2034, with more than half of enterprises already using agents. But here is the problem: agent tasks are often repetitive and narrow, such as "checking email" or "generating a report." Using an all-rounder LLM for these jobs is like using a supercomputer to play Minesweeper, or driving a Lamborghini to deliver pizza: a huge waste of resources.

And it is not just about waste. The nature of agent tasks also lets small models fit the agent ecosystem better, making them more likely to deliver results that meet the requirements.

Because in essence, an AI agent is not a chatbot but a "toolbox + brain" system. SLMs are small, cheap to train and fine-tune (a few GPU-hours), and easy to adapt to new requirements (such as new regulations). This brings "democratization": more people can build agents, which reduces bias and promotes innovation.

The authors point out that agent tasks exercise only a small slice of an LLM's skills, such as strictly formatted tool calls. And the natural heterogeneity of AI agent systems lends itself to mixing models: the main agent uses an LLM, while sub-tasks use SLMs.

Adopting SLMs at scale also unlocks a data-collection advantage: the specialized data agents naturally generate while running can be used to fine-tune the SLMs, forming a virtuous cycle in which results keep improving.
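The logging half of that cycle can be as simple as appending every routed call to a JSONL file that later becomes fine-tuning data. A minimal sketch; the field names are our own assumptions, not the paper's schema:

```python
import json
from pathlib import Path

LOG_PATH = Path("agent_calls.jsonl")  # hypothetical log location

def log_call(prompt: str, tool: str, output: str, success: bool) -> None:
    """Append one agent interaction as a JSON line. Successful records can
    later be filtered into (prompt -> output) pairs for SLM fine-tuning."""
    record = {"prompt": prompt, "tool": tool, "output": output, "success": success}
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```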

01

What is an SLM?

The paper gives a practical definition of SLMs:

SLMs: models that fit on ordinary consumer devices (such as mobile phones or laptops), run inference fast, and can serve one user's agent requests in real time. As of today, models under roughly 10 billion parameters can generally be considered SLMs; larger models usually require cloud-server support.

Why define it this way? SLMs are like "portable brains," suited to on-device deployment; LLMs are like "cosmic-scale supercomputers," powerful but burdened with high latency and high cost.

The authors offer an extreme but illuminating metaphor: an infinitely large AI would be limited by the speed of light and unable to interact in real time, while an infinitely small one would be too weak to do anything. The human brain is the balance point, and SLMs sit at that same balance point: efficient and easy to iterate on.

The authors compare two types of agent architectures:

On the left, a language model chats with the user and directly triggers each tool, so the entire process runs inside that single model.

On the right, a small controller program handles tool calls, letting the model focus on conversation or on specific reasoning steps.

This split makes it easy to slot in small language models for most calls, saving costs and reserving large models only for the rarer open-ended Q&A or reasoning steps.
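A controller of the second kind might route requests by type, reserving the LLM for open-ended queries. The sketch below is a hedged illustration: `slm_generate`, `llm_generate`, and the intent-based routing rule are assumptions, not the paper's design:

```python
# Sketch of a heterogeneous controller: routine, strictly formatted requests go
# to a cheap local SLM; open-ended dialogue falls back to a large cloud model.
# slm_generate / llm_generate are placeholders for real inference endpoints.

ROUTINE_INTENTS = {"tool_call", "format_report", "extract_fields"}

def slm_generate(prompt: str) -> str:
    return f"[SLM] {prompt[:40]}"   # stand-in for an on-device small model

def llm_generate(prompt: str) -> str:
    return f"[LLM] {prompt[:40]}"   # stand-in for a cloud-hosted large model

def route(intent: str, prompt: str) -> str:
    # Cheap, low-latency path for the repetitive majority of calls...
    if intent in ROUTINE_INTENTS:
        return slm_generate(prompt)
    # ...and the large model only for rarer open-ended Q&A or reasoning.
    return llm_generate(prompt)
```

In practice the routing rule itself could be a tiny classifier; the point is that most traffic never needs to touch the large model.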

The paper shows that the most advanced small models, such as Phi-3 and Hymba, can match 30B-70B-parameter large models in tool use, common-sense reasoning, and instruction following, while using 10-30x less compute in real workflows.

NVIDIA also tested three real - world AI agents and found that:

MetaGPT: 60% of the tasks can be replaced by SLMs

Open Operator: 40% of the tasks can be replaced by SLMs

Cradle (graphical interface automation): 70% of the tasks can be replaced by SLMs

02

If SLMs are so good, why still use LLMs?

The paper argues that the main reason AI agents have not adopted SLMs at scale is path dependence.

A large amount of capital (as much as $57 billion) has been invested in centralized large-model infrastructure. Teams tend to reuse the setups they have already paid for rather than switch to lighter local options, and that is hard to change in the short term.

The industry's "bigger is better" bias is still strong. Research on small models has chased the same broad benchmarks used for large models, and those tests fail to show how well small models perform on agent tasks.

And SLMs have nothing like the name recognition of GPT-4; small models have never gone through the rounds of marketing hype that large models have. As a result, many builders have simply never tried the cheaper, more sensible route.

In this regard, the paper argues that researchers and agent builders can unlock the potential of SLMs in agents by doing the following (a sketch of the resulting pipeline appears after the list):

- Collect and sort data

- Fine-tune SLMs for specific tasks

- Cluster tasks and build the "skills" of SLMs
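Put together, the pipeline might look roughly like the sketch below. It reuses the JSONL log format from the earlier logging sketch, and TF-IDF plus k-means are our illustrative choices for grouping tasks into "skills," not a method from the paper:

```python
# Sketch: cluster logged prompts into recurring "skills" so each cluster can
# seed a fine-tuning set for a specialized SLM.
import json
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_skills(log_path: str = "agent_calls.jsonl", n_skills: int = 5):
    with open(log_path, encoding="utf-8") as f:
        prompts = [json.loads(line)["prompt"] for line in f]
    vectors = TfidfVectorizer().fit_transform(prompts)       # simple text features
    labels = KMeans(n_clusters=n_skills, n_init=10).fit_predict(vectors)
    skills: dict[int, list[str]] = {}
    for prompt, label in zip(prompts, labels):
        skills.setdefault(int(label), []).append(prompt)     # one group per "skill"
    return skills
```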

03

About the paper's Chinese authors

Shizhe Diao

According to his public LinkedIn profile, he studied at Beijing Normal University and then at the Hong Kong University of Science and Technology, and was a visiting scholar at UIUC.

He interned at ByteDance AI Lab and joined NVIDIA as a research scientist in 2024.

Xin Dong

According to his personal blog, he earned his doctorate from Harvard University and has worked or interned at companies including Tencent and Meta.

Yonggan Fu

According to his public LinkedIn profile, he earned his bachelor's degree from the University of Science and Technology of China and completed his doctorate at Rice University and the Georgia Institute of Technology.

He interned at Meta and NVIDIA and is now a research scientist at NVIDIA.

This article is from the WeChat official account "Facing AI". Author: Hu Run, Editor: Wang Jing. Republished by 36Kr with permission.