
Don't let Michelin-starred chefs peel potatoes: NVIDIA uses the "cerebellum" to command the "cerebrum" and rebuild AGI productivity.

New Intelligence Yuan (新智元) · 2025-12-12 09:30
The first step of composite AI

Large models consume excessive computing power, so NVIDIA's 8B Orchestrator model acts as a "tool assembler", cutting costs and raising efficiency by composing tools. With roughly 30% of GPT-5's budget, it scored 37.1% on HLE.

Recently, NVIDIA Research found that with proper fine-tuning, small models are sufficient to "command" large models.

NVIDIA's research team's new model, Orchestrator, is an 8-billion-parameter (8B) model. It not only has higher accuracy and lower costs than previous tool-using AI agents but can also precisely align with users' preferences in tool selection.

In the HLE benchmark test, Orchestrator scored 37.1%, surpassing GPT-5 (35.1%) and improving efficiency by 2.5 times.

In the τ2-Bench and FRAMES tests, Orchestrator also led GPT-5 by a large margin, at only about 30% of the latter's cost.

In multiple indicators, Orchestrator achieves the best balance between performance and cost and can excellently generalize to unseen tools.

Preprint link: https://arxiv.org/abs/2511.21689

Why is "powerful model + tools" still not good enough?

Facing extremely difficult comprehensive reasoning exams like Humanity's Last Exam (HLE), today's large models "know a little about everything" but struggle to combine in-depth reasoning with cost control.

It is difficult for a single large model (such as GPT-5) to call basic tools like search engines and code interpreters while being accurate, cost-effective, and controllable at the same time.

To save money, the industry's first instinct is: don't rely on the most powerful model for everything; create a "dispatcher" to assign tasks.

However, when mainstream large models are used as dispatchers, the result is ironic:

When GPT-5 is used for dispatching, 98% of the requests still go back to GPT-5 or GPT-5-mini;

When Qwen3-8B is used, it simply hands 73% of tasks over to GPT-5 whenever it is unsure.

In other words, we thought we had created a "dispatcher", but in fact we had merely hired an extra front-desk receptionist for transferring calls.

Tasks assigned to different models after using different models as dispatchers

The results show that prompting alone cannot turn ordinary large models into qualified dispatchers.

ToolOrchestra decouples "intelligence" from a single model and rebuilds it as a composite system of a "lightweight scheduling center + heterogeneous-capability toolset", forming a new paradigm of model-tool collaboration.

Next, let's see how Orchestrator is trained.

Orchestrator: Multi-round execution and custom RL

Imagine that previous large models were like high-end restaurants, relying entirely on the "Michelin-starred chef" (GPT-5) to cook everything from start to finish: heat control, knife skills, and plating, all handled by one person.

The result? Because every token is billed at the premium rate, the total cost skyrocketed.

NVIDIA's newly launched "assembled-meal" model is like a central kitchen. A smart "dispatching store manager" (the 8B small model Orchestrator) stays in the center. The manager doesn't cook but:

Asks the "Sichuan cuisine restaurant" on the corner (Qwen-Math-7B) to stir-fry twice-cooked pork (solve math problems);

Hires a "Cantonese dim-sum chef" (Coder-32B) to steam a basket of shrimp dumplings (write code);

When unsure, calls the Michelin-starred chef (GPT-5) to taste and set the flavor.
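The "store manager" idea above can be sketched as a tiny router that sends each subtask to the cheapest specialist expected to handle it, escalating to the expensive model only when confidence is low. The model names, costs, and the confidence threshold here are illustrative assumptions, not NVIDIA's actual routing logic:

```python
# Toy dispatcher: route each subtask to a cheap specialist; escalate to the
# expensive "Michelin chef" model only when the router is unsure.
# Costs and the 0.5 threshold are made-up values for illustration.
TOOLBOX = {
    "math": {"model": "Qwen-Math-7B", "cost_per_call": 0.002},
    "code": {"model": "Coder-32B",    "cost_per_call": 0.010},
}
EXPERT = {"model": "GPT-5", "cost_per_call": 0.120}

def route(subtask_type: str, confidence: float) -> str:
    """Pick a model for the subtask; fall back to the expert when unsure."""
    if confidence < 0.5:  # low confidence -> call in the expert
        return EXPERT["model"]
    return TOOLBOX.get(subtask_type, EXPERT)["model"]

print(route("math", 0.9))  # Qwen-Math-7B
print(route("code", 0.8))  # Coder-32B
print(route("math", 0.3))  # GPT-5
```

The point of the sketch is the asymmetry: the cheap router runs on every subtask, while the expensive model is only a fallback.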

Architecture diagram of Orchestrator

The dispatching 8B Orchestrator is trained with reinforcement learning. Following the user's declared preferences, the system will, for example, automatically favor locally deployed models.

The reward function in the training process can be divided into three parts:

1. Result: whether the answer is correct. +1 for a correct answer, 0 otherwise; correctness is judged by GPT-5;

2. Efficiency: a penalty for monetary cost and latency;

3. Alignment with the user's tool preference vector.

The sum of these three parts is the reinforcement-learning objective. The trained Orchestrator can balance quality and cost, follow user instructions, and save money.
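As a minimal sketch, the three-part objective could be combined like this. The weights, the latency penalty, and the dot-product form of preference alignment are assumptions for illustration; the paper defines its own exact formulation:

```python
# Sketch of the three-part RL reward: result + efficiency + preference
# alignment. Weights (w_eff, w_pref) and the dot-product alignment term
# are illustrative assumptions, not the paper's exact definition.
def reward(correct: bool, cost_usd: float, latency_s: float,
           pref: list[float], usage: list[float],
           w_eff: float = 0.1, w_pref: float = 0.1) -> float:
    # 1. Result: +1 if the final answer is judged correct, else 0.
    r_result = 1.0 if correct else 0.0
    # 2. Efficiency: penalize monetary cost and wall-clock latency.
    r_eff = -(cost_usd + 0.01 * latency_s)
    # 3. Alignment: how well actual tool usage matches the user's
    #    declared tool-preference vector (dot product as a stand-in).
    r_pref = sum(p * u for p, u in zip(pref, usage))
    return r_result + w_eff * r_eff + w_pref * r_pref
```

A correct, free, perfectly preference-aligned answer gets the maximum reward; an expensive or misaligned trajectory is penalized even when correct.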

Orchestrator also includes a human-like step-by-step solving mechanism:

Through chain-of-thought (CoT) reasoning, Orchestrator analyzes the current state and plans the next structured tool call;

Then it executes the call in the environment (e.g., mathematical derivation or code execution) and the result is fed back;

This multi - round cycle is another innovation of Orchestrator.
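The think→call→observe cycle above can be sketched as a simple loop. The `plan` and `execute` functions here are toy stand-ins (a hard-coded planner and a single calculator tool), not the real Orchestrator interfaces:

```python
# Minimal sketch of the multi-round loop: reason about the state, emit a
# structured tool call, execute it, feed the observation back, and repeat
# until the planner decides to answer. `plan`/`execute` are toy stand-ins.
def plan(state: list[str]) -> dict:
    """Toy planner: answer as soon as an observation is available."""
    if any(s.startswith("obs:") for s in state):
        return {"action": "answer", "content": state[-1].removeprefix("obs:")}
    return {"action": "call", "tool": "calculator", "args": "2+2"}

def execute(tool: str, args: str) -> str:
    """Toy environment: only a calculator tool exists here."""
    if tool == "calculator":
        return str(eval(args))  # illustration only; never eval untrusted input
    return "unknown tool"

def run(max_rounds: int = 5) -> str:
    state: list[str] = ["task: compute 2+2"]
    for _ in range(max_rounds):          # the multi-round cycle
        step = plan(state)
        if step["action"] == "answer":
            return step["content"]
        obs = execute(step["tool"], step["args"])
        state.append(f"obs:{obs}")       # feed the result back into the state
    return "no answer"

print(run())  # 4
```

The key property is that each round's observation becomes input to the next planning step, so the orchestrator can react to tool outputs instead of committing to one plan up front.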

Alongside the training method, the team also built ToolScale to support the reinforcement-learning training.

As the first large-scale, verifiable synthetic dataset for multi-round tool calls, ToolScale automatically constructs simulation environments (including databases + tool APIs) in 10 fields (finance, medical, aviation, etc.) through large models and then generates 430,000 tasks with manually marked optimal tool-call trajectories.

Overview of the ToolScale data synthesis process

Each task must pass a three-fold verification:

  • Execution correctness (the database state changes consistently)
  • Process fidelity (key information is mentioned)
  • Operation completeness

These verified data are then used to train Orchestrator.
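A hypothetical sketch of the three checks: a synthetic trajectory is kept only if the database ends in the expected state, the response mentions the key facts, and every required operation was performed. The field names (`db_state`, `key_facts`, `required_ops`, etc.) are illustrative assumptions, not ToolScale's actual schema:

```python
# Sketch of ToolScale's three-fold verification. All field names are
# illustrative assumptions, not the dataset's real schema.
def verify(task: dict, result: dict) -> bool:
    # 1. Execution correctness: final DB state matches the expected state.
    exec_ok = result["db_state"] == task["expected_db_state"]
    # 2. Process fidelity: all key information appears in the response.
    fidelity_ok = all(k in result["response"] for k in task["key_facts"])
    # 3. Operation completeness: every required tool call was made.
    complete_ok = set(task["required_ops"]) <= set(result["ops"])
    return exec_ok and fidelity_ok and complete_ok
```

Only trajectories passing all three checks would enter the training set, which is what makes the synthetic data "verifiable".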

The "pragmatic revolution" of AGI

In three high-difficulty benchmarks, Orchestrator-8B comprehensively surpasses existing methods and significantly reduces reasoning costs:

In HLE (Humanity's Last Exam), it achieved an accuracy of 37.1% (vs. GPT-5's 35.1%) at a cost of only 9.2 cents (30% of GPT-5's);

In τ2-Bench (strict function-call test), it had an 80.2% success rate, and only about 40% of the steps called GPT-5;

In FRAMES (factual reasoning), it scored 76.3% (vs. the SOTA's 74.2%), and latency dropped to 8.2 minutes (41% of GPT-5's).

Comparison of model performance and cost after Orchestrator's dispatching.

Compared with powerful single large-language-model systems, Orchestrator achieves the best cost-effectiveness.

Further analysis reveals that its excellent performance stems from a rational division of labor:

Orchestrator calls low-cost tools such as local retrieval, Math-7B, and Qwen-32B as needed and only calls GPT-5 at key steps (1.95 times per question);

When GPT-5 is used for dispatching, it calls GPT-5-mini 5.23 times on average to solve a question.

Routing less complex problems to low-cost models is the root of Orchestrator's ability to reduce costs and increase efficiency.

Comparison of the proportions of Orchestrator calling different tools

Orchestrator also shows strong generalization: when facing models unseen during training (such as Gemma-3-27B, Codestral-22B) or new pricing strategies (DeepInfra), its performance fluctuates only slightly, proving that it has learned a general strategy of tool-capability abstraction and cost-benefit balancing rather than overfitting to a specific configuration.

In following user preferences, Orchestrator also outperforms other large models, demonstrating customizable, controllable, and interpretable tool dispatching.

The first step of composite AI

In recent years, the AI world has been telling the same story: first build the largest possible general-purpose brain, then use prompts and a few examples to temporarily "dress it up" as a translator, writer, programmer, and so on.

However, as research progresses, this story is starting to change:

More and more "composite AI systems", in which multiple models and tools cooperate, beat single large models on safety, speed, and cost, and sometimes even surpass them in capability.

In summary, facing the high cost and energy consumption of large models, Orchestrator shows that separating "decision-making" from "execution", rather than relying on a single super-model to save the day, can open a new path to efficient, controllable, and scalable practical AGI systems.

ToolOrchestra marks the first step toward truly intelligent composite AI systems, an emerging paradigm that aims to replace single-model architectures.

Small language models will ultimately become the key to realizing scalable agentic AI.

References:

https://arxiv.org/abs/2511.21689

https://developer.nvidia.com/blog/train-small-orchestration-agents-to-solve-big-problems/

https://research.nvidia.com/labs/lpr/ToolOrchestra/

This article is from the WeChat official account "New Intelligence Yuan" (新智元), by the same author, and is published by 36Kr with authorization.