Jensen Huang enters the OpenClaw battlefield, and the most powerful open-source "lobster" model is closing in on Opus 4.6.
The world's most valuable company has also entered the OpenClaw battlefield!
Last night, NVIDIA officially unveiled its new-generation open-source model, Nemotron 3 Super, built specifically for large-scale AI agents.
It has 120 billion total parameters, 12 billion active parameters, and a 1-million-token context. Its inference speed has tripled, and its throughput has increased fivefold.
Nemotron 3 Super adopts an innovative Mamba-MoE hybrid architecture, which completely solves the performance bottleneck in multi-agent collaboration.
Moreover, it is the first model in the Nemotron 3 family to achieve three breakthroughs:
Natively pre-trained at NVFP4 precision;
A brand-new LatentMoE mixture-of-experts architecture that pushes "accuracy per unit of compute" and "accuracy per parameter" to the extreme;
An MTP (Multi-Token Prediction) layer that boosts inference speed via native speculative decoding.
On the Pinchbench benchmark, Nemotron 3 Super leads the pack and firmly holds the top spot among open-source models.
On OpenClaw task success rate, it scored a high 85.6%, approaching the performance of Claude Opus 4.6 and GPT-5.4.
It is fair to say that the "strongest open-source model", perfectly adapted to OpenClaw, has been born!
Today, the over 10 trillion tokens of pre-training and post-training data for Nemotron 3 Super, the complete training methodology, and 15 reinforcement learning environments have all been open-sourced.
Address: https://huggingface.co/collections/nvidia/nvidia-nemotron-v3
NVIDIA's 120-billion-parameter behemoth makes a grand entrance, a perfect match for OpenClaw
Now, as applications evolve from the chatbot stage toward multi-agent systems, there are usually "two walls" to break through.
The first is context explosion.
Multi-agent workflows generate up to 15 times more tokens than regular conversations, because each interaction must resend the complete history, including tool outputs and intermediate reasoning.
When performing long-horizon tasks, this flood of context not only drives up costs but also easily causes goal drift, where the agent gradually deviates from its originally set objective.
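The cost of resending the full history can be sketched with a toy calculation. This is an illustrative example, not a figure from the report; the per-turn token count is hypothetical.

```python
# Illustrative sketch: why resending the full history makes multi-agent
# token usage balloon. The per-turn numbers here are made up.

def total_tokens_sent(turns, tokens_per_turn):
    """Each turn appends new content and resends the entire history,
    so cumulative tokens sent grow quadratically with the turn count."""
    history = 0
    total = 0
    for _ in range(turns):
        history += tokens_per_turn   # new tool output / reasoning appended
        total += history             # the whole history is sent again
    return total

# A 40-turn agent loop adding 2k tokens per turn:
print(total_tokens_sent(40, 2_000))   # 1640000 tokens sent in total
```

Forty modest turns already push past 1.6 million tokens sent, which is why long-horizon agent workflows strain both cost and context limits.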
The second is the "thinking tax".
Complex agents must reason at every step. But if an LLM is called for every subtask, a multi-agent application becomes extremely expensive and slow to respond, making it hard to deploy in practice.
To address this, NVIDIA's open-source Nemotron 3 Super breaks both of these shackles on agent applications.
Paper address: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf
As mentioned above, Nemotron 3 Super has a 1-million-token context.
In the OpenClaw environment especially, the AI can keep the entire workflow state in memory, ensuring logical consistency from the first step to the last.
On Artificial Analysis, Nemotron 3 Super set a new SOTA, topping both the efficiency and open-source leaderboards.
Among open-source models of comparable scale, the new model's accuracy is far ahead.
Meanwhile, the NVIDIA AI-Q research agent powered by the new model has taken the top spot on the DeepResearch Bench and DeepResearch Bench II leaderboards.
Over the next five years, NVIDIA will invest $26 billion to build the world's top open-source models
Hybrid architecture revolution: throughput soars fivefold
This time, NVIDIA has reconstructed the underlying architecture of Nemotron 3 Super.
The 88-layer network uses a periodic alternating arrangement: Mamba-2 layers handle efficient sequence modeling with linear time complexity, while a small number of interspersed Transformer attention layers serve as "global anchors", responsible for long-range information routing across positions and high-precision reasoning.
As a result, compared with the previous-generation Nemotron Super, throughput is up to five times higher and accuracy up to twice as high.
Compared with GPT-OSS-120B and Qwen3.5-122B, Nemotron 3 Super achieves the highest scores.
Moreover, with an 8k input sequence and a 64k output sequence, its throughput is 2.2 times and 7.5 times that of GPT-OSS-120B and Qwen3.5-122B, respectively.
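The periodic alternating arrangement can be sketched as a simple layer schedule. The report only says the 88-layer stack intersperses a small number of attention layers among Mamba-2 layers; the one-in-eleven period below is a made-up placeholder, not the model's real ratio.

```python
# Hedged sketch of a periodic Mamba/attention schedule. The 88-layer depth
# comes from the article; attention_every=11 is a hypothetical placeholder.

def layer_schedule(n_layers=88, attention_every=11):
    """Return a list of layer types: mostly Mamba-2 layers, with a
    Transformer attention layer inserted periodically as a 'global anchor'."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(n_layers)
    ]

sched = layer_schedule()
print(sched.count("mamba2"), sched.count("attention"))   # 80 8
```

The point of the pattern: the cheap linear-time Mamba-2 layers do the bulk of sequence processing, and the few attention layers give every position a periodic chance to exchange information globally.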
LatentMoE: hardware-aware expert design that squeezes accuracy out of every byte
More importantly, Nemotron 3 Super introduces LatentMoE for the first time.
LatentMoE's solution is ingenious: before routing and expert computation, tokens are first projected from the hidden dimension d down to a smaller latent dimension ℓ, and both routing and the expert computation are carried out in this much smaller space.
This directly cuts the expert parameters to be loaded and the cross-GPU communication volume by a factor of d/ℓ!
The saved budget can then be spent on increasing the total number of experts and the number of activated experts by the same factor, yielding an accuracy boost essentially for free, with almost no change in inference cost.
NVIDIA's official blog puts it more intuitively: activate 4 experts for the compute cost of 1.
Compared with traditional MoE, LatentMoE comes out ahead in both parameter utilization and compute utilization.
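The down-project/route/up-project flow described above can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not NVIDIA's implementation; all dimensions, the routing rule, and the expert shapes are assumptions.

```python
# Minimal NumPy sketch of the LatentMoE idea: project tokens from hidden
# dim d to a smaller latent dim l, route and run the expert computation in
# that latent space, then project back. All shapes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, l, n_experts, top_k = 512, 128, 16, 4   # l << d cuts expert size by d/l

W_down = rng.standard_normal((d, l)) * 0.02    # d -> latent projection
W_up   = rng.standard_normal((l, d)) * 0.02    # latent -> d projection
router = rng.standard_normal((l, n_experts)) * 0.02
experts = rng.standard_normal((n_experts, l, l)) * 0.02  # small latent experts

def latent_moe(x):                      # x: (tokens, d)
    z = x @ W_down                      # routing happens in the latent space
    logits = z @ router
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # pick top-k experts
    out = np.zeros_like(z)
    for t in range(z.shape[0]):
        for e in top[t]:
            out[t] += z[t] @ experts[e] / top_k     # experts run in dim l
    return out @ W_up                   # project back to hidden dim

y = latent_moe(rng.standard_normal((4, d)))
print(y.shape)   # (4, 512)
```

With d/ℓ = 4 here, each expert matrix is 16x smaller than a d-by-d expert would be, which is the budget that can be reinvested in more (and more active) experts.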
Multi-Token Prediction: both performance and inference efficiency
Nemotron 3 Super also adds a powerful feature, Multi-Token Prediction (MTP), which delivers both model quality and inference efficiency.
Traditional training predicts only the next token; MTP requires the model to predict several future tokens at once at every position.
This effectively forces the model to grasp multi-step causal relationships and longer-range text structure.
The results bear this out: both validation loss and downstream scores genuinely improve.
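The target layout of multi-token prediction is easy to show concretely. The sketch below only illustrates how the training targets shift; the actual MTP head design in Nemotron is not described here, and k=3 is an arbitrary choice.

```python
# Toy sketch of multi-token-prediction targets: at each position the model
# is trained to predict the next k tokens, not just one. k=3 is arbitrary.

def mtp_targets(tokens, k=3):
    """For each position i, the targets are tokens[i+1 : i+1+k]."""
    return [tokens[i + 1 : i + 1 + k] for i in range(len(tokens) - k)]

print(mtp_targets([10, 11, 12, 13, 14], k=3))
# [[11, 12, 13], [12, 13, 14]]
```

Because every position must account for k steps of the future at once, the model cannot get away with purely local next-token shortcuts.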
Beyond making the model smarter, MTP's greatest advantage is enabling native speculative decoding.
The additional prediction heads amount to a "draft model" built into the model itself.
During inference, the prediction heads quickly draft candidates for the next few tokens, and the main model then verifies all of them in a single forward pass.
This significantly reduces generation latency, and compared with a separate external draft model, the extra compute (FLOPs) it adds is negligible.
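The draft-then-verify acceptance step can be sketched as follows. This is a generic speculative-decoding skeleton with stand-in token lists, not Nemotron's actual decoding code.

```python
# Hedged sketch of the speculative-decoding acceptance loop: draft heads
# propose tokens, the main model verifies them in one pass, and the longest
# agreeing prefix is accepted. Token values below are stand-ins.

def speculative_step(draft_tokens, verify_tokens):
    """Accept every draft token the verifier agrees with; on the first
    mismatch, take the verifier's correction and stop."""
    accepted = []
    for d, v in zip(draft_tokens, verify_tokens):
        if d == v:
            accepted.append(d)      # draft guess confirmed for free
        else:
            accepted.append(v)      # verifier's correction, then stop
            break
    return accepted

# Draft proposes 4 tokens; the verifier agrees with the first two:
print(speculative_step([5, 8, 2, 9], [5, 8, 7, 1]))   # [5, 8, 7]
```

Each verified forward pass can thus emit several tokens instead of one, which is where the latency reduction comes from.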
Native NVFP4-precision pre-training
As NVIDIA's vice president of research, Bryan Catanzaro, put it, Nemotron 3 Super is designed for Blackwell.
During pre-training, the team used NVFP4 precision throughout on the Blackwell platform, significantly reducing GPU memory requirements.
Moreover, with no loss of accuracy, the new model's inference is four times faster than FP8 on the Hopper architecture.
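The general idea of block-scaled 4-bit formats can be illustrated with a generic integer stand-in: values are quantized in small blocks, each carrying its own scale. This is not NVIDIA's actual NVFP4 format (which uses FP4 values and its own scaling scheme); it only shows why per-block scales keep the round-trip error small.

```python
# Generic block-scaled 4-bit quantization sketch, in the spirit of formats
# like NVFP4. This uses int4-range integers as a stand-in, not real FP4.
import numpy as np

def quantize_block(x, block=16):
    """Quantize to 4-bit integers in [-7, 7] with one scale per block."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0                      # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    return (q * scale).reshape(-1)

x = np.linspace(-1.0, 1.0, 32)
q, s = quantize_block(x)
x_hat = dequantize_block(q, s)
print(float(np.abs(x - x_hat).max()) < 0.1)   # True: small round-trip error
```

Storing 4-bit values plus a small per-block scale is what shrinks memory traffic relative to FP8 or BF16 while keeping quantization error bounded within each block.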
25 trillion tokens + 21 RL environments, targeting AI agents
Like Nemotron 3 Nano before it, Nemotron 3 Super was trained on 25 trillion tokens of text data.
The entire pre - training is divided into two steps:
Phase 1: consumes 80% of the data (20 trillion tokens), focusing on data diversity and breadth of knowledge. The corpus spans 16 major categories, including web crawls, code, mathematics, academic papers, and multilingual data;
Phase 2