Using only 512 H200 GPUs, this 106B model breaks through with distributed RL, and the entire stack has been open-sourced.
[Introduction] INTELLECT-3, released by Prime Intellect, achieves the strongest performance among models of its scale on benchmarks spanning mathematics, code, and more. By opening up the technology stack used to train frontier models, Prime Intellect aims to bring large-scale RL research to the broader community.
Recently, Prime Intellect officially released INTELLECT-3.
This is a 106B-parameter Mixture-of-Experts (MoE) model trained with Prime Intellect's reinforcement learning (RL) stack.
Across benchmarks in mathematics, code, science, and reasoning, it achieves the strongest results among models of its scale, surpassing even many larger frontier models.
Prime Intellect has open-sourced the entire training pipeline, including model weights, training framework, datasets, RL environments, and evaluation system, hoping to encourage more open research on large-scale reinforcement learning.
The training software and infrastructure used for INTELLECT-3 are identical to what will be made available to everyone on the Prime Intellect platform.
This means that, in the future, any individual or company will be able to post-train the most advanced models.
SOTA Across Multiple Benchmarks
INTELLECT-3 is a 106B-parameter Mixture-of-Experts (MoE) model, built on GLM-4.5-Air through supervised fine-tuning (SFT) and reinforcement learning.
It achieves the strongest performance among models of its size on benchmarks in mathematics, code, science, and reasoning.
Training Framework
During training, Prime Intellect used the following core components:
- PRIME-RL: A self-developed distributed RL framework supporting supervised fine-tuning and reinforcement learning of large-scale MoE models.
- Verifiers and Environments Hub: A unified environment interface and ecosystem for agentic RL environments and evaluations (a usage sketch follows this list).
- Prime Sandboxes: A high-throughput, secure code-execution system for agentic coding environments.
- Compute orchestration: Scheduling and management across 512 NVIDIA H200 GPUs on 64 interconnected nodes.
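As a rough illustration of how these pieces fit together, the sketch below loads a Hub environment through Verifiers and evaluates a model against it. The environment id, endpoint, and model name are placeholders, and the exact signatures may differ from the released API.

```python
import verifiers as vf
from openai import OpenAI

# Load an environment installed from the Environments Hub.
# "math-python" is a placeholder id, not necessarily a real Hub entry.
env = vf.load_environment("math-python")

# Point Verifiers at any OpenAI-compatible inference endpoint
# (for example, a local vLLM server) and run a small evaluation.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
results = env.evaluate(client=client, model="intellect-3", num_examples=8)
print(results)
```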
INTELLECT-3 was trained end-to-end with PRIME-RL.
The framework is deeply integrated with the Verifiers environment interface, supporting the entire post-training pipeline, from synthetic data generation and supervised fine-tuning to reinforcement learning and evaluation.
Through its tight integration with the Environments Hub, the training system has seamless access to a continuously growing collection of environments and evaluation tasks.
The most distinctive feature of PRIME-RL is its fully asynchronous (async-only) design.
The research team confirmed this while developing the previous generation, INTELLECT-2: the future of RL is asynchronous, with training always running in a slightly off-policy state.
With long-horizon agent rollouts, asynchrony is the only way to avoid generation bottlenecks and truly scale up training.
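To make the idea concrete, here is a minimal, self-contained toy of that async-only pattern (illustrative names only, not PRIME-RL's actual API): rollout workers always sample from the most recently published weights, which can lag the trainer by a step, so every batch is slightly off-policy.

```python
import asyncio
import random

BATCH_SIZE = 4

class PolicyStore:
    """Holds the latest published policy; workers read it asynchronously."""
    def __init__(self):
        self.version, self.weights = 0, 0.0

    def latest(self):
        return self.version, self.weights

    def publish(self, weights):
        self.version += 1
        self.weights = weights

async def rollout_worker(store, queue):
    while True:
        version, weights = store.latest()                # may be one step stale
        await asyncio.sleep(random.uniform(0.01, 0.05))  # stand-in for a long agentic rollout
        await queue.put((version, weights + random.gauss(0, 1)))

async def trainer(store, queue, steps=5):
    for step in range(steps):
        batch = [await queue.get() for _ in range(BATCH_SIZE)]
        stale = sum(1 for v, _ in batch if v < store.version)
        # The update must tolerate off-policy data (e.g. importance weighting),
        # since some trajectories were sampled from older policy versions.
        store.publish(sum(t for _, t in batch) / len(batch))
        print(f"step {step}: {stale}/{len(batch)} rollouts were off-policy")

async def main():
    store, queue = PolicyStore(), asyncio.Queue()
    workers = [asyncio.create_task(rollout_worker(store, queue)) for _ in range(8)]
    await trainer(store, queue)
    for w in workers:
        w.cancel()

asyncio.run(main())
```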
Over the past six months, the team has run extensive ablation experiments on performance, stability, and efficiency at scale; INTELLECT-3 is the product of that work.
Prime Intellect will also offer hosted PRIME-RL on its upcoming Lab platform, allowing users to run large-scale RL training without wrestling with complex infrastructure.
Training Environments
The training environments for INTELLECT-3 are built with the Verifiers library and hosted on the Environments Hub, Prime Intellect's community hub for RL environments and evaluations.
Verifiers is a leading open-source toolkit for building RL environments and model evaluation tasks.
Its modular, extensible components let complex environment logic be expressed concisely while sustaining very high performance and throughput.
Traditional RL frameworks usually couple environments tightly to the training repository, which complicates versioning, ablations, and external contributions.
The Environments Hub instead publishes Verifiers-based environments as standalone, version-pinnable Python modules with a unified entry point, so tasks can be versioned, shared, and iterated on independently.
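As a sketch of that convention, a Hub environment package might look like the toy below, exposing a single load_environment() entry point. The module name, dataset, and reward function are all made up here; consult the Verifiers docs for the exact interface.

```python
# math_digits.py -- a toy Verifiers environment package (hypothetical example).
import verifiers as vf
from datasets import Dataset

def exact_match(completion, answer, **kwargs) -> float:
    """Reward 1.0 when the final response contains the reference answer."""
    text = completion[-1]["content"] if isinstance(completion, list) else completion
    return 1.0 if answer in text else 0.0

def load_environment(**kwargs) -> vf.Environment:
    """Unified entry point: build and return the environment."""
    dataset = Dataset.from_list([
        {"question": "What is 17 * 3?", "answer": "51"},
        {"question": "What is 12 + 30?", "answer": "42"},
    ])
    rubric = vf.Rubric(funcs=[exact_match])
    return vf.SingleTurnEnv(dataset=dataset, rubric=rubric, **kwargs)
```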
All environments and evaluations used by INTELLECT-3 have been made public on the Environments Hub.
To support this reinforcement learning run, Prime Intellect significantly expanded and upgraded its self-developed Sandboxes infrastructure.
Safely executing untrusted code across thousands of concurrent rollouts requires a container orchestration layer with sub-second startup and millisecond-level execution latency.
Kubernetes provides the underlying primitives, but its conventional architecture cannot meet these latency requirements for training.
Prime Sandboxes bypasses the Kubernetes control plane and communicates with pods directly through Rust, achieving latency close to that of local processes. Sandboxes start within 10 seconds even under large-scale concurrency, and each node can reliably run hundreds of isolated sandboxes.
In Verifiers, sandbox startup is parallelized with the model's first turn of inference, eliminating any perceptible wait before code execution.
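The overlap is easy to picture with a toy asyncio sketch (timings and names invented here): sandbox provisioning and the model's first turn run concurrently, so the sandbox is typically ready by the time there is code to execute.

```python
import asyncio

async def start_sandbox() -> str:
    await asyncio.sleep(0.5)      # stand-in for container spin-up
    return "sandbox-1"

async def first_turn(prompt: str) -> str:
    await asyncio.sleep(0.7)      # stand-in for the model generating its first code block
    return "print(2 + 2)"

async def rollout(prompt: str) -> None:
    # Launch both at once instead of sequentially: the total wait is
    # max(startup, inference) rather than their sum.
    sandbox, code = await asyncio.gather(start_sandbox(), first_turn(prompt))
    print(f"executing in {sandbox}: {code}")

asyncio.run(rollout("Compute 2 + 2 with Python."))
```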
Compute Orchestration
The researchers deployed 512 NVIDIA H200 GPUs across 64 interconnected nodes.
The biggest engineering challenge was maintaining determinism and synchronization in a distributed system where hardware failures are inevitable.
- Provisioning: Ansible for infrastructure-as-code, automatic hardware discovery, and InfiniBand pre-checks to isolate slow or faulty nodes (a toy version of such a check is sketched after this list).
- Scheduling: Slurm with cgroup v2 to ensure jobs exit cleanly without leaving stray processes holding GPU memory.
- Storage: Lustre for high-throughput training I/O, plus NVMe-backed NFS for fast metadata and convenient SSH access.
- Observability: DCGM and Prometheus monitoring to detect unstable nodes and take them offline before problems escalate.
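For flavor, a toy version of the pre-flight check mentioned above might look like the following; the real setup uses Ansible playbooks, and the node names, GPU counts, and thresholds here are invented.

```python
import subprocess

def healthy(node: str) -> bool:
    """Toy node pre-flight: all 8 GPUs visible and InfiniBand ports active."""
    gpus = subprocess.run(
        ["ssh", node, "nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    if gpus.returncode != 0 or len(gpus.stdout.splitlines()) != 8:
        return False
    ib = subprocess.run(["ssh", node, "ibstat"], capture_output=True, text=True)
    return ib.returncode == 0 and "State: Active" in ib.stdout

# Flag unhealthy nodes before the job starts rather than mid-training.
bad = [n for n in (f"node{i:02d}" for i in range(64)) if not healthy(n)]
print("isolate before training:", bad)
```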
Training Recipe
Training INTELLECT-3 consisted of two main stages:
supervised fine-tuning on top of GLM-4.5-Air, followed by large-scale RL training.
Both stages, along with multiple rounds of ablations, ran on the 512 H200 GPUs over a total of two months.
The researchers trained on a diverse set of RL environments covering mathematics, code, science, logic, deep research, and software engineering to strengthen the model's reasoning and agentic capabilities.
All environments have been made public on the Environments Hub.
Standardized, verified implementations of all the benchmarks are provided as well.
Going forward, Prime Intellect's key areas of focus include:
- Scaling agentic RL: The researchers will continue training with greater emphasis on agentic environments, expecting further gains across more tasks.
- Richer RL environments: The Environments Hub already hosts over 500 tasks spanning research, computer use, theorem proving, automation, and specialized domains; INTELLECT-3 used only a small fraction of them. The next step is to bring RL to more, higher-quality community tasks.
- Long-horizon agents: The researchers are enabling the model to manage its own context (pruning context, branching its reasoning, and maintaining a lightweight external memory), making long-horizon behavior genuinely trainable through RL; a toy sketch of such context management follows this list. They also plan to explore environments that explicitly reward long-horizon reasoning.
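Here is a minimal sketch of what such self-managed context could look like, purely illustrative and not INTELLECT-3's actual mechanism: stale tool output is pruned first and summarized into a lightweight external memory, so long trajectories stay within budget.

```python
MAX_CONTEXT_TOKENS = 8000

def approx_tokens(messages: list[dict]) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return sum(len(m["content"]) // 4 for m in messages)

def prune_context(messages: list[dict], memory: list[str]):
    """Drop the oldest tool outputs first, keeping short notes in external memory."""
    while approx_tokens(messages) > MAX_CONTEXT_TOKENS:
        for i, m in enumerate(messages):
            if m["role"] == "tool":
                memory.append(m["content"][:200])  # keep a summary note, not the full dump
                del messages[i]
                break
        else:
            break  # nothing prunable left; the caller must truncate elsewhere
    return messages, memory
```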
Prime Intellect is building an open superintelligence stack, putting the ability to train frontier models in everyone's hands.
INTELLECT-3 also demonstrates that labs outside the largest players can train models competitive with top-tier teams.
Reference: https://www.primeintellect.ai/blog/intellect-3
This article is from the WeChat official account "New Intelligence Yuan". Editor: Yuanyu. Republished by 36Kr with permission.