Stanford and NVIDIA bring reinforcement learning to test time: fine-tuning an open-source model beats top closed-source models for only a few hundred US dollars
There is new progress in the continual learning of large models!
The latest research from institutions such as Stanford and NVIDIA presents a new approach to solving open scientific problems:
Test-Time Training to Discover (TTT-Discover).
Based on the open-source model gpt-oss-120b, it achieves state-of-the-art (SOTA) results in multiple fields, outperforming human experts and top proprietary models.
This approach abandons "test-time scaling," which only steers a frozen model through prompt scheduling.
Instead, at test time, reinforcement learning (RL) is applied to a single specific problem in order to update the model weights.
This "training at test time" lets the model learn from failed attempts in real time and update its parameters, yielding a targeted improvement in its capabilities.
The results span four fields:
- Mathematics: It gives new bounds for the Erdős minimum overlap problem and for an autocorrelation inequality.
- Kernel engineering: On GPU MODE tasks, its kernels run about twice as fast as those of the best human engineers.
- Algorithms: It achieves the highest scores on tasks from past AtCoder competitions.
- Biology: It reaches state-of-the-art performance in denoising single-cell RNA-seq data.
Reinforcement Learning at Test Time
Overall, the core idea of this study is to apply Reinforcement Learning at Test Time, which is mainly manifested in two aspects:
1. Learning Objective
In contrast to traditional reinforcement learning, which maximizes the "average reward" across tasks to achieve generalization, TTT-Discover uses an entropic objective.
It adjusts the weights so that the model favors the attempt with the highest reward, rather than the average over the whole trajectory.
The core goal is to find one great solution, not many average ones.
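To make the contrast concrete, here is a minimal sketch of the two objectives. The entropic form below (a log-sum-exp "soft maximum" with temperature β) is an illustrative assumption about what such an objective can look like, not necessarily the paper's exact loss.

```latex
% Standard RL: maximize the expected (average) reward over sampled attempts a
J_{\mathrm{avg}}(\theta) = \mathbb{E}_{a \sim \pi_\theta}\left[ R(a) \right]

% Entropic objective (illustrative): a soft maximum over rewards. As \beta grows,
% this approaches \max_a R(a), so gradients concentrate on the single best attempt.
J_{\mathrm{ent}}(\theta) = \frac{1}{\beta} \log \mathbb{E}_{a \sim \pi_\theta}\left[ e^{\beta R(a)} \right]
```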
2. Search Subroutine
A PUCT-inspired reuse mechanism is introduced: historical attempts are stored in a buffer, and the state with the highest potential (highest reward) is preferentially expanded while exploration is still taken into account.
This design reflects the nature of scientific discovery: the goal is to find the best possible solution to a specific problem, beyond existing knowledge (the training data), rather than to generalize over patterns in a known data distribution.
With this in mind, the model must keep experimenting, learn from failed attempts on the specific test problem, and discover the data distribution particular to that problem.
This raises a key question: if there is no existing training data, what should the large model train on?
TTT-Discover answers it as follows: the model continually generates actions and receives feedback from the environment, and thousands of attempts (including many failures) are stored in the buffer.
These attempts, produced by the model's own search, form a "private dataset" for the specific problem. This mechanism of generating data during use addresses the lack of training data for problems outside the known distribution (out of distribution, OOD).
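As a rough illustration of what such a buffer might look like, here is a minimal Python sketch. The `Attempt` and `Buffer` classes and their fields (`code`, `reward`, `parent`) are hypothetical; they only capture the idea of accumulating self-generated attempts into a problem-specific dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Attempt:
    """One self-generated attempt: the proposed solution and its evaluated reward."""
    code: str                      # e.g. a generated program or proof sketch
    reward: float                  # score returned by the problem's evaluator
    parent: Optional[int] = None   # index of the attempt this one was refined from

@dataclass
class Buffer:
    """A 'private dataset' for a single problem, grown entirely at test time."""
    attempts: List[Attempt] = field(default_factory=list)

    def add(self, attempt: Attempt) -> int:
        self.attempts.append(attempt)
        return len(self.attempts) - 1

    def best(self) -> Attempt:
        # The final output is simply the highest-reward attempt seen so far.
        return max(self.attempts, key=lambda a: a.reward)
```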
Existing test-time approaches usually rely on test-time search: a frozen large language model (LLM) is prompted to make many attempts, much like a human blindly guessing at a solution.
The problem is that although these methods can store attempts in a buffer and craft new prompts with heuristic rules, the LLM's weights are never updated, so the model's capabilities do not improve.
To enable continual learning, TTT-Discover instead updates the weights through test-time training while searching for the best solution to a single problem.
At the algorithmic level, TTT-Discover generates actions from a policy during both search and learning, and derives the state-transition function from the problem statement.
In each step, TTT-Discover performs the following actions (sketched in code after the list):
- Selection: Selects the most promising existing approach from the buffer as the starting point.
- Generation: Generates a new attempt (code and thought process).
- Evaluation: Evaluates the result of the attempt.
- Update: Updates the model weights so that it prefers the best ideas.
- Repetition: Repeats this process and finally returns the best solution found by the system.
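The sketch below strings these five steps together in Python. The callables passed in (`select_state`, `generate_attempt`, `evaluate`, `update_weights`) are hypothetical placeholders for the PUCT-style selection rule, the LLM call, the problem-specific evaluator, and the RL update; only the loop structure follows the description above.

```python
def ttt_discover(problem, policy, buffer, num_steps,
                 select_state, generate_attempt, evaluate, update_weights):
    """Schematic test-time training loop for a single problem.

    The four callables are hypothetical stand-ins for: a PUCT-style selection
    rule, the LLM generation call, the problem-specific evaluator, and the
    RL weight update (e.g. with an entropic objective).
    """
    for _ in range(num_steps):
        # 1. Selection: pick the most promising prior attempt as the starting point.
        parent = select_state(buffer)

        # 2. Generation: sample a new attempt (reasoning + code) from the current policy.
        attempt = generate_attempt(policy, problem, parent)

        # 3. Evaluation: score the attempt with the problem's verifier or benchmark.
        attempt.reward = evaluate(problem, attempt)
        buffer.add(attempt)

        # 4. Update: move the weights toward the best ideas seen so far.
        policy = update_weights(policy, buffer)

    # 5. Repetition ends; return the best solution found.
    return buffer.best()
```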
Entropic Objective Function and PUCT-Inspired State Selection Mechanism
In practical applications, however, traditional Reinforcement Learning methods still have significant limitations:
On the one hand, the standard objective optimizes average performance and is insensitive to improvements in the best solution, whereas scientific discovery aims for the maximum.
On the other hand, each attempt starts from scratch, which makes the effective time horizon too short and limits the depth of any single trajectory.
Reusing existing solutions effectively extends the time horizon, and discovery problems do not require robustness to a fixed distribution of initial states.
Moreover, in balancing exploration and exploitation, the policy tends to converge on safe, conservative high-reward actions, and naive priority-based state reuse can lose diversity, which stifles potential breakthroughs.
To address these problems, the entropic objective and a PUCT-inspired state-selection mechanism are introduced.
The entropic objective explicitly steers training toward the action with the highest reward rather than the trajectory with the highest average reward.
In addition, a KL penalty term shapes the advantage function, preserving exploration while reinforcing high-advantage actions.
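One way to read this, shown below as a sketch rather than the paper's exact formula, is an advantage shaped by a KL regularizer toward a reference policy; the coefficient λ, the baseline b, and the placement of the KL term are assumptions.

```latex
% Illustrative shaped advantage: the reward advantage over a baseline b, minus a
% KL penalty that keeps the updated policy close to a reference policy and thus
% preserves exploration while high-advantage actions are reinforced.
\tilde{A}(a) = \bigl(R(a) - b\bigr) \;-\; \lambda \,\mathrm{KL}\!\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\right)
```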
When selecting the initial state, an evaluation function inspired by PUCT is used:
Unlike prior work, which uses the average, Q(s) takes the maximum reward over a state's children: what matters is "how well one can do from this state," not its average performance.
The prior term P(s) encodes the intuition that high-reward states are more likely to lead to high-reward successors.
The model can thus strike a better balance between exploitation and exploration: it quickly pushes toward the performance frontier by chasing high rewards, while the exploration bonus keeps it from getting stuck in a local optimum.
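As a sketch of what such a PUCT-style score could look like, the Python snippet below combines a max-based value Q(s), a reward-derived prior P(s), and a visit-count exploration bonus. The constants, the field names assumed on `state`, and the reward-proportional prior are illustrative assumptions, not the paper's exact definition.

```python
import math

def puct_score(state, total_visits: int, c_puct: float = 1.0) -> float:
    """Score used to pick which buffered state to expand next (illustrative).

    Assumed fields on `state`:
      child_rewards: rewards of attempts generated from this state
      reward:        this state's own reward (used here as a crude prior)
      visits:        how many times this state has been expanded
    """
    # Q(s): the best result achieved from this state so far ("how well one can
    # do from here"), rather than the average over its children.
    q = max(state.child_rewards, default=state.reward)

    # P(s): a prior encoding that high-reward states tend to lead to
    # high-reward successors (a simple reward-proportional choice).
    p = state.reward

    # Exploration bonus: favors rarely expanded states.
    u = c_puct * p * math.sqrt(total_visits) / (1 + state.visits)
    return q + u

# Selection: expand the buffered state with the highest score, e.g.
# best_state = max(buffer.attempts, key=lambda s: puct_score(s, total_visits))
```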
Specifically, the model goes through a cycle of "from the known to the unknown" in each training step:
It first selects the most promising starting point from the buffer, generates and evaluates a new attempt, and then immediately updates the weights based on the result, so that the model acts more intelligently in later attempts.
In the experiments, the study fine-tunes the open-source model gpt-oss-120b via the Tinker API, at a cost of only a few hundred US dollars per problem. Remarkably, as shown at the beginning, TTT-Discover's kernels run about twice as fast as the best human solutions.
Overall, TTT-Discover shows that targeted learning at test time, rather than search alone, enables a medium-sized open-source model to excel at complex, out-of-distribution (OOD) scientific problems.
That said, TTT-Discover is currently best suited to settings with continuous (verifiable) rewards; extending it to sparse rewards, binary rewards, and non-verifiable domains is left to future work.
The Core Authors of the Study
The co-first authors of the study are Mert Yuksekgonul and Daniel Koceja.
Mert Yuksekgonul is a computer science PhD student at Stanford University, advised by Carlos Guestrin and James Zou.
Daniel Koceja is a full-time researcher at the Stanford Artificial Intelligence Laboratory (SAIL), working under Yu Sun.
Yu Sun is the corresponding author of this study. He is currently a postdoctoral researcher at Stanford University and a researcher at NVIDIA.
He received his PhD from UC Berkeley, where he was advised by Alexei Efros and Moritz Hardt.
Yu Sun's research focuses on continual learning, especially test-time training, which he has worked on continuously since 2019.
Reference Links
[1] https://github.com/test-time-training/discover
[2] https://www.alphaxiv.org/abs/2601.16175
[3] https://openreview.net/profile?id=~Yu_Sun1