
Stanford and NVIDIA introduce reinforcement learning at test time: for just a few hundred dollars, fine-tuning an open-source model outperforms top closed-source models.

QbitAI | 2026-01-27 17:16
Solve out-of-distribution scientific problems and achieve continual learning for large models!

New progress in continual learning for large models!

The latest research from Stanford, NVIDIA, and other institutions proposes a brand-new approach to solving open scientific problems:

Test-Time Training to Discover (TTT-Discover).

Built on the open-source model gpt-oss-120b, it achieves SOTA results in multiple fields, outperforming human experts and cutting-edge closed-source models.

This method no longer follows the "Test-time Scaling" practice of merely steering a frozen model through prompts.

Instead, during the testing phase, it introduces Reinforcement Learning (RL) to update the model weights for a single specific problem.

This "Test-time Training" enables the model to obtain real-time experience from failed attempts at the problem, update parameters, and achieve targeted evolution of the model's capabilities.

The results span multiple domains:

  • Mathematics: It gives a new bound for the Erdős minimum overlap problem and proposes an autocorrelation inequality.
  • Kernel Engineering: It is about twice as fast as top human engineers on GPUMode.
  • Algorithms: It achieved the highest score on past AtCoder competition problems.
  • Biology: It achieved SOTA on the single-cell RNA-seq denoising task.

Reinforcement Learning at Test Time

Overall, the core idea of this paper is Reinforcement Learning at Test Time, which is mainly reflected in two aspects:

1. Learning Objective

Unlike traditional reinforcement learning, which focuses on improving the "average reward" across all tasks for the sake of generalization, TTT-Discover adopts the Entropic Objective.

It adjusts the weights to favor the highest-reward action, rather than the trajectory as a whole.

The core goal here is to produce one great solution, rather than multiple mediocre ones.
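As a rough illustration (not necessarily the paper's exact formulation), a risk-seeking entropic objective is often written in the log-sum-exp form below, where \pi_\theta is the policy, R(\tau) is the reward of a sampled attempt \tau, and \beta > 0 controls how strongly the maximum is favored:

\[
  J_\beta(\theta) \;=\; \frac{1}{\beta}\,\log\,
  \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, e^{\beta R(\tau)} \,\right]
\]

As \beta \to 0 this recovers the usual average-reward objective \mathbb{E}[R(\tau)]; as \beta grows, it increasingly rewards only the best attempt, matching the "one great solution" goal above.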

2. Search Subroutine

It introduces a PUCT-inspired reuse mechanism: historical attempts are kept in a buffer, the state with the most potential (highest reward) is prioritized for expansion, and exploration is still taken into account.

This design reflects the nature of scientific discovery: the goal is to find, for a specific problem, the best solution beyond existing knowledge (the training data), rather than to find patterns in a known data distribution for the sake of generalization.

On this understanding, the AI needs to keep trying on the specific test, learn from failed attempts, and home in on the data distribution specific to that problem.

A key question arises here: if there is no ready-made training data, what should the large model train on?

TTT-Discover answers this by having the model continuously generate actions, receive environmental feedback, and store thousands of attempts (including a large number of failures) in a buffer.

These self-generated attempts constitute a "private dataset" for the specific problem. This mechanism of producing data on the fly resolves the dilemma of having no data to train on for out-of-distribution (OOD) problems.
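As a concrete (and purely illustrative) picture of this "private dataset", the buffer can be thought of as a list of scored attempts. The class and field names below are hypothetical and not taken from the paper's code.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Attempt:
    solution: str                  # generated code or construction
    reasoning: str                 # the model's thinking trace
    reward: float                  # score from the problem's evaluator
    parent: Optional[int] = None   # index of the attempt this one refined

@dataclass
class AttemptBuffer:
    attempts: list[Attempt] = field(default_factory=list)

    def add(self, attempt: Attempt) -> int:
        # Every attempt is stored, failures included; all of it is training signal.
        self.attempts.append(attempt)
        return len(self.attempts) - 1

    def best(self) -> Attempt:
        # The single best solution found so far for this one problem.
        return max(self.attempts, key=lambda a: a.reward)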

The prevailing approach to this kind of test-time learning is Test-time Search: prompting a frozen large language model (LLM) to make many attempts, much like a person repeatedly guessing at a homework problem.

The problem is that, although these methods store attempts in a buffer and generate new prompts with heuristic rules, the LLM's own weights are never updated, so the model's capabilities do not improve.

Therefore, to achieve continual learning, TTT-Discover updates the weights through test-time training, aiming to find the best solution for a single problem.

At the specific algorithm level, to generate better solutions, both the search and the learning processes of TTT-Discover use the policy to generate actions, and the environment's transition function is induced by the problem description.

In each step, TTT-Discover performs the following operations cyclically:

  • Select: Choose the most promising existing solution from the buffer as the starting point.
  • Generate: Generate new attempts (code and thinking process).
  • Evaluate: Evaluate the results of the attempts.
  • Update: Update the model weights to favor the best ideas.
  • Loop: Repeat this process and finally return the single best solution found by the system.
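Put together, the loop can be sketched roughly as below, reusing the hypothetical AttemptBuffer from earlier; select_state, generate_attempts, evaluate, and update_weights are placeholders for the paper's actual components, passed in as callables rather than implemented here.

def ttt_discover(model, problem, buffer,
                 select_state, generate_attempts, evaluate, update_weights,
                 num_steps: int = 100, samples_per_step: int = 8):
    for _ in range(num_steps):
        # Select: the most promising attempt in the buffer (PUCT-style score).
        start = select_state(buffer)

        # Generate: sample new attempts (code plus reasoning) from that starting point.
        attempts = generate_attempts(model, problem, start, n=samples_per_step)

        # Evaluate: score every attempt with the problem's own evaluator and store it.
        for attempt in attempts:
            attempt.reward = evaluate(problem, attempt)
            buffer.add(attempt)

        # Update: push the weights toward the best ideas from this batch.
        update_weights(model, attempts)

    # Return the single best solution found for this problem.
    return buffer.best()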

Entropic Objective and PUCT Reuse Strategy

However, in specific implementation, traditional reinforcement learning methods still have obvious limitations:

On the one hand, the objective function optimizes average performance and is insensitive to whether a new best solution has been found, whereas scientific discovery is about pushing the best-known result further.

On the other hand, each attempt starts from scratch, resulting in a short effective time horizon and limiting the depth of a single trajectory.

Reusing existing solutions is equivalent to extending the time horizon, and discovery-type problems do not require robustness to a fixed initial-state distribution.

In addition, when balancing exploration and exploitation, the policy can easily converge to conservative but safe high-reward actions, and naive priority sorting when reusing states can erode diversity, suppressing potential breakthroughs.

To address the above problems, the research introduces the Entropic Objective and the PUCT-inspired state selection mechanism.

Through the Entropic Objective, the training objective is explicitly guided to favor the action with the maximum reward, rather than the trajectory with the highest average reward.

At the same time, the research also introduces a KL penalty term to shape the advantage function, maintaining the necessary exploration ability while strengthening high-advantage actions.
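A generic way to write such a KL-regularized update (the paper's exact advantage shaping may differ) is a policy-gradient loss of the form below, where A(\tau) is the advantage shaped toward the maximum-reward attempts, \pi_{\mathrm{ref}} is the frozen reference model, and \lambda controls how far the policy may drift from it:

\[
  \mathcal{L}(\theta) \;=\;
  -\,\mathbb{E}_{\tau \sim \pi_\theta}\!\left[\, A(\tau)\,\log \pi_\theta(\tau) \,\right]
  \;+\; \lambda\,\mathrm{KL}\!\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right)
\]

The first term strengthens high-advantage actions; the KL term keeps the policy close enough to the reference model to preserve exploration.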

In the selection of the initial state, a scoring function inspired by PUCT is adopted:

Unlike previous work, which uses the average value, Q(s) here is the maximum reward among a state's child nodes, capturing "how far one can get from this state" rather than its average performance.

The prior term P(s) encodes the intuition that high-reward states are more likely to give rise to high-reward successor states.
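Concretely, a PUCT-style score typically takes the form below (the paper's exact constants and prior may differ), where N(s) is the visit count of state s, c is an exploration constant, Q(s) is the maximum child reward described above, and P(s) is the prior:

\[
  \mathrm{score}(s) \;=\; Q(s) \;+\; c \cdot P(s)\,
  \frac{\sqrt{\sum_{s'} N(s')}}{1 + N(s)}
\]

The first term exploits states that have already produced strong successors; the second keeps rarely visited but promising states in play.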

In this way, the model strikes a more principled balance between exploitation and exploration: high-reward guidance pushes it quickly toward the performance frontier, while the exploration bonus keeps it from getting stuck in local optima.

Specifically, the model goes through a cycle of "from the known to the unknown" in each training step:

First, select the most promising starting point from the buffer, generate and evaluate new attempts, and then immediately update the weights based on the results, making the model perform more intelligently in subsequent attempts.

In the experiments, the research is based on the open-source model gpt-oss-120b and runs through the Tinker API; the testing cost for a single problem is a few hundred dollars. It is worth noting that, as shown at the beginning, on the kernel-writing task TTT-Discover is about twice as fast as the current best human implementation.

Overall, TTT-Discover shows that introducing targeted learning during the testing phase instead of simply relying on search can enable medium-sized open-source models to demonstrate excellent capabilities in solving complex out-of-distribution (OOD) scientific problems.

However, it should be noted that TTT-Discover is currently mainly applicable to continuous (verifiable) reward scenarios, and further work is needed to extend it to problems in sparse reward, binary reward, and non-verifiable domains.

Introduction to the Core Authors of the Paper

The co-first authors of the paper are Mert Yuksekgonul and Daniel Koceja.

Mert Yuksekgonul is currently pursuing a Ph.D. in the Department of Computer Science at Stanford University, under the supervision of Carlos Guestrin and James Zou.

Daniel Koceja is currently a full-time researcher at the Stanford Artificial Intelligence Laboratory (SAIL), under the guidance of Yu Sun.

Yu Sun is the corresponding author of this paper. He is currently a postdoctoral researcher at Stanford University and also a researcher at NVIDIA.

He graduated with a Ph.D. from UC Berkeley, under the supervision of Alexei Efros and Moritz Hardt.

Yu Sun's research direction is continual learning, focusing on test-time training, and he has been advancing related research since 2019.

Reference Links

[1] https://github.com/test-time-training/discover

[2] https://www.alphaxiv.org/abs/2601.16175

[3] https://openreview.net/profile?id=~Yu_Sun1

This article is from the WeChat official account "QbitAI", author: henry, published by 36Kr with authorization.