Training is accelerated by up to 1.8×, inference overhead is cut by 78%, and accurate question selection efficiently speeds up RL training.
A series of works on fine-tuning with Reinforcement Learning with Verifiable Rewards (RLVR), represented by DeepSeek R1, have significantly improved the reasoning ability of large language models. Behind this wave, however, the cost of reinforcement fine-tuning is staggeringly high.
A large part of this cost comes from inefficiency in the training process: if the model keeps practicing on unsuitable “exam questions,” it learns nothing and enormous compute is wasted. Imagine training a “math genius.” Assigning him thousands of problems is pointless if they are as trivial as “1 + 1”; and if they are so hard that he has no idea where to start, that is equally a waste of time.
Truly efficient training comes from those questions that are “just within reach after a little stretch.”
Previously, academia and industry mainly had two strategies to “select questions” for large models:
“Question-bank strategy” (uniform sampling): randomly draw questions from the question bank for the large model to practice on. A large amount of compute is wasted on questions that provide no effective learning signal. For example, when GRPO encounters a question whose sampled answers are all correct or all wrong, the group-normalized advantage (and hence the gradient) collapses to 0, so the update has no effect and the rollouts are wasted (see the sketch after these two strategies).
“Test-then-learn” (Dynamic Sampling, DS): online sampling methods such as the DS in DAPO were proposed to accelerate training. The large model first “self-tests” on a larger candidate set, and questions of moderate difficulty are then screened out for training. However, the self-testing itself requires a large number of LLM inferences, so the cost remains high. It is like trying to save the genius student's time by making him sit an extra diagnostic exam first.
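To make the wasted-gradient point concrete, here is a minimal sketch (our own code, not the authors') of GRPO-style group-normalized advantages: when every sampled answer to a question is correct, or every one is wrong, the rewards in the group are identical, the advantages collapse to zero, and the question contributes no learning signal.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward by the group's mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

print(group_advantages([1, 0, 1, 0]))  # mixed outcomes -> informative, non-zero advantages
print(group_advantages([1, 1, 1, 1]))  # all correct    -> all zeros, no gradient
print(group_advantages([0, 0, 0, 0]))  # all wrong      -> all zeros, no gradient
```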
Is there a way to precisely select questions of just the right difficulty without the expensive large-model “self-testing”?
MoPPS: Lightweight Prediction, Precise Question Selection
Facing this challenge, Professor Xiangyang Ji's THU-IDM team at Tsinghua University, in collaboration with the CompVis team at the University of Munich, proposed a brand-new framework: Model Predictive Prompt Selection (MoPPS).
This work has been accepted by KDD 2026 and has attracted attention from industry teams including Tongyi Qianwen, Tencent Hunyuan, and Ant Group, as well as citations from well-known academic groups such as Professor Tong Zhang at UIUC, Professor Jun Wang at UCL, and Professor Max Welling at UvA.
The core problem that MoPPS solves is:
Can question difficulty be predicted dynamically, without expensive large-model evaluation, and used to precisely select training data so that the model's reasoning ability improves more efficiently?
△ Dynamic Sampling in the DAPO algorithm relies on large-model self-evaluation, resulting in significant computational overhead. MoPPS uses a lightweight Bayesian model to quickly estimate question difficulty, enabling efficient question screening and accelerating training.
The idea and implementation of MoPPS are very simple:
1. Model questions as “Multi-armed Bandits”
MoPPS regards each question (prompt, τ) as an arm of a multi-armed bandit.
Each question has an unknown “winning probability,” which is the probability (success rate) that the model answers correctly under the current model parameters.
The goal is to preferentially select the questions most valuable for training, that is, questions of moderate difficulty whose success rate is close to 0.5.
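As a rough sketch of this formulation (the class and field names below are ours, not the paper's, and it anticipates the Beta posterior introduced in the next step): each question can be carried around as a bandit arm with Beta(α, β) pseudo-counts, whose posterior mean is the estimated success rate.

```python
from dataclasses import dataclass

@dataclass
class PromptArm:
    """One question (prompt) treated as a bandit arm with an unknown success rate."""
    prompt: str
    alpha: float = 1.0  # pseudo-count of successes (Beta prior)
    beta: float = 1.0   # pseudo-count of failures (Beta prior)

    @property
    def mean_success_rate(self) -> float:
        # Posterior mean of Beta(alpha, beta)
        return self.alpha / (self.alpha + self.beta)

arm = PromptArm("Compute 17 * 24.")
print(arm.mean_success_rate)  # 0.5 under the uninformative Beta(1, 1) prior
```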
2. Lightweight Bayesian Difficulty Prediction (Bayesian Inference)
MoPPS assigns a Beta distribution to each question to estimate its success rate:
Without prior knowledge, a question's success-rate distribution is initialized to the uniform distribution Beta(1, 1). If reliable prior knowledge is available, the prior can be set accordingly to further improve results.
As training progresses, the large model produces binary success/failure feedback, which is converted directly into updates of the Beta distribution:
α′ = α + number of successes, β′ = β + number of failures
This recursive update is extremely cheap to compute, and the difficulty estimates become increasingly accurate as training progresses. MoPPS also introduces a time-decay factor λ to adapt to the non-stationary setting in which the model's ability keeps changing:
α′ = λ·α + (1 − λ)·α⁰ + number of successes, β′ = λ·β + (1 − λ)·β⁰ + number of failures
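Here is a minimal sketch of this update (our own code, mirroring the two formulas above; the values of λ, α⁰, and β⁰ are illustrative defaults, not taken from the paper). After each training step, the binary outcomes of a prompt's rollouts are folded into its Beta posterior, with λ discounting stale evidence as the policy improves.

```python
def update_posterior(alpha, beta, n_success, n_failure,
                     lam=0.9, alpha0=1.0, beta0=1.0):
    """Time-decayed Beta update:
    alpha' = lam * alpha + (1 - lam) * alpha0 + number of successes
    beta'  = lam * beta  + (1 - lam) * beta0  + number of failures
    """
    new_alpha = lam * alpha + (1 - lam) * alpha0 + n_success
    new_beta = lam * beta + (1 - lam) * beta0 + n_failure
    return new_alpha, new_beta

# Example: a prompt answered correctly in 3 of 8 rollouts
alpha, beta = update_posterior(1.0, 1.0, n_success=3, n_failure=5)
print(alpha, beta)  # 4.0 6.0 -> posterior mean success rate 0.4
```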
3. Active Question Screening (Active Selection with Thompson Sampling)
Instead of relying on real LLM self-testing, MoPPS predicts difficulty by sampling directly from the Beta distributions:
Use Thompson Sampling: Draw a difficulty estimate for each candidate question to balance exploration and exploitation.
Select from the candidate set the questions closest to the target difficulty γ* ≈ 0.5 (that is, the “just-right” questions).
Use only the selected questions for RL training; the real feedback from training then updates the Beta distributions in turn, closing the loop (see the sketch below).
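Putting the loop together, a minimal sketch (assuming the hypothetical `PromptArm` and `update_posterior` helpers above; the batch size, target difficulty, and rollout function are placeholders, not the authors' implementation):

```python
import numpy as np

def select_prompts(arms, batch_size=8, target=0.5, rng=np.random.default_rng(0)):
    """Thompson Sampling: draw one success-rate sample per arm,
    then keep the prompts whose samples are closest to the target difficulty."""
    samples = np.array([rng.beta(a.alpha, a.beta) for a in arms])
    order = np.argsort(np.abs(samples - target))
    return [arms[i] for i in order[:batch_size]]

# One closed-loop iteration (rollout/reward collection is a placeholder):
# selected = select_prompts(arms, batch_size=8)
# for arm in selected:
#     n_success, n_failure = run_rollouts_and_score(arm.prompt)  # real LLM feedback
#     arm.alpha, arm.beta = update_posterior(arm.alpha, arm.beta,
#                                            n_success, n_failure)
```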
This design has three prominent advantages:
Extremely low overhead: Prediction is based on Beta-distribution sampling, and no additional LLM inferences are required.
Dynamic adaptation: Online updates make the difficulty estimate more and more accurate.
Balance between exploration and exploitation: the randomness of Thompson Sampling lets MoPPS not only exploit questions already known to be suitable but also explore potentially valuable new ones.
MoPPS thus proposes a new prediction-sampling-optimization paradigm:
△ Figure 1: Overview of the MoPPS framework and comparison with DS.
Amazing Results: 1.8× Speed-up, 78% Reduction in Inference Cost
MoPPS shows significant advantages in three major reasoning tasks: mathematics, logic, and visual geometry:
Significant reduction in computing power cost.
Compared with “test-then-learn” methods such as DS, which require a large number of additional inferences, MoPPS needs up to 78.46% fewer rollouts to reach the same performance!
△ Figure 2: In the Countdown task, MoPPS outperforms the uniform selection strategy in both training efficiency and performance, while significantly reducing the rollout overhead compared with the DS method.
Remarkable improvement in training efficiency.
Compared with the traditional “question-bank strategy” (uniform sampling), MoPPS consistently selects the questions most critical for the model, greatly accelerating the training process. It achieves a training speed-up of 1.6× to 1.8×, along with better final performance.
△ Figure 3: Training curves of MoPPS and baseline methods in three types of reasoning tasks and for models of different scales.
Accurate and reliable difficulty prediction.
Experiments show a very high Spearman rank correlation between the question difficulty predicted by MoPPS and the true difficulty, demonstrating that its predictions are effective and reliable.
△ Figure 4: In all experiments, the correlation quickly climbs and stabilizes at a high level above 0.5 in the early stage of training, proving the accuracy of MoPPS prediction.
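For readers who want to run this kind of sanity check themselves, a small sketch (our own code with made-up numbers): compare the posterior-mean difficulty estimates against empirical success rates measured by actually rolling out each prompt, using SciPy's Spearman rank correlation.

```python
from scipy.stats import spearmanr

predicted = [0.20, 0.45, 0.70, 0.90, 0.55]  # posterior-mean success rates (illustrative)
measured = [0.15, 0.50, 0.65, 0.95, 0.60]   # empirical success rates from real rollouts

rho, pvalue = spearmanr(predicted, measured)
print(f"Spearman rank correlation: {rho:.2f}")  # values near 1.0 indicate reliable prediction
```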
Strong applicability and extensibility of the method.
1. Compatible with multiple reinforcement learning algorithms:
As a “data filter,” MoPPS can be used plug-and-play and is compatible with multiple RL algorithms such as PPO, GRPO, and Reinforce++.
2. Supports different sampling strategies and prior information:
MoPPS uses the Top-B selection strategy by default, but it can also be extended to threshold sampling (keeping questions whose predicted difficulty falls within a given range). In addition, it can incorporate prior knowledge to further accelerate early-stage training (see the sketch below).
△ (a) MoPPS can use different screening strategies and can combine prior knowledge to improve results. (b) Online question screening is more effective than offline screening.
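As a sketch of how the two screening strategies might differ in code (building on the hypothetical `select_prompts` and `PromptArm` helpers above; the threshold band and prior pseudo-counts are illustrative): Top-B keeps a fixed number of prompts closest to the target difficulty, whereas threshold sampling keeps every prompt whose sampled success rate falls inside a band, and prior knowledge can seed the Beta parameters before training starts.

```python
import numpy as np

def select_by_threshold(arms, low=0.3, high=0.7, rng=np.random.default_rng(0)):
    """Threshold variant: keep prompts whose Thompson sample lies in [low, high]."""
    samples = np.array([rng.beta(a.alpha, a.beta) for a in arms])
    return [arm for arm, s in zip(arms, samples) if low <= s <= high]

# Prior knowledge (e.g., human difficulty labels) can seed alpha/beta before training:
# easy_prompt = PromptArm("1 + 1 = ?", alpha=9.0, beta=1.0)  # prior success rate ~0.9
# hard_prompt = PromptArm("Prove ...", alpha=1.0, beta=9.0)  # prior success rate ~0.1
```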
Conclusion
This research, a collaboration between the THU-IDM team at Tsinghua University and the CompVis team at the University of Munich, provides a powerful tool for cutting costs and improving efficiency in large-model reinforcement fine-tuning.
The core contribution of the MoPPS framework is a brand-new “predict-then-optimize” paradigm. Going forward, MoPPS is expected to be applied to large-scale reinforcement-learning post-training of large models.
Paper Title: Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models?
Paper Link: https://arxiv.org/abs/2507.04632
Code Link: https://github.com/thu-rllab/MoPPS
Team Homepage: https://www.thuidm.com
This article is from the WeChat official account “QbitAI”, author: the THU-IDM team of Tsinghua University. It is published by 36Kr with authorization.