Mixing math, code, and logic data to boost AI's multi-domain reinforcement learning in one go.
In recent years, large AI models have made significant breakthroughs in reasoning across mathematical computation, logical reasoning, and code generation. In particular, with the emergence of advanced models such as DeepSeek-R1, Reinforcement Learning with Verifier Reward (RLVR) has demonstrated strong potential for improving performance.
However, existing RLVR research mostly focuses on single-domain optimization and lacks a systematic exploration of cross-domain knowledge transfer and collaborative reasoning, i.e., how to get a model to reason well across multiple domains at once.
The OpenDataLab team at Shanghai AI Lab conducted large-scale experiments to dissect the complex mechanisms of RLVR in multi-domain reasoning, yielding key findings across multiple dimensions for building stronger, more robust AI reasoning models.
The team built a multi-domain evaluation framework covering three major data categories: mathematics (Math), programming (Code), and logical puzzles (Puzzle), and designed customized reward strategies for the different training data.
The experiments were based on the Qwen2.5-7B model series. After joint training on data from all three domains (math, code, and puzzles), the model's overall average performance reached 56.57, significantly better than any two-domain combination.
Through these large-scale experiments, the research team arrived at the following key findings:
Mutual support between Puzzle and Math data: Logical reasoning and mathematical abilities complement each other, significantly improving the model's overall performance.
Cross-domain hybrid effect of Code reasoning: Instruct models with strong instruction-following abilities generalize code capabilities to other domains well, whereas Base models do not.
Cross-domain data improves robustness: Diverse data usually enhances model capabilities or yields more balanced performance, but more sophisticated designs are needed to address potential conflicts among the Math, Code, and Puzzle domains.
SFT can improve the effect of reinforcement learning: Adding an SFT stage before reinforcement learning can significantly improve model performance.
Template consistency is crucial: A mismatch between training and evaluation templates can cause a significant drop in performance, indicating that the generalization robustness of RLVR under domain-specific training remains a challenge.
Benefits of Policy Refresh: Regularly updating the reference model and optimizer state during curriculum learning improves model stability and performance.
Reward design needs to match task difficulty: Adjusting the reward scheme to the model's performance on the training data improves learning efficiency.
RLVR is sensitive to language: Models trained in Chinese underperform models trained in English, showing a clear performance gap.
Research process and performance
Domain division and data construction: The "cornerstone" of multi-domain reasoning
The OpenDataLab team at Shanghai AI Lab built a multi-domain evaluation framework covering three major data categories: mathematics (Math), programming (Code), and logical puzzles (Puzzle), and designed customized reward strategies for the different training data.
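To make the idea of per-domain reward strategies concrete, here is a minimal sketch of how verifiable rewards might be routed by a domain tag. The helper logic, field names (domain, reference, tests, solution), and the specific verification rules are illustrative assumptions, not the paper's actual implementation; the released training code is the authoritative reference.

```python
# Hypothetical sketch: one verifier per training domain, dispatched by a
# "domain" tag on each sample. Not the paper's actual code.
import re
from typing import Callable, Dict


def verify_math(response: str, sample: dict) -> float:
    """Exact-match check on the final \\boxed{...} answer (1.0 or 0.0)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return 1.0 if match and match.group(1).strip() == sample["reference"] else 0.0


def verify_code(response: str, sample: dict) -> float:
    """Fraction of assert-style test strings that pass for the generated code."""
    namespace: dict = {}
    try:
        exec(response, namespace)  # execute the generated program
    except Exception:
        return 0.0
    passed = 0
    for test in sample["tests"]:
        try:
            exec(test, namespace)
            passed += 1
        except Exception:
            pass
    return passed / max(len(sample["tests"]), 1)


def verify_puzzle(response: str, sample: dict) -> float:
    """All-or-nothing check that every entity is labelled correctly (KK-style)."""
    labels = dict(re.findall(r"(\w+) is a (knight|knave)", response.lower()))
    return 1.0 if labels == sample["solution"] else 0.0


REWARD_FNS: Dict[str, Callable[[str, dict], float]] = {
    "math": verify_math,
    "code": verify_code,
    "puzzle": verify_puzzle,
}


def compute_reward(response: str, sample: dict) -> float:
    """Dispatch to the verifier that matches the sample's domain tag."""
    return REWARD_FNS[sample["domain"]](response, sample)
```

The key design point is simply that each training sample carries a domain tag, so a single RLVR trainer can score math, code, and puzzle responses with the verifier appropriate to each.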
The experiments were based on the Qwen2.5-7B model series and explored the following aspects:
Model performance and generalization: focusing on single-domain optimization, cross-domain generalization, and the mutual influence among data from different domains.
Effectiveness of training methods and strategies: evaluating the role of templates in RLVR and the effectiveness of curriculum learning strategies.
Model optimization factors: studying the design principles of different reward mechanisms and the impact of the training language on model performance.
Through systematic experiments, the research revealed the inner workings of Reinforcement Learning with Verifier Reward (RLVR) in multi-domain reasoning, offering a new perspective for optimizing the reasoning ability of large models.
Single-domain training: Fierce competition within each domain
In single-domain training, the model shows significant performance gains on in-domain tasks, but the cross-domain effects are complex, with both synergy and mutual weakening.
Mathematics domain: RLVR improves mathematical performance, but cross-domain effects are complex
After targeted training, the Base model's accuracy on the CountDown task increased by about 75 percentage points. Math training also effectively improves the model's ability to solve logical puzzles, raising the average score. However, deeply optimizing mathematical ability can hurt code tasks, suggesting a trade-off between skills in different domains.
Code domain: Instruction fine-tuning helps programming and shows stronger cross-domain generalization
Code training improves the model's performance on programming tasks; in particular, the SFT-based Instruct model shows a higher performance ceiling. After code training, however, the Base model often degrades on most out-of-domain tasks, whereas the Instruct model shows stronger cross-domain generalization and can maintain or even improve performance on most out-of-domain tasks.
Puzzle domain: Strong logical reasoning, and the training partially transfers to mathematics
On the KK dataset, the Instruct model's accuracy reaches 99.14, and on the Zebra task the score rises to 36.20. The gains from KK puzzle training also transfer to mathematical tasks; on some math benchmarks the Base model even approaches or exceeds the Instruct model, further demonstrating the potential of cross-domain transfer.
Cross-domain interaction: Exploring collaboration and conflict
Two-domain combinations: Collaboration and trade-offs
- Combinations with clear synergy: The Math + Puzzle combination lifts Math-task performance to 49.72 (above the 47.48 of Math-only training), confirming the effectiveness of cross-domain knowledge transfer; the Code task also improves when Puzzle or Math data is added, showing the potential advantage of combined training.
- Combinations that need care: The Puzzle task performs worse under every multi-domain setup than under single-domain training, highlighting its highly specialized nature; notably, the Math + Puzzle combination significantly reduces Code-task performance, whereas the Puzzle + Code combination yields the largest average gain, 19.39.
Three-domain combination: Balance and robustness
Next, the team combined data from all three domains; the results below show that multi-domain joint training delivers better overall performance and robustness (a sketch of assembling such a domain-tagged data mixture follows the list):
- Three-domain joint training achieves the best overall performance: Training jointly on math, code, and puzzle data raises the model's overall average to 56.57, significantly better than any two-domain combination.
- Data diversity and marginal returns: Increasing the diversity of the training data (the number of combined domains) does improve overall performance, but with diminishing marginal returns.
- Preventing performance collapse, achieving balanced development: Unlike some two-domain combinations (such as Math + Puzzle, which can sharply degrade the Code task), three-domain joint training avoids a "collapse" on any specific task and keeps the model competitive across all tasks.
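As referenced above, here is a minimal sketch of what assembling such a joint training set could look like: samples from the three domains are interleaved and tagged with their source so the verifier reward can be routed later. The uniform mixing ratio and the toy data records are assumptions for illustration, not the paper's actual data pipeline.

```python
# Hypothetical sketch of building a domain-tagged, multi-domain training mix.
import random


def mix_domains(datasets: dict, total: int, seed: int = 0) -> list:
    """Draw `total` samples, tagging each one with its source domain."""
    rng = random.Random(seed)
    domains = list(datasets)
    mixed = []
    for _ in range(total):
        domain = rng.choice(domains)                 # uniform over domains (an assumption)
        sample = dict(rng.choice(datasets[domain]))  # shallow copy, leave the source untouched
        sample["domain"] = domain                    # tag used later to route the verifier reward
        mixed.append(sample)
    return mixed


# Toy placeholders standing in for the real Math / Code / Puzzle datasets.
math_data = [{"prompt": "What is 17 + 25?", "reference": "42"}]
code_data = [{"prompt": "Write add(a, b).", "tests": ["assert add(1, 2) == 3"]}]
puzzle_data = [{"prompt": "Alice says Bob is a knave ...", "solution": {"alice": "knight"}}]

train_set = mix_domains(
    {"math": math_data, "code": code_data, "puzzle": puzzle_data}, total=9
)
```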
Template consistency: The key to optimal performance
In RL training, an often-overlooked problem is a mismatch between the training and testing templates, which can cause a significant drop in model performance. The research team tested different templates (R1, Qwen, Base) and showed how important template consistency is.
- Mismatched templates can severely drag down performance: For example, with a mismatched template the Base model's CountDown accuracy drops from 19.36 to 0 and its MBPP score drops from 51.80 to 3.00, while the Instruct model drops from 73.20 to 1.80 on MATH500.
- Consistent templates usually deliver the best performance: Under the R1 template, the Base model's average reaches 47.84 and the Instruct model's reaches 54.56, far above the mismatched settings. This underscores the necessity of template consistency: the generalization robustness of RLVR under domain-specific training still faces challenges. The sketch below illustrates what these template differences look like in practice.
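For readers unfamiliar with prompt templates, the following sketch shows the same question wrapped three ways. The exact template strings used in the paper are not reproduced here; the R1-style and Qwen ChatML-style prompts below are paraphrased approximations for illustration only.

```python
# Illustrative only: what "template mismatch" means for the same question.
QUESTION = "Using the numbers 3, 7, 8, 25, reach the target 30."

# R1-style template (paraphrased): the model is asked to reason inside <think> first.
r1_prompt = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer.\n"
    f"User: {QUESTION}\n"
    "Assistant: <think>"
)

# Qwen chat template: ChatML-style role markers normally added by the tokenizer.
qwen_prompt = (
    "<|im_start|>user\n"
    f"{QUESTION}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Base (no) template: the raw question with no special markers.
base_prompt = QUESTION

# If a policy is RLVR-trained on r1_prompt-style inputs but evaluated with
# base_prompt, the learned answer format is never triggered. This is the kind
# of mismatch the article reports collapsing performance (e.g., CountDown
# dropping from 19.36 to 0 for the Base model).
```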
Curriculum learning: Conquering the task step by step, from easy to hard
Curriculum learning has proven effective in SFT, but its application in RLVR has not been fully explored. The research team experimented on the KK dataset in the Puzzle domain, set a difficulty gradient based on the number of sub-problems (from 3PPL to 8PPL), and designed a "Policy Refresh" strategy: after each difficulty stage, the reference model is updated and the optimizer state is reset.
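The loop below is a minimal, PyTorch-flavored sketch of that idea, assuming a generic `rlvr_train_stage` routine (hypothetical, passed in as a callable) that runs RLVR on one difficulty level; it is not the paper's actual training code.

```python
# Sketch of curriculum RLVR with "Policy Refresh": after each difficulty stage,
# the KL reference model is re-anchored to the current policy and the optimizer
# state is reset. Illustrative assumptions throughout.
import copy

import torch


def curriculum_with_refresh(policy, stages, rlvr_train_stage, lr=1e-6):
    """stages: list of datasets ordered from easy (3PPL) to hard (8PPL)."""
    reference = copy.deepcopy(policy)  # reference model for the KL term
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

    for stage_data in stages:
        # Train on the current difficulty level against the current reference.
        rlvr_train_stage(policy, reference, optimizer, stage_data)

        # Policy Refresh: re-anchor the reference to the freshly trained policy
        # and start the next stage with a clean optimizer state.
        reference = copy.deepcopy(policy)
        optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)

    return policy
```

Intuitively, re-anchoring the reference after each stage keeps the KL constraint from pulling the policy back toward its easier-stage behavior, which is consistent with the faster convergence reported below.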
The experiments showed that:
- Curriculum learning raises the performance ceiling: Standard curriculum learning reaches a final accuracy of 97.29, well above the 94.29 of mixed training. The approach helps the model gradually master complex dependencies and improves generalization.
- Policy Refresh accelerates convergence: With the refresh strategy, the model reaches 97.43 accuracy at the 6PPL stage, and the final result is nearly perfect (99.71), even exceeding the Instruct model's mixed-training result (99.14).
Reward design: Tailored to the task
Reward design is at the core of reinforcement learning. The research team tested four strategies on the KK and LPB datasets: (1) a binary reward, which scores only when every answer is correct; (2) a partial reward, which scores by the proportion of correct answers; (3) a format reward, which uses tags to guide the reasoning; and (4) a rescaled reward, which maps scores to [-1, 1] and penalizes errors. The different designs shape very different learning signals for the model.
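A hedged sketch of these four reward shapes, written for a KK-style puzzle where the model must label each character, is shown below. The label-extraction regex, the -1.0 penalty for a missing format, and the other concrete values are simplifying assumptions for illustration, not the exact verifiers used in the study.

```python
# Illustrative versions of the four reward schemes compared in the article.
import re


def count_correct(response: str, solution: dict) -> tuple:
    """Return (number of correct labels, total labels) for a KK-style answer."""
    predicted = dict(re.findall(r"(\w+) is a (knight|knave)", response.lower()))
    correct = sum(1 for name, role in solution.items() if predicted.get(name) == role)
    return correct, len(solution)


def binary_reward(response, solution):
    """R1: 1 only when every label is correct, otherwise 0 (sparse signal)."""
    correct, total = count_correct(response, solution)
    return 1.0 if correct == total else 0.0


def partial_reward(response, solution):
    """R2: fraction of labels that are correct (dense, but easy to plateau)."""
    correct, total = count_correct(response, solution)
    return correct / max(total, 1)


def format_reward(response, solution):
    """R3: partial credit gated on producing the expected <think>/<answer> tags."""
    has_format = "<think>" in response and "<answer>" in response
    return partial_reward(response, solution) if has_format else -1.0


def rescaled_reward(response, solution):
    """R4: partial credit rescaled to [-1, 1] so wrong labels are punished."""
    return 2.0 * partial_reward(response, solution) - 1.0
```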
The team found that on the simpler KK task, the binary reward (R1) performed best thanks to its straightforward signal, but on the more complex LPB task R1 collapsed during training because the signal was too sparse. The partial reward (R2) pays off quickly early in LPB training but struggles to hold its advantage over the long run, while the format reward (R3) and rescaled reward (R4) lead on LPB by stabilizing reasoning and amplifying behavioral differences; on KK, however, these more complex designs become a burden. The results show that dataset sparsity and task difficulty are the key factors that determine whether an RLVR reward scheme succeeds or fails.
Looking ahead, the team calls for extending the data taxonomy to new fields such as Science and General Reasoning and for exploring how well other models, such as Llama and DeepSeek, adapt. RLVR has proven effective across multiple domains, but regardless of the training method, data remains the cornerstone of model capability. The team also hopes future research will explore the impact of data on RLVR more deeply.
Paper address: https://arxiv.org/abs/2507.17512
Training code: https://github.com/Leey21/A-Data-Centric-Study