A new idea for RL: Fudan University enhances the general reasoning of VLMs with games, with performance rivaling geometric data
The NLP Laboratory of Fudan University developed Game-RL, which exploits the rich visual elements and clear rules of games to generate verifiable multimodal reasoning data and strengthens the reasoning ability of vision-language models through reinforcement training. The team also proposed the Code2Logic method to systematically synthesize game-task data and built the GameQA dataset, verifying the advantages of game data for complex reasoning training.
Existing work has used RL to enhance the reasoning ability of vision-language models (VLMs), but the task scenarios are typically limited to geometric or chart reasoning. Such a narrow domain restricts what VLMs can explore and learn.
How can the RL training domain of VLMs be expanded?
Video games offer rich visual elements and clear, verifiable rules, making them an ideal data source for multimodal reasoning.
The research team from the NLP Laboratory of Fudan University therefore proposed Game-RL: reinforcement training of VLMs on verifiable multimodal game tasks.
- Paper link: https://arxiv.org/abs/2505.13886
- Code repository: https://github.com/tongjingqi/Game-RL
- Data and models: https://huggingface.co/Code2Logic
To obtain training data (see the examples in Figure 1), the researchers also proposed the novel Code2Logic method, which systematically synthesizes data from game code.
Figure 1: Representative games from each category in the GameQA dataset: 3D reconstruction, Tangram (variant), Sudoku, and Sokoban. Each game shows two visual question-answering examples, each comprising the current game-state image, the corresponding question, and the step-by-step reasoning and answer.
The Code2Logic method synthesizes verifiable multimodal game-task data from game code.
As shown in Figure 2, a powerful LLM is used to generate game code, design tasks and their QA templates, and construct the data-engine code; executing that code then generates data automatically.
Figure 2: The Code2Logic method uses an LLM to convert game code into reasoning data through three core steps. Step 1: game code construction; Step 2: game task and QA template design; Step 3: data engine construction. The first two steps feed an automated program, and executing its code generates data in batches.
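To make the pipeline concrete, here is a minimal sketch of what a Code2Logic-style data engine could look like, using a hypothetical Latin-square (Sudoku-like) task. All function and field names are illustrative and do not come from the released repository.

```python
import random

# Minimal sketch of a Code2Logic-style data engine (hypothetical names,
# not the released code). The engine pairs a programmatically generated
# game state with a templated question, a step-by-step reasoning trace,
# and a verifiable ground-truth answer.

QUESTION_TEMPLATE = "In the grid below, which value must go in row {r}, column {c}?"

def generate_latin_square(size=4):
    """Generate a tiny Latin square by cyclically shifting a shuffled base row."""
    base = list(range(1, size + 1))
    random.shuffle(base)
    return [base[i:] + base[:i] for i in range(size)]

def make_qa_sample(size=4):
    grid = generate_latin_square(size)
    r, c = random.randrange(size), random.randrange(size)
    answer = grid[r][c]
    grid[r][c] = 0  # blank out the queried cell

    row_vals = [v for v in grid[r] if v]
    col_vals = [grid[i][c] for i in range(size) if grid[i][c]]
    reasoning = (
        f"Row {r + 1} already contains {row_vals} and column {c + 1} "
        f"contains {col_vals}; the only value in 1..{size} missing from "
        f"both is {answer}."
    )
    return {
        "state": grid,                     # would be rendered to an image
        "question": QUESTION_TEMPLATE.format(r=r + 1, c=c + 1),
        "reasoning": reasoning,
        "answer": answer,                  # re-derivable by the engine, hence verifiable
    }

if __name__ == "__main__":
    print(make_qa_sample())
```

Because the state is generated programmatically, the ground-truth answer can always be re-derived by the engine, which is what makes the reward verifiable during reinforcement training.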
GameQA: a dataset rich in game tasks
The GameQA dataset was constructed with the Code2Logic method. This verifiable multimodal game data can be used both to train and to evaluate the reasoning abilities of VLMs.
GameQA spans 4 cognitive-ability categories, 30 games (shown in Figure 3), 158 reasoning tasks, and 140,000 question-answer pairs.
Difficulty levels: each task is graded into three difficulty levels, and each sample into three levels of visual-input complexity.
Figure 3: The 30 games in GameQA fall into 4 cognitive-ability categories covering 3D spatial reasoning, pattern recognition and matching, multi-step reasoning, and strategic planning. 20 in-domain games are used for training and testing, while the remaining 10 out-of-domain games are held out of training to test generalization to unseen game scenarios.
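For concreteness, a single GameQA sample might be organized like the record below; the field names are guesses for illustration, not the dataset's actual schema.

```python
# Illustrative GameQA-style record (field names are guesses, not the
# dataset's actual schema). Each sample ties a game-state image to a
# question, multi-step reasoning, a verifiable answer, and the metadata
# behind the difficulty/complexity splits described above.
sample = {
    "category": "multi-step reasoning",   # one of the 4 cognitive categories
    "game": "Sokoban",                    # one of the 30 games
    "task": "next-move prediction",       # one of the 158 reasoning tasks
    "image": "states/sokoban_01234.png",  # rendered game-state picture
    "question": "Which single move pushes the box onto the target?",
    "reasoning": "The box sits left of the target and the player stands "
                 "left of the box, so pushing Right moves it onto the target.",
    "answer": "Right",
    "task_difficulty": 2,                 # three task difficulty levels
    "visual_complexity": 1,               # three visual-complexity levels
}
```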
Core finding: Game-RL can enhance the general reasoning of VLMs
After GRPO training on GameQA, all 4 open-source VLMs improved on 7 entirely out-of-domain general vision-language reasoning benchmarks (Qwen2.5-VL-7B gained 2.33% on average), demonstrating cross-domain generalization, as shown in Table 1.
Table 1: Evaluation results on general vision-language reasoning benchmarks
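Since every GameQA answer is programmatically verifiable, the RL reward can be a simple exact-match check, and GRPO scores each sampled response relative to the group drawn for the same prompt. Below is a simplified sketch of that reward and advantage computation; it is an illustration, and the paper's exact training configuration may differ.

```python
import statistics

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the model's answer matches the game
    engine's ground truth, else 0.0."""
    return 1.0 if model_answer.strip().lower() == ground_truth.strip().lower() else 0.0

def grpo_advantages(rewards: list[float]) -> list[float]:
    """GRPO normalizes each response's reward against the group of
    responses sampled for the same prompt (no learned value model)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: 4 sampled answers for one Sokoban question whose answer is "Right".
group = ["Right", "Up", "right", "Left"]
rewards = [verifiable_reward(a, "Right") for a in group]
print(grpo_advantages(rewards))  # correct answers receive positive advantage
```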
Training effect: GameQA is comparable to geometric datasets
The research team ran comparative training with GameQA against geometric and chart reasoning datasets and found GameQA comparable to them.
As shown in Table 2, despite a smaller training set and mismatched domains, the model trained on GameQA performs competitively on general benchmarks. Moreover, on MathVista and MathVerse, benchmarks centered on geometric and function reasoning, GameQA training can even match training on the more "relevant" geometric reasoning data.
This suggests that the cognitive diversity and reasoning complexity of games are general and transferable.
Table 2: Comparative training: 5K GameQA samples vs. 8K MAVIS (geometric and function visual reasoning), 8K Multimodal-Open-R1 (mainly geometric reasoning), and 8K MultiMath (comprehensive multimodal math reasoning). The model trained on GameQA is broadly competitive. The experiment also shows that mixed training (adding GameQA data to MultiMath) yields further gains.
Scaling Effect: The influence of training data volume and the number of games
Scaling Effect of data volume: as the GameQA training set grows to 20K samples, model performance on general reasoning benchmarks shows broadly continuous improvement, as shown in Figure 4.
Figure 4: Scaling Effect of training data volume
Scaling Effect of the number of games: as more game types are included in training, out-of-domain generalization strengthens, as shown in Figure 5.
Figure 5: Training on tasks from 20 game types yields larger gains on out-of-domain general benchmarks than configurations using 4 or 10 game types.
In-depth analysis: Where does the model's ability improve after Game-RL?
To better understand how Game-RL improves the reasoning of VLMs, the research team randomly sampled cases for detailed manual analysis. The results show that after Game-RL, the model improves in both visual perception and text reasoning, as shown in Figure 6.
Figure 6: Manual qualitative analysis shows that both visual perception and text reasoning improve. The two pie charts (top) show the changes in visual perception and text reasoning on out-of-domain general benchmarks; below them is an example of improved visual perception.
Conclusion
The research proposed Game-RL and the Code2Logic game-data synthesis method, constructed the GameQA dataset, and expanded the RL training domain of VLMs to game scenarios.
Experiments verified that Game-RL can enhance the general reasoning of VLMs.
They further reveal that game scenarios can supply multimodal, controllable, and verifiable data of great value.
Reference materials:
https://arxiv.org/abs/2505.13886
This article is from the WeChat public account "New Intelligence Yuan", author: LRST, published by 36Kr with authorization.