The Chinese University of Hong Kong collaborates with Meituan to open-source a "vision reasoning generalist" that can handle 10 types of image and video tasks all at once.
The "generalist" in the field of vision models is here, sweeping across 31 mainstream benchmarks and excelling in 10 core tasks!
MMLab at The Chinese University of Hong Kong and Meituan's research team have open-sourced OneThinker, a unified multi-modal visual-reasoning generalist model trained with reinforcement learning (RL) that covers ten core visual tasks across both image and video modalities.
On 31 mainstream visual benchmarks, OneThinker performs strongly. Multi-task training yields mutual improvement across tasks, and the model can reason sensibly about tasks it has never seen, an early demonstration of a generalist model's generalization ability.
Works such as Vision-R1, Video-R1, and VLM-R1 have achieved notable results on tasks like image question answering, video understanding, and object detection.
However, most of these RL models share a limitation: each handles only a single modality or a single task. With no connection between modalities and tasks, their reasoning ability is fragmented and hard to generalize.
Let's see how OneThinker does it.
From "specialist models" to "generalist systems"
Visual data in the real world is complex and diverse, often containing both static image and dynamic video information, and the tasks themselves are equally varied: question answering, grounding, segmentation, tracking, and more.
Against this backdrop, the traditional "single-task, single-modality" RL reasoning-model architecture has two fundamental problems:
- Inability to model complex real-world scenarios in a unified way
Real applications often require understanding image and video content simultaneously and coordinating multiple task types, which specialist models can hardly do.
- Knowledge isolation and limited transfer
The models are independent of one another, lacking a knowledge-sharing mechanism, which limits how reasoning ability generalizes and transfers between tasks.
To solve this, the research team proposed OneThinker, a "generalist thinking model" able to understand and reason about different modalities and tasks in a unified way.
To give OneThinker this ability, the team worked on two fronts: building a unified data system and improving the multi-task training method.
Construction of unified multi-modal task data
Building a model with general visual-reasoning ability first requires solving two problems: insufficient data coverage and task fragmentation.
To that end, the research team built dedicated datasets for the model's SFT cold start and reinforcement-learning training:
- OneThinker-600k
Covers both image and video modalities and includes ten core visual tasks such as image question answering, video question answering, spatio-temporal grounding, segmentation, and tracking. It is the main training data for the reinforcement-learning stage.
- OneThinker-SFT-340k
High-quality chain-of-thought samples generated from OneThinker-600k with Seed1.5-VL and then filtered, used for the cold start in the SFT stage.
Through the joint training of image and video tasks, OneThinker can establish a unified reasoning ability in the spatial and temporal dimensions, thus achieving general understanding across modalities and multiple tasks.
EMA-GRPO: Improving the stability of multi-task RL training
Traditional reinforcement-learning methods suffer from significant training imbalance in multi-task, multi-modal settings.
Reward structures differ sharply across tasks (for example, detection rewards are dense while question-answering rewards are often sparse), which easily unbalances training across samples or tasks.
OneThinker therefore introduces EMA-GRPO (Exponential Moving Average Group Relative Policy Optimization), a new RL training algorithm. By normalizing each task's reward standard deviation with an exponential moving average, it addresses imbalance at two levels:
- Uneven sample weights within a task: alleviates the model's over-dependence on low-variance samples;
- Imbalanced gradient contributions across tasks: prevents sparse-reward tasks from dominating backpropagation and suppressing the learning of other tasks.
Experiments show that EMA-GRPO significantly improves training stability and convergence speed in the RL stage, providing effective support for multi-task training of large-scale unified reasoning models.
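The article describes the mechanism only at a high level; as a rough illustration, here is a minimal sketch of per-task EMA normalization of reward standard deviation, in the spirit of EMA-GRPO. The class name, `decay` value, and update rule are our assumptions, not the paper's exact formulation:

```python
import numpy as np

class EMAGRPONormalizer:
    """Illustrative sketch: normalize group advantages by a per-task
    exponential moving average (EMA) of the reward standard deviation,
    rather than by each group's own std as in vanilla GRPO."""

    def __init__(self, decay=0.99):
        self.decay = decay
        self.ema_std = {}  # task name -> running EMA of reward std

    def advantages(self, task, rewards):
        rewards = np.asarray(rewards, dtype=float)
        group_std = float(rewards.std())
        # Initialize from the first observed group, then update the EMA.
        prev = self.ema_std.get(task, group_std)
        self.ema_std[task] = self.decay * prev + (1.0 - self.decay) * group_std
        # Vanilla GRPO divides centered rewards by the per-group std;
        # dividing by the task-level EMA std instead keeps low-variance
        # groups and sparse-reward tasks from dominating the gradient.
        return (rewards - rewards.mean()) / (self.ema_std[task] + 1e-6)
```

Because the denominator is smoothed across a task's whole reward history, a single group with near-zero reward variance no longer produces exploding advantages, which is one plausible reading of the stability gains reported above.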
Experimental results
To evaluate OneThinker comprehensively, the research team ran systematic tests on 31 mainstream benchmarks across the image and video modalities, covering 10 core visual tasks including image question answering, video understanding, spatial grounding, temporal grounding, object segmentation, and object tracking.
OneThinker performed excellently on image question answering, reaching 70.6% on MMMU and 64.3% on MathVerse. On video understanding, it scored 66.2% on VideoMME.
On the temporal and spatial grounding tasks, the model scored 93.7% on the RefCOCO testA spatial-grounding split, while R@0.5 on Charades and ActivityNet reached 68.3 and 43.6 respectively.
OneThinker also achieved an AO of 73.0 on the GOT-10k tracking benchmark and a J&F score of 54.9 on the ReasonVOS video-segmentation benchmark, demonstrating robust performance on perception-related tasks. For more task results, please refer to the original article.
The research team also found that OneThinker achieves effective knowledge transfer and sharing between certain tasks and modalities, with different tasks promoting one another.
At the same time, OneThinker shows zero-shot ability on unseen tasks, directly adapting to point tracking, image quality assessment, GUI understanding, and rotated object detection, reflecting strong task generalization.
It can be said that the launch of OneThinker not only demonstrates the potential of reinforcement learning in unified multi - modal and multi - task visual reasoning but also provides a clear path for building a real visual generalist model.
In the trend of large models moving towards multi-modality, strong reasoning, and generalist capabilities, the work of OneThinker may only be a starting point, but the direction it validates is becoming a key link on the way to Artificial General Intelligence (AGI) in vision.
For more details, please refer to the original article.
Paper address: https://arxiv.org/pdf/2512.03043
Code address: https://github.com/tulerfeng/OneThinker
This article is from the WeChat official account "QbitAI". Author: OneThinker team. Republished by 36Kr with permission.