Letting a large model "look and revise as it goes" lifts visual segmentation accuracy by 9 percentage points.
In the era of agents, how can visual segmentation be made more accurate?
Fudan University and Shanghai Chuangzhi College have jointly released RSAgent, which offers the latest answer: let large multimodal models generate accurate masks through multi-round tool calls.
The work has been accepted to ICML 2026.
Visual segmentation has always been a task that is "easy to describe but hard to get right".
Give a model an image and a sentence, and ask it to mark the pixels of the target region: it sounds straightforward. However, once the target is vague, occluded, or can only be located through reasoning, producing the correct mask in a single attempt becomes quite difficult.
The RSAgent team believes that what the existing methods lack is not a stronger segmentation head, but a process of "confirmation and error correction".
To this end, they built RSAgent, an agent framework that enables large multimodal models to complete text-guided segmentation through multi-round tool calls.
The model no longer outputs the mask in one go. Instead, it observes the image, reasons, calls visual tools, reads the feedback, and then iteratively corrects itself based on earlier results to produce a more reliable and accurate final mask.
Experiments show that on the ReasonSeg test set, RSAgent's gIoU is 9.0 percentage points higher than that of Seg-Zero-7B. On the RefCOCOg dataset, it achieves an average cIoU of 81.5%. The team also generated more than 5,000 multi-round reasoning segmentation trajectories.
What makes open-semantic segmentation difficult?
Multimodal large language models (MLLMs) can already describe images, answer questions, and understand object relationships. However, real-world visual systems need more than textual answers.
Interactive annotation, robot perception, design editing, industrial quality inspection, and scientific image analysis all require the model to translate language understanding into pixel regions.
That is to say, the model must achieve a reliable conversion between "semantic understanding" and "accurate masks".
The challenge of open-semantic, text-guided segmentation is that input instructions are not always simple category names.
The user may say "the object being picked up by the person on the left side of the picture", or "find the equipment that ensures personal safety in the turbulent water".
The former requires spatial relationships, while the latter requires scene common sense and usage reasoning.
If the model only makes a single forward prediction, it is difficult to verify whether it has selected the correct target.
The shortcoming of previous approaches is not an "inability to generate masks", but the "lack of a confirmation and error-correction process".
Once the initial localization drifts, the point prompts land on the background, or the candidate regions cover only part of the target, the model often has no chance to look again, zoom in, read the candidate results, and adjust its strategy.
RSAgent addresses this pain point by transforming the segmentation task from static prediction to dynamic interaction. The team said:
The goal is not simply a more complex segmentation head, but giving the model the ability to "judge first, act, observe feedback, and then correct" in open-semantic tasks.
How is it solved? Teach the MLLM to reason and act
The key to RSAgent is not to turn the MLLM directly into a mask decoder, but to make it an agent that can orchestrate visual tools.
In each round, the model receives the original image, text instructions, and historical observations, and outputs structured reasoning and tool calls. The tools return local views, candidate masks, or overlays. Then, based on this feedback, the model decides whether to continue calling tools, adjust prompts, or submit the final answer.
The following figure compares LISA, Seg-Zero, and RSAgent. RSAgent continuously locates, observes, and corrects through multi-round tool calls.
The overall framework of RSAgent is as follows, including multi-round interaction, tool calls, observation feedback, cold-start SFT, and agentic RL.
The specific technical modules and their functions are as follows:
At the data level, RSAgent constructs training trajectories through automatic synthesis and strict screening.
The cold-start SFT data in the paper contains approximately 5K high-quality multi-round reasoning trajectories. The RL stage uses about 2K RL examples plus an additional 8K RefCOCOg training samples, so that the model learns higher-reward tool-calling paths in the interactive environment.
The following figure shows the data pipeline. The system generates questions, synthesizes multi-round trajectories, and filters them to obtain high-quality training samples.
The team said that the real key is not just "calling tools": RSAgent ties reasoning, tools, feedback, and rewards into a closed-loop training system.
The model not only needs to understand the target but also learn to adaptively zoom, prompt, segment, and stop, and finally translate open semantic understanding into accurate masks.
Specifically, one RSAgent interaction can be understood as a four-step cycle:
- Observation: read the image and the historical results;
- Thought: analyze, in natural language, whether the current candidate region satisfies the instruction;
- Action: select a tool and its pixel prompts;
- Feedback: receive the tool output and write it into the context.
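The four-step cycle can be sketched as a small loop. This is a minimal illustration rather than RSAgent's actual code: the tool names `zoom` and `segment`, the `submit` action, the `policy` callable, and the `max_rounds` default are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    thought: str          # natural-language reasoning for this round
    tool: str             # hypothetical tool name, or "submit" to stop
    args: Dict[str, Any]  # pixel prompts such as a box or a point
    feedback: Any = None  # whatever the tool returned


@dataclass
class Episode:
    instruction: str
    history: List[Step] = field(default_factory=list)
    final_mask: Optional[Any] = None


def run_episode(policy: Callable, tools: Dict[str, Callable], image: Any,
                instruction: str, max_rounds: int = 5) -> Episode:
    """Run the Observation -> Thought -> Action -> Feedback cycle."""
    episode = Episode(instruction)
    last_mask = None
    for _ in range(max_rounds):
        # Observation + Thought + Action: the policy sees the image,
        # the instruction, and everything written to the history so far.
        thought, tool, args = policy(image, instruction, episode.history)
        if tool == "submit":
            episode.history.append(Step(thought, tool, args))
            break
        # Feedback: call the tool and write its output into the context.
        feedback = tools[tool](image, **args)
        if tool == "segment":
            last_mask = feedback
        episode.history.append(Step(thought, tool, args, feedback))
    episode.final_mask = last_mask
    return episode


# Toy stand-ins for the MLLM policy and the visual tools (assumptions).
def toy_policy(image, instruction, history) -> Tuple[str, str, dict]:
    if not history:
        return "Locate the target roughly first.", "zoom", {"box": (0, 0, 2, 2)}
    if history[-1].tool == "zoom":
        return "The crop shows the target; prompt a point.", "segment", {"point": (1, 1)}
    return "The mask covers the target.", "submit", {}


toy_tools = {
    "zoom": lambda image, box: f"crop{box}",
    "segment": lambda image, point: {point},  # mask as a set of pixels
}

episode = run_episode(toy_policy, toy_tools, image=None, instruction="the red cup")
print([step.tool for step in episode.history])  # ['zoom', 'segment', 'submit']
```

Capping the loop with `max_rounds` reflects the trade-off the article reports later: more rounds can help, but an overly long context may add redundancy and instability.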
This cycle means the model no longer relies on a single judgment and instead has a mechanism for step-by-step verification.
This mechanism is particularly well suited to relational, attribute-based, and implicit-reasoning instructions.
For example, the target may be very small, occluded, or identifiable only through actions, uses, and relative positions.
RSAgent can first localize roughly, then inspect the local area, and then re-specify points or boxes based on how the candidate masks deviate.
Compared with single-shot prediction, it has an additional intermediate process that can be reviewed.
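One detail that any "zoom in, then re-specify points" workflow has to handle is coordinate bookkeeping: a point chosen in an enlarged crop must be mapped back to full-image pixels before it is passed to the segmentation tool. The sketch below shows that mapping under stated assumptions; the function names, the margin, and the uniform `scale` factor are hypothetical and are not taken from the paper.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in full-image pixels


def expand_box(box: Box, margin: int, width: int, height: int) -> Box:
    """Pad a rough box with some context, clamped to the image bounds."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - margin), max(0, y0 - margin),
            min(width, x1 + margin), min(height, y1 + margin))


def crop_to_image(point: Tuple[float, float], crop_box: Box,
                  scale: float) -> Tuple[float, float]:
    """Map a point picked in a crop upscaled by `scale` back to the full image."""
    cx, cy = point
    x0, y0, _, _ = crop_box
    return (x0 + cx / scale, y0 + cy / scale)


# Rough localization on a 640x480 image, padded by 20 pixels of context.
crop_box = expand_box((100, 80, 140, 120), margin=20, width=640, height=480)
print(crop_box)  # (80, 60, 160, 140)

# The crop is shown to the model at 4x; it picks the point (160, 120) there.
print(crop_to_image((160, 120), crop_box, scale=4))  # (120.0, 90.0)
```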
In terms of training strategy, cold-start SFT addresses "can it follow the format at all", teaching the model the syntax of tool calls and the basic reflection process; agentic RL addresses "how to do better", optimizing the multi-round path through reward signals.
The combination of the two enables RSAgent to stably output structured results and learn better decisions on complex open semantic samples.
Experimental results: Leading performance on ReasonSeg and RefCOCOg
The experiments use Qwen2.5-VL-7B-Instruct as the base model and SAM2-large as the segmentation tool.
The team ran a systematic evaluation on the RefCOCO series and ReasonSeg, comparing against traditional vision-language segmenters, single-shot MLLM segmentation methods, explicit CoT/RL segmentation methods, and multi-round tool-calling agents.
The following figure shows that RSAgent has achieved leading performance on the RES and ReasonSeg benchmarks.
The specific evaluation results are as follows:
On the ReasonSeg test set, RSAgent achieved a gIoU of 66.5%, 9.0 percentage points above the 57.5% of Seg-Zero-7B;
On RefCOCOg, RSAgent achieved an average cIoU of approximately 81.5%, with 81.8 on the test split.
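The two metrics quoted above weight images differently. As commonly defined for ReasonSeg-style evaluation, gIoU averages the per-image IoU, while cIoU divides the cumulative intersection by the cumulative union over the whole set, so large objects count for more. The sketch below illustrates the difference on toy data; masks are flattened 0/1 lists, and returning 1.0 for an empty union is a convention assumed here, not something stated in the article.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = target pixel


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united pixels for one image."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def giou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Mean of per-image IoUs: every image counts equally."""
    ious = []
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        ious.append(inter / union if union else 1.0)  # assumed convention
    return sum(ious) / len(ious)


def ciou(pairs: List[Tuple[Mask, Mask]]) -> float:
    """Cumulative intersection over cumulative union: large objects dominate."""
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


pairs = [
    ([1, 0, 0, 0], [1, 1, 0, 0]),  # small object, half covered: IoU 0.5
    ([1] * 8, [1] * 8),            # large object, perfect mask: IoU 1.0
]
print(giou(pairs))  # 0.75
print(ciou(pairs))  # 0.9
```

On the same predictions the two numbers differ, which is why a gIoU figure and a cIoU figure on the same benchmark are not directly comparable.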
For target segmentation tasks that rely on open semantic reasoning, this shows that the model can not only understand the description but also more stably translate the understanding into accurate masks.
Ablation experiments show that the improvement does not come from a single module.
The untrained tool agent scored a cIoU of only 30.1 on the ReasonSeg test set; adding cold-start SFT raised it to 55.4; RL alone gave 54.3; and the complete cold-start SFT + RL pipeline reached 57.9.
This indicates that first teaching the model standardized tool calls and then optimizing long-horizon decisions through reinforcement learning is the key to RSAgent's success.
The following figure shows the ablation of the maximum number of tool-calling rounds. Moderately increasing the number of rounds can improve performance, but an overly long context may bring redundancy and instability.
Reward design is also crucial.
Removing the final reward, the process reward, or the format reward lowers performance in every case.
Without the final reward, the ReasonSeg test score drops from 57.9 to 48.3, indicating that the quality of the final mask remains the core objective.
The process reward encourages the model to continuously improve in the intermediate steps rather than blindly increasing the number of tool calls.
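To make the interplay of the three reward terms concrete, here is a hedged sketch of one way they could be combined. The article does not give the formulas, so everything here is an assumption for illustration: the final term is taken as the IoU of the last candidate mask, the process term as the mean IoU gain per step, the format term as a binary check, and the weights are arbitrary.

```python
from typing import Sequence


def total_reward(step_ious: Sequence[float], format_ok: bool,
                 w_final: float = 1.0, w_process: float = 0.5,
                 w_format: float = 0.1) -> float:
    """Combine final, process, and format rewards (illustrative weights).

    step_ious: IoU of each candidate mask against the ground truth,
               in the order the masks were produced during the episode.
    """
    format_term = 1.0 if format_ok else 0.0
    if not step_ious:
        # No mask was ever produced: only the format term can score.
        return w_format * format_term
    final_term = step_ious[-1]
    # Mean gain per step: extra calls that do not improve the mask
    # dilute this term instead of increasing it.
    gains = [b - a for a, b in zip(step_ious, step_ious[1:])]
    process_term = sum(gains) / len(gains) if gains else 0.0
    return w_final * final_term + w_process * process_term + w_format * format_term


direct = total_reward([0.4, 0.8], format_ok=True)
padded = total_reward([0.4, 0.4, 0.4, 0.8], format_ok=True)
print(direct > padded)  # True: same final mask, fewer wasted tool calls
```

Under this particular choice, two episodes that end with the same mask are ranked by how efficiently they got there, which is one way to reward improvement in intermediate steps without rewarding tool calls for their own sake.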
Bringing large vision models into a verifiable pixel action space
The value of RSAgent is not just in setting new benchmark numbers.
More importantly, it demonstrates a path from "looking at images and answering questions" to "visual action":
The model can continuously observe around the text target, call tools, receive feedback, correct assumptions, and translate the final judgment into image pixels.
This type of ability has general significance for interactive visual systems.
- For data annotation, it is expected to reduce manual trial and error;
- For robot perception, it allows the model to re-confirm the target area before execution;
- For design editing and content production, it can translate natural language intentions into more stable editable regions;
- For scientific image analysis, it provides a reviewable and verifiable intermediate process.
From a broader trend perspective, RSAgent connects open semantic understanding, tool calls, and pixel - level execution.
It shows that large multimodal models do not have to stay at "answering image questions" but can also actively explore, make mistakes, and correct in the visual space.
This direction advances visual agents to a form closer to real - world tasks. In a nutshell:
RSAgent shows that large multimodal models can progress from "combining text and image content" to "reasoning, acting, and self-correcting in the pixel space".
Finally, let's introduce the paper team.
The authors are from Fudan University, Shanghai Chuangzhi College, Shanghai Jiao Tong University, and other institutions. The co-first authors of the paper are He Xingqi and Zhang Yujie.
He Xingqi is a first-year master's student at Fudan University, with research interests in vision-language model reasoning and reinforcement learning.
Zhang Yujie is a jointly trained doctoral student at Shanghai Chuangzhi College and Fudan University, mainly researching vision-language model reasoning, reinforcement learning, and large language models.
Paper: https://arxiv.org/abs/2512.24023
GitHub: https://github.com/Nicola777-ai/RSAgent
This article is from the WeChat official account "QbitAI", author: Zhang Yujie from Shanghai Chuangzhi College. Republished by 36Kr with permission.