A multimodal large model has achieved pixel-level reasoning for the first time: with only 3B parameters, it outperforms traditional 72B-parameter models, and the work has been accepted to NeurIPS 2025.
For the first time, a multimodal large model covers three major tasks, referring, segmentation, and reasoning, all at once at the pixel level!
Having AI describe an image is now a piece of cake. However, even models like GPT-5 and Gemini 2.5 Pro often only "get the general idea" and struggle with more precise target recognition and reasoning.
In response, a research team from the Hong Kong Polytechnic University and Tencent ARC Lab has proposed UniPixel, the first unified pixel-level multimodal large model.
Without further ado, let's take a look at the effects of UniPixel:
A single UniPixel model can complete all three major tasks, target referring (Referring), pixel-level segmentation (Segmentation), and region reasoning (Reasoning), combining flexibility, precision, and scalability.
Currently, the paper has been accepted by NeurIPS 2025, and the code, data, and demo are all open-source!
Here is more detailed information.
UniPixel Redefines Visual Reasoning
Most traditional visual question-answering or description systems conduct reasoning based on the overall image or video information, lacking precise perception of "specific regions" or "designated targets" in the image.
This not only limits their practical applications in scenarios such as medical diagnosis, autonomous driving, and human-computer interaction but also fails to meet users' high-level requirements for "controllability" and "interpretability."
Take an everyday task as an example: "Please point out the person sitting on the left in the picture and describe what he is doing." A human quickly focuses on the target on the left and makes a judgment and description based on perspective, behavior, and context. For traditional LMMs, however, such questions are often hard to answer accurately because they lack region guidance and saliency modeling.
By introducing an Object Memory Bank and a unified encoding scheme that supports three types of visual prompts (points, boxes, and masks), UniPixel supports the full "perception - memory - reasoning" pipeline over user prompts.
Unlike existing models limited to plain segmentation or region-level understanding, UniPixel not only identifies the target the user points to, but also explicitly carries that target as context into subsequent conversation and outputs tightly coupled segmentation results, language answers, or descriptions.
To achieve this goal, UniPixel has made systematic innovations in its architectural design.
As shown in the figure below, its overall framework is built on the Qwen2.5-VL model, supports both image and video input, and can perceive and process prompts such as text, points, boxes, and masks.
The user can input an image or video, a text prompt, and several optional visual prompts; the model then outputs natural-language answers and, optionally, spatial-temporal masks, enabling interaction grounded in fine-grained visual detail.
△ Overall framework of UniPixel
To enable this framework to truly have the ability of "pixel-level reasoning," UniPixel has further introduced three key modules:
- Prompt Encoder: Supports three types of visual prompts: points, boxes, and masks;
- Object Memory Bank: Used to store user-specified targets and support multi-round referencing;
- Mask Decoder: Achieves precise spatial-temporal mask generation.
In addition, UniPixel has expanded the vocabulary of the language model, adding special tokens such as <REF>, <MEM>, and <SEG> to guide the injection of visual prompts, the invocation of object memory, and the mask generation process, thus establishing a close connection between language generation and pixel perception.
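For illustration, here is a minimal sketch of how such special tokens are typically registered with a tokenizer and language model using the Hugging Face Transformers API. The stand-in checkpoint and the exact calls below are assumptions for demonstration, not details taken from the UniPixel codebase.

```python
# Minimal sketch (not the official UniPixel code): registering <REF>, <MEM>, and <SEG>
# as special tokens and growing the embedding table so they become trainable.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen2.5-3B-Instruct"  # assumed stand-in; UniPixel itself builds on Qwen2.5-VL
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# <REF> marks where a visual prompt is injected, <MEM> where a memory object is recalled,
# and <SEG> triggers mask generation, as described above.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<REF>", "<MEM>", "<SEG>"]}
)
model.resize_token_embeddings(len(tokenizer))  # new embedding rows are randomly initialized
print(f"Added {num_added} tokens; vocab size is now {len(tokenizer)}")
```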
Specifically, it includes three major technical highlights:
Unified Encoding of Three Types of Visual Prompts
To allow maximally flexible interaction, UniPixel designs a Prompt Encoder module that encodes all three types of visual prompts in a unified way.
Whether the prompt is a point, a box, or a mask, it is mapped to a high-dimensional vector in the same shared space.
This encoding integrates spatial coordinates, temporal positions, and prompt-type information, and is aligned with the visual tokens through a projection layer.
Compared with previous models that accept only text prompts or simplified image regions, UniPixel can handle more complex user inputs, for example, clicking on a target at the 5-second mark of a video and asking about the events before and after it. Such a scenario can be parsed and processed accurately by combining a point prompt with a time identifier.
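The sketch below shows one plausible way to implement such unified prompt encoding in PyTorch. All module names, dimensions, and design choices are assumptions for illustration rather than the actual UniPixel implementation; the point is only that points, boxes, and masks can share one embedding space that also carries type and time information.

```python
import torch
import torch.nn as nn

class PromptEncoderSketch(nn.Module):
    """Illustrative sketch (not the official UniPixel code): encode point, box, and
    mask prompts into a single shared embedding space, as described above."""

    def __init__(self, dim: int = 1024, mask_patch: int = 16):
        super().__init__()
        self.type_embed = nn.Embedding(3, dim)                    # 0: point, 1: box, 2: mask
        self.coord_proj = nn.Linear(4, dim)                       # (x1, y1, x2, y2); a point repeats x, y
        self.time_proj = nn.Linear(1, dim)                        # normalized timestamp for videos
        self.mask_proj = nn.Linear(mask_patch * mask_patch, dim)  # coarse, downsampled binary mask
        self.out_proj = nn.Linear(dim, dim)                       # align with the visual token space

    def forward(self, prompt_type, coords, timestamp, mask=None):
        emb = (self.type_embed(prompt_type)
               + self.coord_proj(coords)
               + self.time_proj(timestamp))
        if mask is not None:                                      # only mask prompts carry this term
            emb = emb + self.mask_proj(mask.flatten(-2))
        return self.out_proj(emb)

# Example: a point clicked at t = 5 s in a 20 s video (timestamp normalized to 0.25).
encoder = PromptEncoderSketch()
point = torch.tensor([[0.4, 0.6, 0.4, 0.6]])                      # point duplicated into box form
prompt_token = encoder(torch.tensor([0]), point, torch.tensor([[0.25]]))
print(prompt_token.shape)                                         # torch.Size([1, 1024])
```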
Object Memory Bank Mechanism, Endowing the Model with the Ability to Remember Targets
One of the core designs of UniPixel is its Object Memory Bank module, which is a dynamically updatable hash structure used to store and manage user-specified target regions during the reasoning process. Its operating mechanism is shown in the figure.
Specifically, whenever the user uses markers in the input to refer to a target, the model automatically triggers a "memory pre-filling" process, intelligently identifies and generates the corresponding spatial-temporal mask, and then writes it as object information into the memory bank.
This mechanism allows the model to reuse these memory objects continuously in multi-round conversations, achieving true "context-controlled reasoning."
If the user mentions a target again later, simply using its previously assigned number automatically activates the corresponding region; through the "memory injection" mechanism, the target's features are inserted into the prompt for the LLM to reason over.
This mechanism breaks the one-shot "prompt - response" pattern of traditional methods, equipping the model with a human-like "attend - remember - infer" capability.
For example, when the user asks, "What kind of interaction is there between [1] and [2]?" the model can abstract the behavior trajectories of the two through masks and re-perceive from the original picture or video to generate a reasonable answer.
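To make the mechanism concrete, here is a minimal sketch of how such an object memory could be organized. The class and method names (prefill, inject) and the mask-pooling details are illustrative assumptions, not the official API.

```python
import torch

class ObjectMemoryBank:
    """Sketch of the object memory described above: a hash map from user-assigned object
    IDs (e.g. [1], [2]) to their spatial-temporal masks and mask-pooled features.
    The method names and internals are illustrative assumptions, not the official API."""

    def __init__(self):
        self._store: dict[int, dict] = {}

    def prefill(self, obj_id: int, masks: torch.Tensor, frame_feats: torch.Tensor) -> None:
        """Memory pre-filling: store the generated masks for an object together with
        per-frame features pooled from the video under those masks."""
        # masks: (T, H, W) binary; frame_feats: (T, C, H, W)
        weights = masks.unsqueeze(1).float()                                   # (T, 1, H, W)
        pooled = (frame_feats * weights).sum(dim=(2, 3)) / weights.sum(dim=(2, 3)).clamp(min=1.0)
        self._store[obj_id] = {"masks": masks, "features": pooled}             # features: (T, C)

    def inject(self, obj_id: int) -> torch.Tensor:
        """Memory injection: fetch the stored features so they can stand in for a <MEM>
        token in the LLM prompt when the user refers to [obj_id] again."""
        return self._store[obj_id]["features"]
```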
Mask-Guided Reasoning, Deeply Integrating Understanding and Segmentation
In addition to accurately identifying target regions, UniPixel also embeds the mask generation process into the language model reasoning process, achieving a two-way closed loop of "language-guided segmentation, segmentation feeding back understanding."
Specifically, during the reasoning process, the model generates <SEG> tokens as mask trigger flags. Each <SEG> token is input into the mask decoder, and the corresponding target mask is generated based on the context and known prompts.
These masks are then used to pool features from the original image or video, which are converted into object features the LLM can recognize and use to answer more complex semantic questions.
This mechanism greatly improves the model's performance in video understanding tasks. Take an actual task as an example: "What are the differences in the behaviors of [1] and [2]?" Through the modeling of the behavior regions of [1] and [2] and the comparison of mask features, UniPixel can accurately give an answer and point out the corresponding regions in each frame.
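Conceptually, the <SEG>-driven loop can be sketched as follows. The mask_decoder callable and the pooling details are assumptions used only to illustrate how a <SEG> hidden state might condition mask generation and how the resulting mask could be pooled into an object feature.

```python
import torch

def decode_masks_from_seg_tokens(hidden_states, token_ids, seg_token_id, mask_decoder, frame_feats):
    """Sketch of the mask-guided loop described above (illustrative, not the official code).
    Every <SEG> token's hidden state conditions the mask decoder, and each resulting mask is
    pooled over the frame features to form an object feature that the LLM can reason with."""
    # hidden_states: (L, C) LLM hidden states; token_ids: (L,); frame_feats: (T, C, H, W)
    object_feats = []
    seg_positions = (token_ids == seg_token_id).nonzero(as_tuple=True)[0]
    for pos in seg_positions:
        query = hidden_states[pos]                        # hidden state of one <SEG> token, (C,)
        mask = mask_decoder(query, frame_feats)           # assumed callable returning a (T, H, W) mask
        weights = mask.unsqueeze(1)                       # (T, 1, H, W)
        pooled = (frame_feats * weights).sum(dim=(2, 3)) / weights.sum(dim=(2, 3)).clamp(min=1e-6)
        object_feats.append(pooled)                       # per-frame object feature, (T, C)
    return object_feats
```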
In addition, UniPixel adopts a modular, phased training strategy.
The model first pre-trains the visual encoder and the language model, and then gradually introduces components such as the Prompt Encoder, Object Memory Bank, and Mask Decoder for joint training, enabling each module to work together without overfitting to specific tasks.
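A phased setup of this kind is often implemented by toggling which parameter groups are trainable at each stage. The sketch below is an assumption about how that could look; the stage names and module attribute names are invented for illustration and do not come from the UniPixel paper or codebase.

```python
def configure_stage(model, stage: str) -> None:
    """Sketch of a phased training setup: earlier stages keep most of the backbone frozen
    and train only the newly added modules, then later stages train everything jointly."""
    trainable_by_stage = {
        "prompt_alignment": ["prompt_encoder"],                        # learn prompt encoding first
        "mask_pretrain": ["prompt_encoder", "mask_decoder"],           # then mask generation
        "joint_finetune": ["prompt_encoder", "mask_decoder",
                           "object_memory", "language_model"],         # finally, joint training
    }
    allowed = trainable_by_stage[stage]
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in allowed)
```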
In addition, the authors have constructed and integrated multiple datasets, covering three types of data: text, images, and videos, as well as various types of visual prompts (points, boxes, masks).
The entire training data scale reaches approximately 1 million samples (see the table below for details), supporting various task types from static object reference to temporal mask generation. These data provide a unified and diverse training environment for the model, enhancing its adaptability under different task settings.
△ Overview of the training data (about 1 million samples)
Experiments and Evaluations
To verify the effectiveness of the UniPixel framework, the authors conducted extensive experiments on 10 public benchmark datasets, covering 9 major visual-language understanding tasks. The specific tasks and dataset settings are shown in the figure.
△ Benchmark tasks and dataset settings
Target Segmentation Task
Thanks to the unified framework design and the progressive training paradigm, UniPixel shows significant performance advantages in the segmentation task.
Among them, on the relatively difficult ReVOS reasoning-segmentation benchmark, UniPixel-3B reaches 62.1 J&F, surpassing all existing models and indicating a stronger ability to connect the understanding of complex text prompts with pixel-level mask generation. The complete test results on the ReVOS dataset are shown in the table below.
On other datasets such as MeViS, Ref-YouTube-VOS, and RefCOCO/+/g, UniPixel also achieves the best performance. Test results on the MeViS, Ref-YouTube-VOS, Ref-DAVIS17, and GroundMore datasets are shown in the next table.
Test results on the RefCOCO/+/g (cIoU) and ReasonSeg datasets are reported in a separate table.
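For readers unfamiliar with the J&F score used in these video-segmentation benchmarks, the sketch below illustrates the idea behind the metric: J is the region IoU between predicted and ground-truth masks, F measures boundary agreement, and J&F is their average (averaged over frames and objects in practice). The boundary computation here is a simplified stand-in; the official benchmark matches boundaries within a distance tolerance.

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 1.0

def boundary_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Simplified contour accuracy F: precision/recall of exact boundary pixels.
    (The official benchmark matches boundaries within a distance tolerance.)"""
    def boundary(mask: np.ndarray) -> np.ndarray:
        # A pixel is on the boundary if it is foreground with a background 4-neighbour.
        padded = np.pad(mask, 1)
        eroded = (padded[1:-1, 1:-1] & padded[:-2, 1:-1] & padded[2:, 1:-1]
                  & padded[1:-1, :-2] & padded[1:-1, 2:])
        return mask & ~eroded

    bp, bg = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    tp = np.logical_and(bp, bg).sum()
    precision = tp / bp.sum() if bp.sum() else 1.0
    recall = tp / bg.sum() if bg.sum() else 1.0
    return float(2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """J&F: the average of region similarity and contour accuracy."""
    return (jaccard(pred, gt) + boundary_f(pred, gt)) / 2
```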
Region Understanding Task
On the VideoRefer-Bench benchmark, UniPixel also achieves leading performance in video region understanding with mask prompts, showing its adaptability and robustness to visual prompts.
This task requires the model to understand the mask region specified by the user based on complex language descriptions and correctly parse its dynamic changes and semantic relationships in the video.
With its object memory mechanism and multi-modal collaborative encoding ability, UniPixel can accurately capture the boundaries and behavior changes of the target region.
Among them, the test results on the VideoRefer-Bench-D and VideoRefer-Bench-Q datasets are reported in two further tables.