
Surpassing NVIDIA's Describe Anything, the Chinese Academy of Sciences and ByteDance jointly propose "GAR", adding a new piece to the DeepSeek-OCR line of work.

QbitAI (量子位), 2025-10-28 15:25
Natural images can also be used as a form of text compression.

Recently, DeepSeek-OCR proposed the new idea of "Vision as Context Compression". However, its research mainly focuses on using the model's OCR capability to compress document text via images.

So, can natural images also be used for text compression? "Grasp Any Region" jointly proposed by the Chinese Academy of Sciences and ByteDance provides a new perspective.

The team believes that the precise region captioning ability achieved by their latest work, Grasp Any Region (GAR), offers one of the potential paths for constructing dense captions for natural images.

Specifically, GAR has three capabilities:

1. Accurately describe the region specified by the user.

2. Model the relationships between multiple regions.

3. Conduct complex combinatorial reasoning (such as the non-entity discrimination shown in the figure).

Let's take a closer look.

The trade-off between local details and global information

First, what are Region MLLMs?

Different from traditional MLLMs, Region MLLMs aim to achieve fine-grained and interactive understanding of image/video content.

Specifically, users can provide various visual prompts (regions) and user instructions, and the model needs to accurately understand the specific region based on these.

For example, "Please describe this region", or "What is the relationship between Region 1 and Region 2", or even judge "Whether Region 1 and Region 2 are in the mirror".

Second, why study Region MLLMs?

At its core, DeepSeek-OCR relies on the ability of large multimodal models to caption images accurately, and it has also preliminarily explored information compression based on full-image captions of natural images.

However, it is often difficult to evaluate full-image captions.

Region captions are different: for a user-specified region, the model's caption can be evaluated objectively on basic attributes such as color, texture, shape, and material, as in the DLC-Bench introduced with NVIDIA's Describe Anything.

If a model can caption regions accurately, it can be combined with SAM to merge accurate region captions into a detailed, accurate full-image caption, further realizing information compression.
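As an illustration of this idea (not the team's released code), the sketch below stitches per-region captions into one dense full-image caption; `segment_image` and `caption_region` are hypothetical stand-ins for SAM and a GAR-style region captioner.

```python
# Minimal sketch (not the authors' code): merging per-region captions into a
# dense full-image caption. `segment_image` and `caption_region` stand in for
# SAM and a region captioner such as GAR; both are hypothetical interfaces.
from typing import Callable, List

import numpy as np

Mask = np.ndarray  # boolean HxW array for one region


def dense_caption(
    image: np.ndarray,
    segment_image: Callable[[np.ndarray], List[Mask]],   # e.g. SAM's automatic mask generator
    caption_region: Callable[[np.ndarray, Mask], str],   # e.g. GAR given the image + a mask prompt
) -> str:
    """Build a detailed full-image caption from per-region captions."""
    masks = segment_image(image)                          # 1) propose regions
    # 2) caption each region; larger regions first so the summary reads top-down
    masks = sorted(masks, key=lambda m: m.sum(), reverse=True)
    region_captions = [caption_region(image, m) for m in masks]
    # 3) concatenate into one dense caption (a real pipeline might let an LLM
    #    rewrite this list into fluent prose instead of simply joining it)
    lines = [f"Region {i + 1}: {c}" for i, c in enumerate(region_captions)]
    return "\n".join(lines)
```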

More importantly, this detailed caption can not only benefit the pre-training of MLLMs but also assist generative models in understanding complex user instructions.

In addition, region captions themselves can also serve as an important data source for editing models and scene generation models in AIGC.

For many years, Region MLLMs have been caught in a dilemma between local details and global information.

Osprey, proposed by researchers at Zhejiang University, obtains local features through mask pooling, which loses local details;

while DAM (Describe Anything Model), proposed by NVIDIA, additionally feeds in a cropped sub-image, which loses global information.

Figure 2: Comparison of descriptions of the same region by GAR-1B and DAM-3B

For example, the figure above shows DAM's loss of global information: the region specified by the user is clearly a frog-shaped slipper, but DAM misidentifies it as a frog.

Seeing is believing

In contrast, GAR can accurately understand the region specified by the user and produce more accurate descriptions.

For example, GAR can correctly identify and describe objects, whereas DAM misidentifies them.

Moreover, GAR can accurately identify extremely small objects.

Furthermore, it can use extremely small image details to correctly model the relationships between objects.

Especially in the example on the right side of the figure below, both OpenAI o3 and Gemini 2.5 Pro mistakenly think that the person is reading a book.

However, in fact, the person's eyes are looking at the camera, and she is just holding the book, not reading it. This highlights the strong ability of the GAR model to understand details.

GAR can also conduct complex combinatorial reasoning, such as jointly judging whether multiple prompted regions are in the mirror.

In addition, GAR transfers well to video description, and its descriptions of appearance in videos are very accurate.

Meanwhile, in video understanding tasks, GAR can accurately identify objects, people, and actions in videos and conduct in-depth semantic analysis.

It can also accurately understand a single region in a video and even identify motion information (as shown in the example on the right in the following figure).

Wow, how did it achieve such strong performance?

Fine-grained + global context

Specifically, when designing the GAR model, the team followed the core principle of "achieving fine-grained understanding of the prompted region while retaining and utilizing the global context of the entire scene".

As shown in the following figure, the team introduced two new components into the traditional MLLM architecture:

1. A simple and efficient prompt encoding scheme;

2. An innovative Region of Interest (RoI)-aligned feature replay technique.

GAR generates a global feature map of the entire scene through a visual encoder, thus completely retaining the global context information.

Meanwhile, the RoI-Aligned Feature Replay mechanism can extract high-fidelity features for specific target objects.

Finally, the global context features and refined local features are jointly input into the LLM to accurately infer the complex associations and interaction relationships between multiple objects.
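A rough sketch of this joint input, with assumed tensor shapes and interfaces (not the authors' implementation):

```python
# Rough sketch (assumed shapes) of how the two feature streams described above
# could be assembled into one input sequence for the LLM.
import torch


def build_llm_inputs(
    global_tokens: torch.Tensor,   # (1, N_global, C) full-scene features from the visual encoder
    region_tokens: torch.Tensor,   # (1, N_region, C) RoI-aligned replayed features
    text_embeds: torch.Tensor,     # (1, N_text, C) embedded user instruction
) -> torch.Tensor:
    # Global context first, then the high-detail region tokens, then the text:
    # the LLM attends over all three to reason about the prompted region(s).
    return torch.cat([global_tokens, region_tokens, text_embeds], dim=1)
```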

Let's take a closer look below.

To integrate spatial guidance information into the visual backbone network, the team introduced a lightweight prompt encoding mechanism.

First, the binary mask specified by the user is processed by a simple, zero-initialized convolution block to generate a mask embedding;

Subsequently, it is added to the patch embedding of ViT to complete the fusion of spatial information and visual features.
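A minimal PyTorch sketch of this prompt-encoding idea, assuming a ViT patch size of 14, a hidden size of 1024, and no class token (all assumptions, not the authors' code):

```python
# Minimal sketch of the prompt-encoding idea under assumed shapes;
# not the authors' implementation.
import torch
import torch.nn as nn


class MaskPromptEncoder(nn.Module):
    def __init__(self, patch_size: int = 14, embed_dim: int = 1024):
        super().__init__()
        # A single conv that patchifies the binary mask the same way the ViT
        # patchifies the image, so the two embeddings align token-for-token.
        self.mask_proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Zero initialization: at the start of training the mask prompt adds
        # nothing, leaving the pretrained ViT features untouched.
        nn.init.zeros_(self.mask_proj.weight)
        nn.init.zeros_(self.mask_proj.bias)

    def forward(self, patch_embeds: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (B, N, C) tokens from the ViT patch embedding layer
        # mask:         (B, 1, H, W) binary mask given by the user
        mask_embeds = self.mask_proj(mask.float())            # (B, C, H/p, W/p)
        mask_embeds = mask_embeds.flatten(2).transpose(1, 2)  # (B, N, C)
        return patch_embeds + mask_embeds                     # fuse the spatial prompt
```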

To provide both sufficient local details and the necessary global context, the team proposed the RoI-aligned feature replay technique.

Specifically, the model processes the complete, uncropped image (including the mask prompt) by slicing to generate a global feature map, which is rich in context information.

Then, a corresponding bounding box (bbox) is generated for the region of interest based on the input mask, and RoI-Align is used to extract the relevant features of this region directly from the global feature map, as shown on the right side of Figure 3.

Since these features are essentially derived from the "feature map calculated based on the entire image", they are naturally context-aware.

Meanwhile, the replayed features can provide the subsequent language model with a "high-detail, high-resolution" feature representation of the region specified by the user, helping it achieve fine-grained understanding.

This "replay mechanism of context - rich features" enables GAR to "focus on details" without "ignoring the global".

Experiments have proven that this design can achieve two major goals simultaneously:

1. Provide sufficient local details.

2. Retain global context.

To improve the model's ability to "recognize basic targets in a single region" and further support "complex association reasoning in multiple regions", the team designed a multi-stage process to generate a large-scale, high-quality dataset, as shown in Figure 4.

Specifically, a seed description-generation model is first trained on a seed dataset. This model is then run on the ImageNet-21K fine-grained image classification dataset, and the outputs are filtered against the category names to construct 456,000 fine-grained description samples;

Subsequently, the fine-grained description generation model is trained by combining the above two types of datasets, and with the help of the annotation information of the Panoptic Scene Graph dataset, a sufficient number of association-aware descriptions and question-answer pairs are generated.

Finally, the team uses these three parts of data to train the GAR model.

Stage 1: Improve recognition ability.

In the initial stage, the team started from the Describe Anything-1.5M dataset.

However, the team found that the model trained on this dataset (the Seed-Captioner) falls short in fine-grained recognition: it often misidentifies objects, which limits the quality of the descriptions it generates in more complex scenarios.

To solve this problem, the team ingeniously introduced ImageNet-21K data. As a representative fine-grained classification dataset, ImageNet-21K is known for the exhaustiveness and wide coverage of its category labels.

The team first generated an initial region caption with the Seed-Captioner, then used an LLM to check the generated description against the ground-truth category label, finally obtaining a refined fine-grained dataset of 456,000 samples.
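A minimal sketch of this label-verification filter is shown below; the caption model and the LLM judge are stand-in callables, and the prompt wording is an assumption rather than the team's actual pipeline.

```python
# Minimal sketch of the label-verification filter described above. The caption
# model and LLM judge are hypothetical callables, not the authors' actual tools.
from typing import Callable, Iterable, List, Tuple


def filter_by_label(
    samples: Iterable[Tuple[str, str]],          # (image_path, ground-truth ImageNet-21K label)
    caption_region: Callable[[str], str],        # e.g. the Seed-Captioner on the labeled object
    llm_judge: Callable[[str], str],             # any LLM that answers "yes"/"no"
) -> List[Tuple[str, str]]:
    """Keep only captions whose described object matches the dataset label."""
    kept = []
    for image_path, label in samples:
        caption = caption_region(image_path)
        prompt = (
            f"Caption: {caption}\n"
            f"Category label: {label}\n"
            "Does the caption describe an object of this category? Answer yes or no."
        )
        if llm_judge(prompt).strip().lower().startswith("yes"):
            kept.append((caption, label))        # keep as a fine-grained training sample
    return kept
```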

Subsequently, the team combined the above two types of datasets to train a fine-grained description generation model (the Fine-Grained-Captioner).

Stage 2: Support multi-region association reasoning.

To further achieve the associated understanding and reasoning of multiple regions, the team introduced the Panoptic Scene Graph (PSG) dataset.

The specific steps are as follows:

First, call the Fine-Grained-Captioner to generate a detailed description for each region;

Then, use Qwen2.5-72B as the "LLM Merger", combining the original annotation information provided by the PSG dataset to generate three types of data (a minimal sketch of this merging step follows the list below):

1. 144,000 rich object descriptions that explicitly incorporate the associated context;

2. 144,000 question-answer pairs that test the understanding of complex associations;

3. 126,000 multiple-choice questions.

In total, this stage yields an association-aware dataset of 414,000 samples.
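Purely as an illustration, the snippet below shows how per-region captions and PSG relation triplets might be packed into a single prompt for an LLM merger such as Qwen2.5-72B; the prompt wording and data layout are assumptions, not the released pipeline.

```python
# Illustrative sketch only: packing region captions and PSG relation triplets
# into one prompt for an "LLM Merger". The prompt wording and data layout are
# assumptions, not the authors' released pipeline.
from typing import Dict, List, Tuple


def build_merge_prompt(
    region_captions: Dict[int, str],              # region id -> Fine-Grained-Captioner output
    relations: List[Tuple[int, str, int]],        # PSG triplets: (subject id, predicate, object id)
) -> str:
    lines = ["You are given per-region descriptions and ground-truth relations."]
    for rid, caption in sorted(region_captions.items()):
        lines.append(f"Region {rid}: {caption}")
    for subj, pred, obj in relations:
        lines.append(f"Relation: Region {subj} {pred} Region {obj}")
    lines.append(
        "Write (1) a relation-aware description of each region, and "
        "(2) question-answer pairs that test these relations."
    )
    return "\n".join(lines)
```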