
Surpassing CLIP: Peking University Open-Sources a Fine-Grained Visual Recognition Large Model That Needs Only 4 Training Images per Class

QbitAI 2026-02-11 15:59
Large models can also distinguish aircraft models.

Multi-modal large models currently excel at many complex multi-modal tasks, yet on fine-grained visual recognition they lag significantly behind the very visual encoders they are built on (such as CLIP).

To address this, the research team led by Professor Peng Yuxin at Peking University has been studying fine-grained multi-modal large models in depth. Their latest paper has been accepted at ICLR 2026, and the code has been open-sourced.

The everyday world is fine-grained by nature: real-world objects typically come with rich category hierarchies and a vast number of fine-grained categories. Take airplanes as an example: the coarse-grained category "airplane" subdivides into hundreds of fine-grained subcategories such as "Boeing 707", "Boeing 717", and "Boeing 727". Civil aircraft databases have recorded more than 500 types of fixed-wing aircraft worldwide, and the number keeps growing. Fine-grained recognition of visual objects of arbitrary categories therefore has substantial research and application value in real production and daily life.

△ Figure 1. Overview of the fine-grained visual recognition large model (Fine-R1)

The fine-grained visual recognition large model aims to exploit the rich fine-grained subcategory knowledge stored in multi-modal large models, together with a generative category-name decoding paradigm, to break through the closed-domain, limited-category constraint of traditional recognition methods and recognize fine-grained visual objects of any category in an open domain.
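For contrast, a discriminative recognizer such as CLIP can only score an image against a candidate list fixed in advance. Below is a minimal sketch using the Hugging Face transformers CLIP API; the checkpoint name, file path, and label list are illustrative:

```python
# Closed-set recognition with CLIP: every possible answer must be enumerated
# up front, which is exactly the closed-domain limitation described above.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["Boeing 707", "Boeing 717", "Boeing 727"]  # fixed candidate set
image = Image.open("aircraft.jpg")  # illustrative path

inputs = processor(text=[f"a photo of a {name}" for name in labels],
                   images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image  # shape (1, len(labels))
print(labels[scores.argmax(dim=-1).item()])  # best match within the list only
```

A generative multi-modal model, by contrast, decodes the category name as free text, so the answer space is not fixed in advance.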

However, the fine-grained recognition ability of multi-modal large models depends on large amounts of training data, and fine-grained labels are difficult and expensive to collect, so labeled data at the scale required for large-model training is simply not available. Moreover, once trained on data covering a limited set of subcategories, large models struggle to generalize to subcategories outside the training set and cannot recognize the unbounded range of fine-grained subcategories in an open domain.

To tackle these problems, Professor Peng Yuxin's team proposed Fine-R1, a fine-grained visual recognition large model enhanced by chain-of-thought reasoning. Through chain-of-thought supervised fine-tuning and triplet-enhanced policy optimization, the model learns to infer unseen subcategories from the fine-grained subcategory knowledge present in the training set. With only 4 training images per category, its recognition accuracy on subcategories both inside and outside the training set surpasses discriminative models such as OpenAI's CLIP and Google DeepMind's SigLIP, demonstrating the great potential of generative multi-modal large models on discriminative tasks.

Two-stage solution

△ Figure 2. Framework diagram of the fine-grained visual recognition large model (Fine-R1)

As shown in Figure 2, the construction process of Fine-R1 includes two main steps:

1. Chain-of-thought supervised fine-tuning: mimic the human reasoning process and quickly instill reasoning ability in the multi-modal large model through supervised fine-tuning on structured chains of thought.

2. Triplet-enhanced policy optimization: during reinforcement fine-tuning, select positive samples (same subcategory) and negative samples (different subcategories). Introducing the reasoning trajectories of positive samples improves the model's robustness to intra-class variation, while maximizing the divergence between the predicted distributions for the input image and the negative sample sharpens its discriminability across classes.

Specifically:

Stage I: Chain-of-thought supervised fine-tuning. First, structured chains of thought are built with Qwen2.5-VL-32B for a small amount of fine-grained visual recognition data, decomposing the reasoning process into four steps: visual analysis, candidate-subcategory generation, comparative analysis, and final prediction. The base model is then supervised fine-tuned on this chain-of-thought data, prompting it to draw on the subcategory knowledge in the training set to generate candidate subcategories for the input image and to lock in the final prediction through comparative analysis.
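As a concrete illustration, one training sample following this four-step structure might look like the sketch below; the field names and tags are assumptions for illustration, not Fine-R1's actual data format:

```python
# Hypothetical chain-of-thought SFT sample with the four reasoning steps:
# visual analysis -> candidate generation -> comparative analysis -> prediction.
cot_sample = {
    "image": "images/aircraft_0001.jpg",
    "question": "Which fine-grained subcategory does the aircraft belong to?",
    "response": (
        "<visual_analysis>Narrow body, T-tail, three rear-mounted engines, "
        "one fed by an S-duct in the tail.</visual_analysis>\n"
        "<candidates>Boeing 727, Tupolev Tu-154, Hawker Siddeley Trident"
        "</candidates>\n"
        "<comparison>The wing sweep and fuselage proportions match the "
        "Boeing 727 rather than the Tu-154 or Trident.</comparison>\n"
        "<answer>Boeing 727</answer>"
    ),
}
```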

Stage II: Triplet-enhanced policy optimization. After chain-of-thought supervised fine-tuning, to tackle the hallmark difficulty of fine-grained recognition, namely large intra-class variation with small inter-class differences, the model's reasoning path is further optimized to improve its robustness to intra-class variation and its discriminability across classes simultaneously. Concretely, each input image is paired with a positive sample from the same subcategory and a negative sample that looks highly similar but belongs to a different subcategory, forming a triplet used for intra-class and inter-class enhancement.
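The article does not detail how triplets are mined. One plausible sketch is nearest-neighbor search in a visual embedding space: the positive is the most similar image of the same subcategory, and the negative is the most similar-looking image from a different subcategory:

```python
import numpy as np

def build_triplet(idx: int, embeddings: np.ndarray, labels: np.ndarray):
    """Return (anchor, positive, negative) indices. The mining rule here,
    cosine similarity over visual embeddings, is an assumption, not
    necessarily Fine-R1's exact procedure."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb[idx]            # cosine similarity to the anchor
    sims[idx] = -np.inf              # never pick the anchor itself
    same = labels == labels[idx]
    pos = int(np.argmax(np.where(same, sims, -np.inf)))   # closest same-class image
    neg = int(np.argmax(np.where(~same, sims, -np.inf)))  # closest look-alike from another class
    return idx, pos, neg
```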

(1) Intra-class enhancement: the reasoning trajectories of both the input image and its positive sample are used together, capturing a wider range of intra-class variation and making the model more robust to it. Specifically, the old policy generates two groups of trajectories: one of responses to the original image-question pair, and one of responses to the positive-sample image-question pair. All rewards are aggregated into a unified reward pool for the subsequent computation.

When the model predicts differently for the input image and its positive sample, the resulting reward gap pushes it to focus only on the features that are discriminative for the subcategory and to ignore irrelevant ones.
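A rough sketch of this unified reward pool, assuming GRPO-style group normalization over the union of the two rollout groups (the exact normalization used by Fine-R1 is not given in the article):

```python
import numpy as np

def pooled_advantages(rewards_input: np.ndarray, rewards_pos: np.ndarray):
    """Merge rewards from rollouts on the input image and on its positive
    sample, then normalize over the combined pool. If one branch answers
    correctly and the other does not, the shared baseline creates the reward
    gap described above, penalizing features that do not generalize across
    the subcategory."""
    pool = np.concatenate([rewards_input, rewards_pos])
    adv = (pool - pool.mean()) / (pool.std() + 1e-8)
    return adv[:len(rewards_input)], adv[len(rewards_input):]
```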

(2) Inter-class enhancement: encourage the model to produce different responses for visually similar images from different subcategories, improving its discriminability across classes. To quantify how well the model separates subcategories, define for each sampled response o_i the probability ratio

g_i^inter(θ) = π_θ^neg(o_i) / π_θ(o_i),

where π_θ is the model's output distribution conditioned on the input image and π_θ^neg is the same policy conditioned on the negative-sample image.

The model's discriminability is then enhanced by maximizing the KL divergence between its output distributions for the input/positive-sample image and the negative-sample image, estimated as

𝔻_KL[π_θ ‖ π_θ^neg] = g_i^inter(θ) − log g_i^inter(θ) − 1.

The final objective combines the intra-class and inter-class enhancement terms, with one coefficient weighting the KL divergence term and two further coefficients weighting the corresponding entropy terms. In the reward, is_included(a, o_i) checks whether the model's output o_i contains the ground-truth category name a.
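In code, the inter-class term and the answer check could look like the following sketch (tensor shapes, the token-level reduction, and the string-matching rule are assumptions):

```python
import torch

def inter_class_kl_term(logp_input: torch.Tensor, logp_neg: torch.Tensor):
    """Estimate D_KL[pi_theta || pi_theta^neg] as g - log(g) - 1 with
    g = pi_theta^neg(o_i) / pi_theta(o_i), matching the identity above.
    Both arguments are log-probabilities of the same sampled response o_i,
    conditioned on the input image and on the negative-sample image."""
    g = torch.exp(logp_neg - logp_input)    # probability ratio g_i^inter(theta)
    return (g - torch.log(g) - 1.0).mean()  # non-negative; maximized in training

def is_included(answer: str, output: str) -> float:
    """Binary reward: 1.0 if the ground-truth category name appears in the
    model's output (case-insensitive substring match as an assumption)."""
    return 1.0 if answer.lower() in output.lower() else 0.0
```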

Experimental results

△ Table 1. Results of closed-set recognition (multiple-choice questions) of the fine-grained visual recognition large model (Fine-R1)

Table 1 shows the results of closed-set recognition (multiple-choice questions) on six widely used fine-grained image classification benchmarks. With only 4 training images per category, Fine-R1's recognition accuracy on subcategories both inside and outside the training set exceeds that of discriminative models such as OpenAI's CLIP and Google DeepMind's SigLIP.

△ Table 2. Results of open-set recognition (open-ended question answering) of the fine-grained visual recognition large model (Fine-R1)

Table 2 shows the results of open-set recognition (open-ended question answering), where no candidate categories are given in advance and the large model must directly output the recognized category name. Again with only 4 training images per category, Fine-R1's accuracy on subcategories both inside and outside the training set exceeds that of mainstream general-purpose multi-modal large models and reasoning models.
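The two protocols differ only in the prompt: closed-set evaluation lists candidate options, while open-set evaluation asks for the name directly. Illustrative prompts follow; the exact wording used in the benchmarks is an assumption:

```python
def closed_set_prompt(options: list[str]) -> str:
    """Multiple-choice prompt: the answer space is restricted to the options."""
    listed = "\n".join(f"{chr(ord('A') + i)}. {name}"
                       for i, name in enumerate(options))
    return f"Which subcategory is shown in the image? Choose one option.\n{listed}"

def open_set_prompt() -> str:
    """Open-ended prompt: the model must generate the category name itself."""
    return ("What is the fine-grained subcategory of the object in the image? "
            "Answer with the category name only.")
```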

△ Figure 3. Visualization results of positive and negative sample pairs (Left: Qwen2.5-VL, Right: Fine-R1)

To explore why Fine-R1 improves, three hypotheses were proposed, matching the three capabilities multi-modal large models need for fine-grained visual recognition: (1) better separability of visual representations; (2) a larger reserve of subcategory knowledge; (3) better use of subcategory knowledge. Experimental analysis shows that Fine-R1 raises recognition accuracy mainly by improving the model's ability to use fine-grained subcategory knowledge, rather than by optimizing visual representations or enlarging its knowledge reserve.

△ Figure 4. Case demonstration of the fine-grained visual recognition large model (Fine-R1)

The case study in Figure 4 shows that Fine-R1 accurately identifies fine-grained subcategories by decomposing its thinking into visual analysis, candidate-subcategory generation, comparative analysis, and final prediction, reasoning step by step with its knowledge.

Paper title:

Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Paper link:

https://arxiv.org/pdf/2602.07605

Open-source code:

https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026

Model address:

https://huggingface.co/collections/StevenHH2000/fine-r1

Lab website:

https://www.wict.pku.edu.cn/mipl

This article is from the WeChat official account "QbitAI". Author: FineR1 Team. Republished by 36Kr with permission.