
Meta's "Segment Anything" 3.0 is exposed. Concept prompts are added to the skill of semantic segmentation. It's so much fun and is going to be a hit!

QbitAI 2025-10-13 11:46
Multiple entities can be segmented based on text descriptions.

Traditional semantic segmentation is so boring. Add concept prompts, and semantic segmentation becomes so much fun it's amazing. (doge)

SAM 3, the third-generation "Segment Anything" model, has just surfaced: it was quietly submitted to ICLR 2026.

The paper is still in the double-blind review stage, and the authors are anonymous, but the title reveals everything.

Simply put, in this official new version, the segmentation model can finally understand human language: just say what you want, and SAM 3 can segment the corresponding instances in the image/video.

For example, input "striped cat", and SAM 3 finds and segments all the striped cats in the picture by itself.

It's worth mentioning that SAM 3 only takes 30ms to process an image with more than 100 objects and has near-real-time processing capabilities for videos.

The SAM That Can Understand Human Language

SAM 1 introduced interactive segmentation tasks based on visual prompts such as points, boxes, and masks, opening up a new paradigm for segmentation models; SAM 2 added support for videos and memory on this basis.

This time, SAM 3 takes this interactive segmentation a step further: it supports multi-instance segmentation tasks based on concept prompts such as phrases and image examples — yes, it also breaks through the limitation of previous generations that could only handle single instances.

In the paper, the research team of SAM 3 named this new task paradigm PCS (Promptable Concept Segmentation).

PCS: Promptable Concept Segmentation

The definition of PCS is that given an image or video, the model can segment all instances that match the prompted concept based on phrases, image examples, or a combination of both.

Compared with traditional segmentation tasks, PCS emphasizes:

Open vocabulary: Not limited to a predefined, fixed set of categories; users can input any noun phrase as the segmentation target;

Full instance segmentation: Find and segment all instances that match the prompt, and maintain identity consistency across frames in video;

Multi-modal prompts: Support multiple prompt inputs, including text prompts, visual prompts, and combinations of the two (see the sketch after this list);

User interaction: Allow users to refine the segmentation results interactively.
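To make the prompt format more concrete, here is a hypothetical sketch in Python of the data a concept prompt could carry: an open-vocabulary noun phrase, optional positive image exemplars, and optional negative exemplars for interactive refinement. The class and field names are illustrative assumptions, not SAM 3's actual interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical data structure for a PCS-style concept prompt.
# Field names are illustrative only and are not SAM 3's actual API.
@dataclass
class ConceptPrompt:
    phrase: Optional[str] = None  # open-vocabulary noun phrase, e.g. "striped cat"
    positive_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # image exemplars to include
    negative_boxes: List[Tuple[int, int, int, int]] = field(default_factory=list)  # exemplars to exclude during refinement

    def describe(self) -> str:
        parts = []
        if self.phrase is not None:
            parts.append(f'phrase="{self.phrase}"')
        parts.append(f"{len(self.positive_boxes)} positive / {len(self.negative_boxes)} negative exemplars")
        return ", ".join(parts)

# A text-only prompt, and one that mixes text with an image exemplar box:
print(ConceptPrompt(phrase="striped cat").describe())
print(ConceptPrompt(phrase="striped cat", positive_boxes=[(10, 20, 80, 120)]).describe())
```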

New Architecture Design

SAM 3 designed a new architecture to implement PCS.

The main changes are in the detection and segmentation module: SAM 3's detector is based on the DETR (DEtection TRansformer) architecture and can generate instance-level detection results conditioned on language and visual prompts.

At the same time, a Presence Head module is introduced to decouple object recognition (what it is) from localization (where it is).

In traditional object detection frameworks, the model often has to judge whether a target exists and where it is at the same time, which can create conflicting objectives, especially in multi-instance segmentation.

The Presence Head handles the two separately, further improving the model's detection accuracy.
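To make the decoupling concrete, here is a minimal, hypothetical PyTorch sketch of how an image-level presence score could be factored apart from per-query localization scores. The pooling, layer sizes, and score factorization are assumptions for illustration, not Meta's implementation.

```python
import torch
import torch.nn as nn

class PresenceHeadSketch(nn.Module):
    # Minimal sketch of the decoupling idea: an image-level "is the concept present?"
    # score is multiplied with per-query "does this query match the prompt?" scores,
    # so localization queries no longer carry the existence decision alone.
    # Dimensions, layers, and the factorization are assumptions, not Meta's code.
    def __init__(self, dim: int = 256):
        super().__init__()
        self.presence_score = nn.Linear(dim, 1)  # recognition: concept present in the image?
        self.match_score = nn.Linear(dim, 1)     # per-query match to the prompted concept
        self.box_head = nn.Linear(dim, 4)        # localization: per-query box (cx, cy, w, h)

    def forward(self, queries: torch.Tensor):
        # queries: (batch, num_queries, dim) decoder outputs conditioned on the prompt
        global_feat = queries.mean(dim=1, keepdim=True)             # (batch, 1, dim)
        presence = torch.sigmoid(self.presence_score(global_feat))  # (batch, 1, 1)
        match = torch.sigmoid(self.match_score(queries))            # (batch, Q, 1)
        boxes = torch.sigmoid(self.box_head(queries))               # (batch, Q, 4)
        scores = presence * match  # per-instance confidence factorizes the two tasks
        return scores, boxes

# Toy usage with random decoder outputs:
scores, boxes = PresenceHeadSketch()(torch.randn(2, 100, 256))
print(scores.shape, boxes.shape)  # torch.Size([2, 100, 1]) torch.Size([2, 100, 4])
```

The point of the factorization is that each query only answers "does this candidate match the prompt", while the single presence score answers "is the concept in the image at all".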

Large-Scale Data Engine

To improve performance on PCS, the research team also built a dedicated, scalable data engine that produced a training dataset covering 4 million unique concept labels and 52 million verified masks.

The data engine consists of multiple stages that gradually increase the diversity and difficulty of the data.

Throughout the construction process, humans and large language models cross-check each other's work, keeping quality high while improving annotation efficiency.
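As a rough illustration of what such cross-checking can look like as a loop, here is a hypothetical propose-then-verify sketch. The stage structure and every function name are assumptions for illustration; the paper's engine involves real annotators and more stages.

```python
import random

# Hypothetical sketch of the cross-checking idea only: one annotator (an LLM or a human)
# proposes concept labels for an image, another verifies them, and only verified
# (image, concept) pairs move on to mask annotation. Not the paper's actual pipeline.

def propose_concepts(image_id: str) -> list[str]:
    # Stand-in for an LLM/captioner proposing candidate noun phrases for the image.
    pool = ["striped cat", "red umbrella", "wooden bench", "gift box"]
    return random.sample(pool, k=2)

def verify(image_id: str, concept: str) -> bool:
    # Stand-in for a human (or second model) accepting or rejecting the proposal.
    return random.random() > 0.3

def run_stage(image_ids: list[str]) -> list[tuple[str, str]]:
    verified = []
    for image_id in image_ids:
        for concept in propose_concepts(image_id):
            if verify(image_id, concept):
                verified.append((image_id, concept))
    return verified

random.seed(0)
print(run_stage(["img_001", "img_002"]))
```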

SA-Co Benchmark

To evaluate the performance of the model in open-vocabulary segmentation tasks, the paper also proposes the SA-Co (Segment Anything with Concepts) benchmark.

SA-Co contains 214K unique concepts, 124K images, and 1.7K videos, and the concept coverage can reach more than 50 times that of existing benchmarks.

It should be noted that SAM 3's language handling is still limited to simple phrase prompts: it does not support complex language expressions and lacks the language generation, complex language understanding, and reasoning capabilities of multi-modal large models.

Experimental Results

The experimental results show that SAM 3 sets a new SOTA for promptable segmentation tasks.

In zero-shot segmentation on the LVIS dataset, SAM 3 scores 47.0, a significant improvement over the previous SOTA of 38.5.

In the new SA-Co benchmark test, SAM 3 performs at least twice as well as the baseline method.

In addition, in the PVS (Promptable Visual Segmentation) task for videos, the performance of SAM 3 is also better than that of SAM 2.

The researchers also combined SAM 3 with multi-modal large models (MLLM) to explore solutions for more complex task requirements.

For example, segment "people sitting but not holding gift boxes" in the picture.

The large model first breaks the request down, for example finding the people who are sitting, then excluding those holding gift boxes, and then sends the resulting instructions to SAM 3.

The results show that the combination of SAM 3 + MLLM performs better than models specifically designed for reasoning segmentation, and does not require specialized data for training.
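Here is a hedged sketch of the kind of glue logic such a pipeline could apply once SAM 3 has returned masks for the two simple prompts ("person sitting", "person holding a gift box"). The IoU-based exclusion and the function itself are illustrative assumptions; the paper does not spell out how the MLLM combines intermediate results.

```python
import numpy as np

def keep_a_not_overlapping_b(masks_a, masks_b, iou_thresh=0.5):
    # Keep instances of concept A ("person sitting") whose masks do not overlap any
    # instance of concept B ("person holding a gift box"). Hypothetical post-processing
    # glue; the paper only states that the MLLM decomposes the query into simple
    # phrases and prompts SAM 3 with them.
    kept = []
    for ma in masks_a:
        ious = [(ma & mb).sum() / max((ma | mb).sum(), 1) for mb in masks_b]
        if not ious or max(ious) < iou_thresh:
            kept.append(ma)
    return kept

# Toy usage with random boolean masks standing in for SAM 3 outputs:
rng = np.random.default_rng(0)
sitting = [rng.random((64, 64)) > 0.7 for _ in range(3)]   # masks prompted with "person sitting"
holding = [rng.random((64, 64)) > 0.7 for _ in range(2)]   # masks prompted with "person holding a gift box"
print(len(keep_a_not_overlapping_b(sitting, holding)))
```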

On the H200 GPU, SAM 3 only takes 30ms to process a single image with more than 100 entities. In video tasks, the inference latency increases linearly with the number of targets and can maintain near-real-time performance with about 5 concurrent targets.
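As a back-of-envelope reading of the linear-scaling claim, assume a 30 fps real-time budget of roughly 33 ms per frame; about 5 concurrent targets then implies only a few milliseconds per target. The figures below are illustrative assumptions, not measurements from the paper.

```python
# Purely illustrative arithmetic, not figures from the paper.
frame_budget_ms = 1000 / 30          # ~33.3 ms per frame at an assumed 30 fps
near_realtime_targets = 5            # reported near-real-time point for video
implied_per_target_ms = frame_budget_ms / near_realtime_targets
print(f"~{implied_per_target_ms:.1f} ms per target if video latency scales linearly")
# Many more concurrent targets would blow past the frame budget under the same
# linear scaling, which is why multi-GPU parallelism comes up as a mitigation.
```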

However, the paper also points out that SAM 3 struggles to generalize zero-shot to niche domains such as medical imaging and thermal imaging.

In video segmentation, real-time performance degrades in multi-target scenarios, and multi-GPU parallel processing is needed to keep up.

Paper link: https://openreview.net/forum?id=r35clVtGzw

This article is from the WeChat official account “QbitAI”, author: Yuyang. Republished by 36Kr with permission.