
Meta's "Segment Anything" 3.0 ist aufgetaucht. Die Fähigkeit der semantischen Segmentierung hat konzeptionelle Hinweise hinzugefügt. Es ist so spaßig! Es wird ein Riesenhit sein!

Quantum Bit (量子位), 2025-10-13 11:46
Multiple entities can be segmented based on text descriptions.

Conventional semantic segmentation is so boring; add concept prompts to it, though, and it becomes a blast. (doge)

SAM 3, the third-generation "Segment Anything" model, has just surfaced: it has been quietly submitted to ICLR 2026.

The paper is still in double-blind review and the authors are anonymized, but the title says it all.

Put simply, in this new version the segmentation model can finally understand human language: just say what you want, and SAM 3 segments the corresponding instances in images and videos.

For example, if you input "striped cat", SAM 3 can find and segment all striped cats in the image:

It's worth noting that SAM 3 only takes 30 ms to process an image with over 100 objects and also has near real-time processing capabilities for videos.

A SAM That Understands Human Language

SAM 1 introduced interactive segmentation driven by visual prompts such as points, boxes, and masks, opening up a new paradigm for segmentation models; SAM 2 built on this with support for videos and memory.

This time, SAM 3 takes interactive segmentation a step further: it supports multi-instance segmentation driven by concept prompts such as phrases and image exemplars. Yes, it overcomes the limitation of its predecessors, which could only handle a single instance at a time.

In the paper, the team behind SAM 3 names this new task paradigm PCS (Promptable Concept Segmentation).

PCS: Promptable Concept Segmentation

PCS is defined as follows: given an image or video, the model segments all instances that match a prompted concept, where the concept is specified by a phrase, by image exemplars, or by a combination of both (a minimal usage sketch follows the list below).

Compared with conventional segmentation tasks, PCS emphasizes:

Open vocabulary: it is not limited to a predefined, fixed set of categories; the user can enter any noun phrase as the segmentation target;

Complete instance segmentation: it finds and segments all instances that match the prompt, and it maintains identity consistency across frames in videos;

Multimodal prompts: it supports different types of prompt input, including text prompts, visual prompts, and combinations of both;

User interaction: the user can fine-tune the segmentation results interactively.
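To make the task interface concrete, here is a minimal sketch of what a concept-prompt request could look like. `ConceptPrompt` and `segment_concepts` are hypothetical names used only for illustration; they are not Meta's released API.

```python
# Hypothetical sketch of a Promptable Concept Segmentation (PCS) request.
# `ConceptPrompt` and `segment_concepts` are illustrative names, not Meta's actual API.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ConceptPrompt:
    # Open-vocabulary noun phrase, e.g. "striped cat" (text prompt).
    phrase: Optional[str] = None
    # Optional image exemplars given as (x1, y1, x2, y2) boxes (visual prompt).
    exemplar_boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

# Usage (hypothetical): segment *all* striped cats in one image,
# optionally refining the concept with an exemplar box.
# masks = segment_concepts(model, image, ConceptPrompt(phrase="striped cat"))
# masks = segment_concepts(model, image, ConceptPrompt(phrase="striped cat",
#                                                      exemplar_boxes=[(10, 20, 200, 180)]))
# Each returned mask is one instance; on video, instance identities stay consistent across frames.
```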

New Architecture Design

SAM 3 introduces a new architecture to implement PCS.

In the detection and segmentation module in particular, SAM 3's detector is based on the DETR (DEtection TRansformer) architecture and generates instance-level detection results conditioned on language and visual prompts.

In addition, a Presence Head module was introduced to decouple object recognition (what it is) from localization (where it is):

In conventional object detection frameworks, models often have to decide simultaneously whether a target object is present and where it is, which can lead to conflicts, especially in multi-instance segmentation tasks.

The Presence Head module handles these two tasks separately, further improving the detection accuracy of the model.
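As a rough illustration of this decoupling, the sketch below shows one plausible way such a head could be wired up in PyTorch: each query keeps its own localization score, while a separate presence branch predicts a single score for whether the prompted concept appears in the image at all, and the two are multiplied into the final detection score. The layers and tensor shapes are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PresenceHead(nn.Module):
    """Illustrative sketch: decouple recognition ("is the concept present?")
    from localization ("where is each instance?"). Shapes and layers are
    assumptions, not the paper's exact implementation."""

    def __init__(self, dim: int):
        super().__init__()
        # Per-query score: how well does this query localize one instance?
        self.localization_score = nn.Linear(dim, 1)
        # Global presence score: is the prompted concept in the image at all?
        self.presence_score = nn.Linear(dim, 1)

    def forward(self, queries: torch.Tensor, prompt_token: torch.Tensor) -> torch.Tensor:
        # queries:      (batch, num_queries, dim) instance queries from the DETR-style decoder
        # prompt_token: (batch, dim) pooled embedding of the concept prompt
        loc = self.localization_score(queries).squeeze(-1)        # (batch, num_queries)
        presence = self.presence_score(prompt_token).squeeze(-1)  # (batch,)
        # Final per-instance score = P(concept present) * P(this query localizes an instance)
        return torch.sigmoid(loc) * torch.sigmoid(presence).unsqueeze(-1)

# Example: scores = PresenceHead(256)(torch.randn(2, 100, 256), torch.randn(2, 256))  # (2, 100)
```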

Large-Scale Data Engine

To support PCS, the team built a scalable data engine that produced a training dataset with 4 million unique concept labels and 52 million verified masks.

The data engine consists of multiple phases and can gradually increase the diversity and difficulty of the data.

Throughout the process, humans and large language models cross-check each other's work, which ensures quality while improving annotation efficiency.

The SA-Co Benchmark

To evaluate the performance of the model in open-vocabulary segmentation tasks, the SA-Co (Segment Anything with Concepts) benchmark was also proposed in the study.

SA-Co includes 214,000 unique concepts, 124,000 images, and 1,700 videos, covering more than 50 times as many concepts as existing benchmarks.

Note, however, that SAM 3's language handling is limited to simple phrase prompts: it does not support complex linguistic expressions and lacks the language generation, complex language understanding, and reasoning capabilities of multimodal large models.

Experimental Results

The experiments show that SAM 3 sets a new state of the art (SOTA) in promptable segmentation tasks.

In zero-shot segmentation on the LVIS dataset, SAM 3 reaches a score of 47.0, a significant improvement over the previous SOTA of 38.5.

On the new SA-Co benchmark, SAM 3 performs at least twice as well as the baselines.

In addition, the performance of SAM 3 in the PVS (Promptable Visual Segmentation) task for videos is also better than that of SAM 2.

The researchers also combined SAM 3 with multimodal large models (MLLMs) to handle more complex requests.

For example, segmenting "people sitting but not holding a gift box" in an image.

The large model first decomposes the request (e.g., find people who are sitting, then exclude those holding a gift box) and then issues the resulting simple prompts to SAM 3.

The results show that the SAM 3 + MLLM combination outperforms models developed specifically for reasoning segmentation, without requiring special training data.
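The pattern described above can be sketched as a simple two-stage loop: an MLLM rewrites the complex request into simple positive and negative noun phrases, SAM 3 is queried once per phrase, and the results are combined with set logic. `mllm_decompose` and `sam3_segment` are hypothetical placeholders, and the sketch assumes segmented instances carry stable IDs so results from different prompts can be intersected; a real pipeline might match masks by IoU instead.

```python
# Hedged sketch of the "MLLM decomposes, SAM 3 segments" pattern described above.
# `mllm_decompose` and `sam3_segment` are hypothetical placeholders, not released APIs.

def mllm_decompose(request: str) -> dict:
    """An MLLM rewrites a complex request into simple noun phrases.
    For "people sitting but not holding a gift box" it could return:"""
    return {"positive": ["person sitting"], "negative": ["person holding a gift box"]}

def segment_with_reasoning(image, request: str, sam3_segment) -> set:
    plan = mllm_decompose(request)
    # One SAM 3 call per simple phrase; each call returns the instances matching that concept.
    keep = {inst.id for phrase in plan["positive"] for inst in sam3_segment(image, phrase)}
    drop = {inst.id for phrase in plan["negative"] for inst in sam3_segment(image, phrase)}
    # Set logic turns the simple queries into the answer to the complex request.
    return keep - drop
```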

On an H200 GPU, SAM 3 takes only 30 ms to process a single image containing over 100 objects. In video tasks, inference latency grows linearly with the number of targets, and near real-time performance is maintained with about 5 concurrent targets.

The paper also notes, however, that SAM 3 struggles to transfer zero-shot to specialized domains such as medical or thermal imaging.

And in multi-target video segmentation scenarios, real-time performance may degrade, requiring parallel processing across multiple GPUs.

Link to the study: https://openreview.net/forum?id=r35clVtGzw

This article is from the WeChat account "Quantum Bit", author: Yuyang, published by 36Kr with permission.