
SAM 3 Appears at ICLR 2026: The Next Step in Segmenting Everything – Enabling the Model to Understand "Concepts"

机器之心 · 2025-10-13 17:13
Is Meta's "Segment Anything" up to something new again?

Name a concept, and SAM 3 will understand what you're referring to and precisely delineate the boundaries wherever it appears.

Is Meta's "Segment Anything" reaching new heights?

On September 12, an anonymous submission titled "SAM 3: SEGMENT ANYTHING WITH CONCEPTS" appeared among the ICLR 2026 papers on OpenReview, drawing wide attention online.

  • Paper title: SAM 3: Segment Anything with Concepts
  • Paper link: https://openreview.net/forum?id=r35clVtGzw

People speculate that the paper comes from Meta, since the writing style closely matches Meta's previous papers. And because both SAM and SAM 2 were launched by Meta, it is all but certain that SAM 3 is the official sequel to the "Segment Anything" series.

In terms of timing, the paper's appearance fits Meta's rhythm almost perfectly. SAM 1 was published in April 2023 and was nominated for best paper at ICCV that year. Its (zero-shot) segment-anything capability prompted researchers to exclaim that "CV no longer exists," and it was hailed as the "GPT-3 moment" of computer vision.

SAM 2 was published in July 2024. Building on its predecessor, it provides real-time, promptable object segmentation for both static images and dynamic video content, unifying image and video segmentation capabilities into a powerful system.

Now, another year has passed. It seems that the debut of SAM 3 is just in time.

So, what new progress does SAM 3 bring?

The paper defines a more advanced task: Promptable Concept Segmentation (PCS).

It takes text and/or image examples as input, predicts instance masks and semantic masks for each object that matches the concept, and maintains the consistency of object identities across video frames. The focus of this work is to identify atomic visual concepts. Therefore, the input text is restricted to simple noun phrases, such as "red apples" or "striped cats". Just describe what you want, and it can find and segment every corresponding instance in the image or video.
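To make that contract concrete, here is a minimal Python sketch of what a PCS query and its output might look like. Every name in it is our own assumption for illustration, not an interface from the paper.

```python
# Minimal sketch of the PCS input/output contract as described above.
# All names here (ConceptPrompt, PCSResult, ...) are illustrative assumptions,
# not an official SAM 3 interface.
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np


@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                           # e.g. "red apples", "striped cats"
    exemplar_boxes: List[tuple] = field(default_factory=list)   # image examples given as boxes


@dataclass
class PCSResult:
    object_ids: List[int]              # identities kept consistent across video frames
    instance_masks: List[np.ndarray]   # one binary HxW mask per matched object
    semantic_mask: np.ndarray          # union of all instances of the concept


def semantic_from_instances(instance_masks: List[np.ndarray]) -> np.ndarray:
    """The semantic mask is simply the union of the per-instance masks."""
    semantic = np.zeros_like(instance_masks[0], dtype=bool)
    for mask in instance_masks:
        semantic |= mask.astype(bool)
    return semantic
```

The key point is the output shape: one mask and one stable ID per matched instance, plus a semantic mask covering the concept as a whole.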

This means segmentation has finally learned to understand language, not through vague semantic association but through a minimal, vision-grounded form of understanding. Name a concept, and it will understand what you are referring to and precisely delineate the boundaries wherever it appears.

Some of you may remember that SAM 1 also had a text function. What's different this time?

The paper states plainly that in SAM 1, text prompts "were not fully developed"; the actual focus of SAM 1 and SAM 2 was on visual prompts (such as points, boxes, and masks).

They failed to solve a more general task: finding and segmenting all instances of a concept in the input content (for example, all "cats" in a video).

In simple terms, SAM 3 allows users to upgrade from "manually pointing out each instance" to "telling the model a concept, and it will find and segment all corresponding instances for you".

SAM 3 advances on two fronts. In click-based promptable visual segmentation (left of the figure), SAM 3 outperforms SAM 2; it also makes progress on promptable concept segmentation (right of the figure), where users specify a visual concept with a short noun phrase, an image exemplar, or a combination of both, and segment every corresponding instance.

On SA-Co, the new benchmark proposed in the paper, SAM 3 performs at least twice as well as previous systems. It also sets SOTA results on multiple public benchmarks: on LVIS, for example, its zero-shot mask average precision reaches 47.0, against a previous best of 38.5.

Meanwhile, the model only takes 30 milliseconds to process an image with over 100 objects on a single H200 GPU.

However, the comment section has also raised doubts about this work. Some people point out that the idea of segmenting objects based on text descriptions is not new. It has long been known as "referring segmentation" in academia, and there has been a considerable amount of research. Therefore, some people think that this work is just "renaming" and repackaging an old concept.

Some comments also suggest that Meta is just "catching up" with the open-source community, as the community has already achieved similar functions by combining different models (for example, combining a detection model with an LLM API).

Method Overview

The paper mentions that SAM 3 is an extension of SAM 2 and has made significant breakthroughs in promptable segmentation for both images and videos.

Compared with SAM 2, SAM 3 performs better in Promptable Visual Segmentation (PVS) and sets a new standard for Promptable Concept Segmentation (PCS).

As for the PCS and PVS tasks: in simple terms, SAM 3 takes concept prompts (simple noun phrases like "yellow school bus", or image exemplars) or visual prompts (points, boxes, and masks) to define which objects to segment in space and time, and individual objects can still be segmented separately.

The focus of the paper, then, is on atomic visual concepts such as "red apples" or "striped cats". As shown in Figure 1, users can segment all instances of a specified visual concept through a short noun phrase, an image exemplar, or a combination of both.

However, PCS has inherent ambiguity: many concepts admit multiple interpretations. The phrase "small window", for example, is subjective (how small counts as small?) and has blurred boundaries (does it include blinds?).

To address this, Meta handles ambiguity systematically across multiple stages, including data collection, metric design, and model training. Consistent with previous SAM versions, SAM 3 remains fully interactive: users can resolve ambiguity by adding refinement prompts that guide the model toward the intended output.
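To illustrate what that interaction could look like in practice, here is a rough sketch of a refinement loop. The idea of feeding confirmed and rejected detections back as exemplars is our own simplification; none of the names below come from the paper.

```python
# Rough sketch of an interactive refinement loop in the spirit described above.
# Function names and signatures are assumptions for illustration only.
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def refine_until_accepted(
    segment: Callable[[str, List[Box], List[Box]], List[Box]],   # stand-in for the model call
    phrase: str,
    review: Callable[[List[Box]], Tuple[List[Box], List[Box]]],  # user splits hits into good/bad
    max_rounds: int = 3,
) -> List[Box]:
    """Query with a noun phrase, let the user confirm/reject detections, re-query."""
    positives: List[Box] = []
    negatives: List[Box] = []
    detections = segment(phrase, positives, negatives)
    for _ in range(max_rounds):
        good, bad = review(detections)
        if not bad:                       # user accepts the current output
            break
        positives.extend(good)            # confirmed instances become positive exemplars
        negatives.extend(bad)             # rejected look-alikes become negative exemplars
        detections = segment(phrase, positives, negatives)
    return detections
```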

In terms of model architecture, SAM 3 adopts a dual encoder-decoder Transformer design: a detector with image-level recognition capability, extended to video by pairing it with a tracker and a memory module. The detector and tracker receive vision-language input through an aligned Perception Encoder (PE) backbone.
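Read as a data flow, that description amounts to: a shared Perception Encoder embeds each frame together with the prompt, a detector produces per-frame instance masks, and (for video) a tracker with a memory module carries object identities forward. The stubs below paraphrase that flow; all class and method names are our own placeholders, not the paper's code.

```python
# The architecture described above, paraphrased as a data flow with stub
# classes. Class and method names are invented for illustration only.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np


@dataclass
class TrackedObject:
    object_id: int
    mask: np.ndarray


class PerceptionEncoderStub:
    """Stand-in for the shared vision-language backbone (PE)."""
    def encode(self, frame: np.ndarray, phrase: str) -> np.ndarray:
        return np.zeros(256)                      # placeholder joint embedding


class DetectorStub:
    """Stand-in for the image-level detector."""
    def detect(self, features: np.ndarray) -> List[np.ndarray]:
        return []                                 # would return per-instance masks


class TrackerStub:
    """Stand-in for the tracker; its memory keeps identities stable over time."""
    def __init__(self) -> None:
        self.memory: Dict[int, np.ndarray] = {}

    def associate(self, masks: List[np.ndarray]) -> List[TrackedObject]:
        tracked = []
        for i, mask in enumerate(masks):
            self.memory[i] = mask                 # naive ID assignment; real matching would use the memory
            tracked.append(TrackedObject(object_id=i, mask=mask))
        return tracked


def segment_video(frames: List[np.ndarray], phrase: str) -> List[List[TrackedObject]]:
    encoder, detector, tracker = PerceptionEncoderStub(), DetectorStub(), TrackerStub()
    per_frame = []
    for frame in frames:
        features = encoder.encode(frame, phrase)      # shared vision-language features
        masks = detector.detect(features)             # image-level instance masks
        per_frame.append(tracker.associate(masks))    # identities carried across frames
    return per_frame
```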

In addition, the team built a scalable human-machine collaborative data engine (shown in the figure below) to annotate a large and diverse training dataset. With this system, they produced high-quality training data covering 4 million unique phrases and 52 million masks, plus a synthetic dataset of 38 million phrases and 1.4 billion masks.

The paper also introduces the Segment Anything with Concepts (SA-Co) benchmark for PCS, covering 214,000 unique concepts across 124,000 images and 1,700 videos, more than 50 times as many concepts as existing benchmarks.

Experiments

Table 1 shows that in the zero-shot setting, SAM 3 is competitive in the bounding box detection tasks on the closed-vocabulary datasets COCO, COCO-O, and LVIS, and performs significantly better on the LVIS mask task.

On the open-vocabulary SA-Co/Gold dataset, the CGF score of SAM 3 is twice that of the strongest baseline OWLv2, and the improvement is even greater on other SA-Co subsets.

The open-vocabulary semantic segmentation experiments on ADE-847, PascalConcept-59, and Cityscapes show that SAM 3 outperforms the powerful expert baseline APE.

Few-shot adaptation. SAM 3 achieves state-of-the-art performance in the 10-shot setting, surpassing Gemini's in-context prompting as well as object-detection specialist models (such as gDino).

PCS with one exemplar. Table 3 shows that SAM 3 significantly outperforms the previous state of the art, T-Rex2, on all three benchmarks: COCO (+17.2), LVIS (+9.7), and ODinW (+20.1).

Object counting. As shown in Table 4, compared with MLLMs, SAM 3 not only achieves good object-counting accuracy but also provides segmentation capabilities that most MLLMs cannot offer.
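This follows naturally from the task formulation: since PCS returns one mask per matched instance, a count is just the length of that list, as the trivial sketch below (names assumed, not from the paper) indicates.

```python
# Counting comes for free from instance segmentation: one predicted mask per object.
from typing import List

import numpy as np


def count_concept_instances(instance_masks: List[np.ndarray]) -> int:
    """The count for a concept is just the number of predicted instance masks."""
    return len(instance_masks)
```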

On text-prompted video segmentation, the results show that SAM 3 far outperforms the baselines, especially on benchmarks containing a large number of noun phrases.

Table 6 compares SAM 3 with advanced methods in the VOS (Video Object Segmentation) task. SAM 3 has made significant improvements over SAM 2 in most benchmarks. For the interactive image segmentation task, SAM 3 outperforms SAM 2 in terms of average mIoU.