The GPT era of AI vision: Meta's new model segments the world with one click, and netizens are calling it insane.
Intelligence East reported on November 20 that Meta announced a brand-new model family, SAM 3D, and released two 3D models: SAM 3D Objects for object and scene reconstruction, and SAM 3D Body for human pose and body-shape estimation.
Let's first look at the results. After the user clicks on an element in an image, the SAM 3D models can extract a 3D model directly from the 2D picture. Whether the target is an object or a person, it is reconstructed accurately, and even when the result is rotated a full 360 degrees it remains essentially flawless.
SAM stands for Segment Anything Model, literally a model that can "segment everything". Meta previously open-sourced two 2D image segmentation models, SAM 1 and SAM 2, both benchmark works in this field.
On the same day as the SAM 3D release, SAM 3, which had already sparked heated discussion while under review for the ICLR conference, was officially released as well. The highlight of the SAM 3 image segmentation model is a new capability called "promptable concept segmentation".
In the past, most image segmentation models could only segment images according to a limited set of preset labels. SAM 3, by contrast, lets users supply specific labels such as "dog", "elephant", or "zebra", broader concepts such as "animal", or even descriptions like "a person wearing a black coat and a white hat", which greatly improves the versatility of image segmentation models.
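To make the idea concrete, here is a minimal sketch of what a promptable concept-segmentation call could look like from user code; the ConceptSegmenter class, its segment method, the checkpoint name, and the returned mask structure are hypothetical placeholders rather than SAM 3's actual API.

```python
# Hypothetical usage sketch of promptable concept segmentation.
# ConceptSegmenter, segment(), and the checkpoint name are illustrative
# placeholders, not the real SAM 3 interface.
from dataclasses import dataclass
from typing import List

@dataclass
class Mask:
    label: str    # concept the mask belongs to, e.g. "dog"
    score: float  # model confidence for this instance
    rle: str      # run-length-encoded binary mask (placeholder)

class ConceptSegmenter:
    """Stand-in for a promptable concept-segmentation model."""
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # path to model weights (assumed)

    def segment(self, image_path: str, prompt: str) -> List[Mask]:
        # A real model would run inference here and return one mask per
        # object instance matching the prompt; this stub returns dummy data.
        return [Mask(label=prompt, score=0.9, rle="<rle>")]

model = ConceptSegmenter("sam3-checkpoint.pt")
# A specific label, a broad category, or a free-form description can all
# serve as the concept prompt in this setting.
for prompt in ["zebra", "animal", "a person wearing a black coat and a white hat"]:
    masks = model.segment("photo.jpg", prompt)
    print(f"{prompt!r}: {len(masks)} instance(s)")
```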
SAM 3 is also extremely fast at inference: on a single NVIDIA H200 GPU, it can process an image containing more than 100 detectable objects in about 30 milliseconds.
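For a sense of scale, the small calculation below simply works out the throughput implied by the figures quoted above (a single H200, roughly 30 ms per image, 100-plus detectable objects per image).

```python
# Throughput implied by the quoted figures: ~30 ms per image on one H200,
# with 100+ detectable objects per image.
latency_s = 0.030
images_per_second = 1 / latency_s            # ~33 images per second
masks_per_second = images_per_second * 100   # ~3,300+ object masks per second
print(f"{images_per_second:.0f} images/s, about {masks_per_second:.0f} masks/s")
```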
After SAM 3's release, Nader Khalil, an NVIDIA developer technology expert, exclaimed: "This might be the ChatGPT moment for computer vision. The powerful segmentation capability means users can train computer vision models with just one click. It's amazing."
Meta has already begun commercializing SAM 3D Objects and SAM 3. Facebook Marketplace now offers a new "View in Room" feature that lets shoppers get an intuitive sense of how a piece of home decor would look and fit in their space before buying furniture.
Currently, both the SAM 3D models and SAM 3 can be tried out in Meta's newly launched Segment Anything Playground. For SAM 3D, the training and evaluation data, evaluation benchmarks, model checkpoints, inference code, and parameterized human model have all been open-sourced; for SAM 3, the model checkpoints, evaluation datasets, and fine-tuning code are open-sourced.
SAM 3D blog (including papers and open-source links): https://ai.meta.com/blog/sam-3d/
SAM 3 blog (including papers and open-source links): https://ai.meta.com/blog/segment-anything-model-3/
01. Label nearly one million images and complete full-texture 3D reconstruction in seconds
3D modeling has long faced a data-scarcity problem. Compared with abundant resources such as text and images, real-world 3D data is extremely scarce, so most models could only handle isolated synthetic assets or reconstruct a single high-resolution object against a simple background, leaving 3D reconstruction ineffective in real-world scenarios.
SAM 3D Objects breaks through this limitation. Using a powerful data annotation engine, it performs fine-grained 3D object annotation on large-scale natural images: nearly one million images were labeled, generating more than 3.14 million mesh models.
The process combines a "crowdsourcing + expert" model: ordinary data annotators score multiple candidate reconstructions generated by the model, and the hardest cases are handled by senior 3D artists.
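As a rough illustration of how such a "crowdsourcing + expert" loop might route work, the sketch below accepts candidates that crowd annotators rate highly and escalates the hardest cases to 3D artists; the score scale, thresholds, and function names are assumptions for illustration, not details of Meta's data engine.

```python
# Illustrative routing for a "crowdsourcing + expert" annotation loop.
# Score scale (1-5), thresholds, and helper names are hypothetical.
from typing import List

def route_candidates(candidate_scores: List[float],
                     accept_threshold: float = 4.0,
                     escalate_threshold: float = 2.5) -> str:
    """Decide what happens to the model-generated 3D candidates for one image.

    candidate_scores: ratings given by crowd annotators to each candidate
    reconstruction the model produced for this image.
    """
    best = max(candidate_scores)
    if best >= accept_threshold:
        return "accept_best_candidate"       # crowd rating is good enough
    if best >= escalate_threshold:
        return "request_more_crowd_ratings"  # ambiguous, gather more votes
    return "send_to_3d_artist"               # hardest cases go to experts

print(route_candidates([4.6, 3.1, 2.0]))  # accept_best_candidate
print(route_candidates([2.1, 1.8, 1.5]))  # send_to_3d_artist
```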
SAM 3D Objects also borrows a training idea from large language models, reframing learning from synthetic data as "3D pre-training", followed by fine-tuning in later stages so that the model performs well on real images.
This approach not only improves the model's robustness and output quality but also makes data generation more efficient, creating a virtuous cycle between the data engine and model training.
To validate the results, the team also collaborated with artists to build the SAM 3D Artist Object dataset (SA-3DAO), the first dataset specifically designed to evaluate single-image 3D reconstruction on real-world images. Its images and objects are considerably more challenging than those in existing benchmarks.
In terms of performance, SAM 3D Objects beat existing leading models at a ratio of 5:1 in head-to-head human preference tests. At the same time, by combining diffusion shortcuts with other optimizations, it completes full-texture 3D reconstruction in seconds, making near-real-time 3D applications possible, such as providing instant visual perception for robots.
It reconstructs not only an object's shape, texture, and pose but also lets users freely move the camera to view the scene from different angles. Even with small objects, occlusions, or indirect viewpoints, SAM 3D Objects can extract 3D detail from everyday photos.
Of course, the model still has room to improve. Its output resolution is limited, so details of complex objects may be lost, and layout prediction still works on single objects, with reasoning about physical interactions between multiple objects not yet realized.
In the future, with higher output resolution and joint reasoning over multiple objects, SAM 3D Objects is expected to deliver more detailed and natural 3D reconstruction in real-world scenes.
02. More interactive and controllable 3D reconstruction, equipped with a new open-source 3D format
While SAM 3D Objects targets 3D reconstruction of objects, SAM 3D Body focuses on 3D reconstruction of the human body. It can accurately estimate 3D human pose and body shape from a single image, and it produces stable output even for unusual postures, partial occlusion, or complex multi-person scenes.
Notably, SAM 3D Body accepts prompts: users can guide and control the model's predictions with segmentation masks, 2D keypoints, and similar inputs, making 3D reconstruction more interactive and controllable.
At the core of SAM 3D Body is an open-source 3D mesh format called Meta Momentum Human Rig (MHR), which separates the body's skeletal structure from its soft-tissue shape and thereby improves the interpretability of the model's output.
The model uses a transformer encoder-decoder architecture: the image encoder captures high-resolution details across the body, while the mesh decoder supports prompt-based 3D mesh prediction. This design lets users not only obtain an accurate 3D human model but also flexibly adjust and refine the result interactively.
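The separation MHR draws between skeletal structure and soft-tissue shape can be pictured as two independent parameter blocks that are only combined at mesh-building time. The sketch below illustrates that idea with made-up parameter counts, names, and a toy "skinning" step; it is not the MHR format itself.

```python
# Conceptual sketch of a rig that keeps skeletal pose and soft-tissue shape
# separate, in the spirit of MHR. Parameter counts, names, and the toy
# mesh-building step are illustrative assumptions, not the MHR specification.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SkeletonPose:
    # per-joint axis-angle rotations; the joint count here is made up
    joint_rotations: np.ndarray = field(default_factory=lambda: np.zeros((52, 3)))

@dataclass
class BodyShape:
    # low-dimensional soft-tissue shape coefficients; dimension is made up
    shape_coeffs: np.ndarray = field(default_factory=lambda: np.zeros(16))

def build_mesh(pose: SkeletonPose, shape: BodyShape) -> np.ndarray:
    """Toy stand-in: a real rig would skin a template mesh with these params."""
    template = np.zeros((1000, 3))  # placeholder vertex positions
    offset = float(shape.shape_coeffs.sum() + pose.joint_rotations.sum())
    return template + offset

# Because pose and shape live in separate blocks, either can be edited or
# constrained (e.g. from 2D keypoint prompts) without disturbing the other.
mesh = build_mesh(SkeletonPose(), BodyShape())
print(mesh.shape)  # (1000, 3)
```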
On the data side, the SAM 3D Body team drew on billions of images, high-quality multi-camera video, and professional synthetic data. An automated data engine selected high-value images, such as those with rare poses, occlusions, or complex clothing, yielding roughly 8 million high-quality training samples.
This data strategy keeps the model robust across diverse scenarios, and combined with prompt-based multi-step refinement during training, it aligns the 3D predictions more precisely with the 2D visual evidence.
Published benchmark results show that SAM 3D Body holds a significant advantage on multiple 3D human body benchmarks, leading previous models in both accuracy and robustness.
In addition, the team has open-sourced the MHR model itself. This parameterized human body model is usable under a commercial license, allowing Meta technologies such as Codec Avatars to be put into practice.
SAM 3D Body currently handles one person at a time and does not yet model interactions between multiple people or between people and objects, which limits accurate inference of relative positions and physical contact. Hand pose estimation also still trails specialized hand-pose methods in accuracy.
Going forward, SAM 3D Body plans to bring person-person, person-object, and person-environment interactions into training and to improve hand pose reconstruction, making the model more complete and natural in real-world scenarios.
03. Enhanced segmentation flexibility, with AI deeply involved in data construction
If the SAM 3D series of models represents Meta's first breakthrough in the field of 3D visual reconstruction, then SAM 3 is a continuation of Meta's exploration in the field of 2D image segmentation.
SAM 3 is a unified model that can detect, segment, and track objects based on text, example images, or visual prompts. Its openness and interactivity increase the possibilities for visual creation and scientific research.
Through "promptable concept segmentation", SAM 3 can recognize more complex and subtle concepts, such as "a striped red umbrella" or "a sitting person not holding a gift box in their hand".
To measure large-vocabulary segmentation performance, Meta also released the Segment Anything with Concepts (SA-Co) dataset. This benchmark covers far more concepts than previous ones and stress-tests open-ended concept segmentation in both images and videos.
SAM 3 supports multiple prompt forms, including text phrases, example images, and visual prompts such as masks, boxes, and points, enhancing segmentation flexibility.
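One way to picture these mixed prompt types is as a small set of typed inputs that a client packages together. The sketch below is a hypothetical representation of text, exemplar, mask, box, and point prompts, not SAM 3's real input schema.

```python
# Hypothetical representation of the prompt types listed above (text phrases,
# example images, masks, boxes, points). Not SAM 3's actual input schema.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class TextPrompt:
    phrase: str                              # e.g. "a striped red umbrella"

@dataclass
class ExemplarPrompt:
    image_path: str                          # example image showing the concept

@dataclass
class MaskPrompt:
    rle: str                                 # run-length-encoded binary mask

@dataclass
class BoxPrompt:
    xyxy: Tuple[float, float, float, float]  # bounding box in pixel coordinates

@dataclass
class PointPrompt:
    xy: Tuple[float, float]
    positive: bool = True                    # click inside vs. outside the object

Prompt = Union[TextPrompt, ExemplarPrompt, MaskPrompt, BoxPrompt, PointPrompt]

prompts: List[Prompt] = [
    TextPrompt("a sitting person not holding a gift box"),
    BoxPrompt((120.0, 40.0, 380.0, 520.0)),
    PointPrompt((250.0, 300.0)),
]
for p in prompts:
    print(type(p).__name__, p)
```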
Test results published by Meta show that SAM 3 roughly doubles concept segmentation performance (an improvement of about 100%) on the SA-Co benchmark. In user preference tests against the strongest competing model, OWLv2, SAM 3's output was preferred at a ratio of roughly 3:1.
In addition, SAM 3 remains the leading performer on the traditional SAM 2 visual segmentation tasks and makes significant progress on challenging tasks such as zero-shot LVIS and object counting.
For data construction, SAM 3 relies on a data engine that combines humans and AI: the SAM 3 and Llama 3.2v models first generate initial segmentation masks and labels automatically, and human and AI annotators then verify and correct them.
AI annotators not only speed up annotation (about 400% faster for negative samples and about 36% faster for positive samples) but also automatically filter out easy samples, concentrating human effort on the most challenging cases.
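As a rough illustration of how AI verifiers can cut human workload, the sketch below auto-accepts confident positives, auto-rejects confident negatives, and escalates only ambiguous masks to human annotators; the verifier score, thresholds, and names are assumptions, not details of Meta's pipeline.

```python
# Illustrative AI-first triage step for a human + AI data engine.
# The verifier score, thresholds, and labels are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class ProposedMask:
    image_id: str
    concept: str
    ai_confidence: float  # score from an AI verifier model (assumed)

def triage(mask: ProposedMask,
           auto_accept: float = 0.95,
           auto_reject: float = 0.10) -> str:
    if mask.ai_confidence >= auto_accept:
        return "accepted_by_ai"        # easy positive, no human needed
    if mask.ai_confidence <= auto_reject:
        return "rejected_by_ai"        # easy negative, no human needed
    return "sent_to_human_annotator"   # ambiguous cases get human review

for m in [ProposedMask("img_a", "dog", 0.98),
          ProposedMask("img_b", "dog", 0.55),
          ProposedMask("img_c", "dog", 0.02)]:
    print(m.image_id, triage(m))
```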
At the same time, Meta uses a concept ontology (a concept dictionary based on Wikipedia) to expand the data coverage, enabling rare concepts to also receive annotation support.
Ablation experiments show that the strategy of combining AI and human annotation can significantly improve model performance and provide a feasible way for automatic data generation in new visual domains.
Architecturally, SAM 3 combines several advanced components: its text and image encoders are based on the Meta Perception Encoder, its detector uses the DETR architecture, and its tracking component carries over the memory module from SAM 2.
By handling detection, segmentation, and tracking in one unified architecture, SAM 3 avoids conflicts between tasks on complex visual workloads while keeping performance high and training efficient.
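At a block-diagram level, that composition can be sketched as one model object that shares its encoders across detection, segmentation, and tracking. The class and method names below are illustrative stand-ins for the components named above, not Meta's actual code.

```python
# Block-level sketch of one model handling detection, segmentation, and
# tracking with shared encoders, mirroring the composition described above.
# Class and method names are illustrative stand-ins, not Meta's code.
class PerceptionEncoderStub:
    """Shared text + image encoder (stand-in for the Meta Perception Encoder)."""
    def encode_image(self, image): ...
    def encode_text(self, phrase): ...

class DetrStyleDetectorStub:
    """Detection head over encoder features (DETR-style, stand-in)."""
    def detect(self, image_feats, text_feats): ...

class MemoryTrackerStub:
    """Video tracker with a memory module (in the spirit of SAM 2)."""
    def track(self, detections, memory_state): ...

class UnifiedSegmenter:
    """A single object owns all three heads, so detection, segmentation,
    and tracking share representations instead of competing."""
    def __init__(self):
        self.encoder = PerceptionEncoderStub()
        self.detector = DetrStyleDetectorStub()
        self.tracker = MemoryTrackerStub()

    def run_frame(self, frame, phrase, memory_state=None):
        image_feats = self.encoder.encode_image(frame)
        text_feats = self.encoder.encode_text(phrase)
        detections = self.detector.detect(image_feats, text_feats)
        return self.tracker.track(detections, memory_state)
```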
SAM 3 still has room to improve in some extreme scenarios, such as recognizing specialist terminology (e.g., "platelets") zero-shot or handling long, complex text descriptions. In video, SAM 3 processes each object separately, so efficiency and performance in multi-object scenes can still be optimized.
Meta provides model fine - tuning methods and tools, encouraging the open - source community to adapt and expand for specific tasks and visual domains.
04. Conclusion: Generative AI is changing the game of CV
The rise of generative AI is now feeding back into the earlier wave of AI centered on computer vision. From dataset creation to new model-training methods, generative AI has expanded what CV models can do and opened up new ways of building and using them.
We can also see that Meta is actively applying these technologies to real-world products. As data and user feedback accumulate, the SAM and SAM 3D model families may have more surprises in store.