What is the Segment Anything Model, and how does it address real-world constraints?
Foundation models have revolutionized the field of natural language processing (NLP), enabling zero-shot and few-shot generalization. More recently, such models have also been applied to computer vision through contrastive learning techniques such as image-text alignment. However, there is a need to extend foundation models to a wider range of computer vision tasks, such as image segmentation. In this research paper, the authors propose a foundation model for image segmentation, which they call the “Segment Anything” model.
The Segment Anything Model:
The Segment Anything Model (SAM) is designed to handle the promptable segmentation task while addressing real-world constraints. SAM’s architecture consists of an image encoder, a prompt encoder, and a lightweight mask decoder. The image encoder computes an image embedding once per image, while the prompt encoder embeds each prompt; the mask decoder combines the two embeddings to predict segmentation masks. This architecture allows for flexible prompts, real-time mask generation, and ambiguity-aware predictions: SAM can predict multiple masks for a single prompt, making it adaptable to ambiguous situations.
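This decoupled encoder/decoder design can be sketched in code. The class names, shapes, and placeholder computations below are hypothetical, not the real implementation, but they show why one expensive image embedding can serve many cheap prompts:

```python
class ImageEncoder:
    """Heavy backbone: run once per image to produce an embedding."""
    def encode(self, image):
        # Placeholder: a real ViT backbone would return a dense feature map.
        return sum(image) / len(image)

class PromptEncoder:
    """Light module: embeds points, boxes, masks, or text prompts."""
    def encode(self, prompt):
        return prompt

class MaskDecoder:
    """Lightweight decoder: combines both embeddings and, to handle
    ambiguity, predicts several candidate masks per prompt."""
    def decode(self, image_embedding, prompt_embedding, num_masks=3):
        return [f"mask_{i}" for i in range(num_masks)]

def segment(image, prompts):
    image_embedding = ImageEncoder().encode(image)   # expensive, once
    prompt_encoder, decoder = PromptEncoder(), MaskDecoder()
    return [decoder.decode(image_embedding, prompt_encoder.encode(p))
            for p in prompts]                        # cheap, per prompt

masks = segment(image=[0.1, 0.2, 0.3], prompts=[(10, 20), (30, 40)])
print(len(masks), len(masks[0]))  # 2 prompts, 3 candidate masks each
```

Because the image embedding is reused across prompts, only the small decoder runs per interaction, which is what makes real-time use in a browser feasible.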
To achieve strong generalization, SAM requires a large and diverse dataset for training. However, existing segmentation datasets are insufficient, so the authors created a “data engine” that uses model-in-the-loop dataset annotation. The data engine consists of three stages: assisted-manual, semi-automatic, and fully automatic. In the first stage, SAM assists annotators in creating masks. In the second stage, SAM generates masks automatically for a subset of objects, while annotators focus on annotating the remaining objects. In the final stage, SAM generates on average about 100 high-quality masks per image using a regular grid of foreground points.
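The regular grid of foreground points used in the fully automatic stage is simple to construct. The helper below is a hypothetical sketch, assuming a grid of points placed at cell centers (the paper's default grid is 32×32):

```python
def point_grid(width, height, n=32):
    """Return an n x n regular grid of candidate foreground points,
    one at the center of each grid cell (hypothetical helper)."""
    step_x, step_y = width / n, height / n
    return [(step_x * (i + 0.5), step_y * (j + 0.5))
            for j in range(n) for i in range(n)]

points = point_grid(1024, 1024)
print(len(points), points[0])  # 1024 points; first is (16.0, 16.0)
```

Each point is then used as a prompt, and the resulting candidate masks are filtered and de-duplicated to yield the roughly 100 high-quality masks per image.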
The authors created the SA-1B dataset, which includes over 1 billion masks from 11 million licensed and privacy-preserving images. This dataset has 400 times more masks than any existing segmentation dataset and offers high-quality and diverse masks. The SA-1B dataset is expected to be a valuable resource for building new foundation models.
Segment Anything Task:
The Segment Anything Task is inspired by the next token prediction task used in NLP foundation models. It aims to define a segmentation task that enables zero-shot transfer to diverse downstream tasks through prompt engineering. The promptable segmentation task involves providing a prompt, such as foreground/background points, a rough box, a mask, or free-form text, to indicate what to segment in an image. The model must return a valid segmentation mask in response to any prompt, even if the prompt is ambiguous. This task ensures the model can handle ambiguity and adapt to a variety of use cases.
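The prompt types listed above can be gathered into a single illustrative container. The field names below are hypothetical, chosen for this sketch rather than taken from SAM's real API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SegmentationPrompt:
    """Any combination of the prompt types the task allows (illustrative)."""
    points: list = field(default_factory=list)  # (x, y, label): 1=foreground, 0=background
    box: Optional[tuple] = None                 # rough box as (x0, y0, x1, y1)
    mask: Optional[list] = None                 # low-resolution mask from a prior step
    text: Optional[str] = None                  # free-form description

# A single ambiguous click: the model must still return a valid mask.
prompt = SegmentationPrompt(points=[(120, 80, 1)])
print(prompt.box is None)  # unused prompt types simply stay empty
```

The "valid mask" requirement means that even this one-click prompt, which could refer to a shirt, a person, or the whole scene, must produce a reasonable mask for at least one of those interpretations.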
The pre-training algorithm simulates a sequence of prompts for each training sample and compares the model’s mask predictions against ground truth. This method is adapted from interactive segmentation, but the goal is to predict a valid mask for any prompt, even if ambiguous. The ability to handle ambiguity is crucial for use cases like automatic annotation.
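One way the paper makes ambiguity trainable is to score a prompt by its best candidate mask only: during training, just the minimum loss over the predicted masks is backpropagated. A minimal sketch, with a toy pixel-disagreement loss standing in for the real focal and dice losses:

```python
def ambiguity_aware_loss(predicted_masks, ground_truth, loss_fn):
    """Return the lowest loss among candidate masks, so an ambiguous
    prompt is not penalized for producing a valid-but-different mask."""
    return min(loss_fn(m, ground_truth) for m in predicted_masks)

# Toy example: masks are flat pixel lists; the loss counts disagreements.
pixel_disagreement = lambda a, b: sum(x != y for x, y in zip(a, b))
candidates = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
loss = ambiguity_aware_loss(candidates, [1, 1, 0, 0], pixel_disagreement)
print(loss)  # 0: the first candidate matches the ground truth exactly
```

Because only the best-matching candidate is penalized, the model is free to keep its other candidates for alternative interpretations of the same prompt.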
A model trained on the promptable segmentation task can be adapted to various downstream tasks through prompt engineering. For example, given an existing bounding box detector for cats, cat instance segmentation can be solved by feeding the detector's box outputs to the model as prompts; the same mechanism supports uses like automatic dataset labeling.
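This kind of composition is easy to sketch. The detector and the promptable segmenter below are hypothetical stand-ins, but the pattern, feeding one model's boxes to the other as prompts, is the adaptation described above:

```python
def detect_cats(image):
    """Stand-in bounding box detector (hypothetical): returns boxes."""
    return [(10, 10, 50, 60), (80, 20, 140, 90)]

def promptable_segment(image, box):
    """Stand-in promptable segmenter (hypothetical): one mask per box prompt."""
    return {"box": box, "mask": f"mask_for_{box}"}

def cat_instance_segmentation(image):
    # Prompt engineering: each detected box becomes a segmentation prompt.
    return [promptable_segment(image, box) for box in detect_cats(image)]

instances = cat_instance_segmentation(image=None)
print(len(instances))  # one segmented instance per detected cat
```

Neither model is retrained here; the composition happens entirely at inference time, which is the point of the promptable task design.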
Segment Anything Data Engine:
The Segment Anything (SA) Data Engine is responsible for creating the 1.1 billion mask dataset, SA-1B. It consists of three stages: assisted-manual, semi-automatic, and fully automatic.
In the assisted-manual stage, professional annotators label masks using a browser-based interactive segmentation tool powered by SAM. The model-assisted annotation runs in real time, enabling a truly interactive experience. Annotators label objects freely, without semantic constraints. As SAM improves, the average annotation time per mask decreases and the number of masks per image increases.