SAM 2: Meta's Next-Gen Model for Video and Image Segmentation

Meta has released SAM 2, a unified model for real-time object segmentation in images and videos that achieves state-of-the-art performance. SAM 2 can quickly identify and separate target objects in images and videos without being trained on specific images. It can be applied to generating video effects, improving visual data annotation tools, and more, and it significantly improves the efficiency of video object segmentation, achieving excellent results across multiple benchmarks. SAM 2 uses a novel model architecture, including an image encoder, memory attention, a prompt encoder, and a mask decoder, which allows it to handle long videos effectively and remember information from past frames to produce accurate object segmentations.

🤔 **SAM 2 extends SAM's capabilities to both images and videos**: SAM 2 can use point, box, and mask prompts to define the extent of the target object to be segmented throughout a video, and when processing images it works much like SAM, generating segmentation masks.

🎬 **SAM 2 uses a memory encoder and a memory bank to store and exploit information from past frames**: The model's memory encoder creates memories based on the current prediction and previous prompts and stores them in a memory bank for use on future frames. The memory attention mechanism then conditions the current frame's embedding on the memory bank to produce the embedding passed to the mask decoder.

🚀 **SAM 2 significantly improves the efficiency and performance of video object segmentation**: Compared with previous interactive video segmentation methods, SAM 2 achieves excellent performance across 17 zero-shot video datasets while requiring about three times fewer human interactions. It is also six times faster than SAM and performs strongly on benchmarks such as DAVIS, MOSE, LVOS, and YouTube-VOS.

💡 **SAM 2 provides image and video prediction APIs for easy use**: SAM 2 offers an image prediction API similar to SAM's, as well as a video prediction API for handling multiple objects and tracking interactions across a video, making it easy for developers to use the model in a variety of applications.

Introduction

The era has arrived where your phone or computer can understand the objects in an image, thanks to technologies like YOLO and SAM.

Meta's Segment Anything Model (SAM) can instantly identify objects in images and separate them without needing to be trained on specific images. It's like a digital magician, able to understand each object in an image with just a wave of its virtual wand. After the successful release of Llama 3.1, Meta announced SAM 2 on July 29th: a unified model for real-time object segmentation in images and videos that achieves state-of-the-art performance.

SAM 2 offers numerous real-world applications. For instance, its outputs can be integrated with generative video models to create innovative video effects and unlock new creative possibilities. Additionally, SAM 2 can enhance visual data annotation tools, speeding up the development of more advanced computer vision systems.

SAM 2 involves a task, a model, and data (Image Source)

What is Image Segmentation in SAM?

Segment Anything (SAM) introduces an image segmentation task where a segmentation mask is generated from an input prompt, such as a bounding box or point indicating the object of interest. Trained on the SA-1B dataset, SAM supports zero-shot segmentation with flexible prompting, making it suitable for various applications. Recent advancements have improved SAM's quality and efficiency. HQ-SAM enhances output quality using a High-Quality output token and training on fine-grained masks. Efforts to increase efficiency for broader real-world use include EfficientSAM, MobileSAM, and FastSAM. SAM's success has led to its application in fields like medical imaging, remote sensing, motion segmentation, and camouflaged object detection.

Dataset Used

Many datasets have been developed to support the video object segmentation (VOS) task. Early datasets feature high-quality annotations but are too small for training deep learning models. YouTube-VOS, the first large-scale VOS dataset, covers 94 object categories across 4,000 videos. As algorithms improved and benchmark performance plateaued, researchers increased the difficulty of the VOS task by focusing on occlusions, long videos, extreme transformations, and both object and scene diversity. Current video segmentation datasets still lack the breadth needed to "segment anything in videos," as their annotations typically cover entire objects within specific classes like people, vehicles, and animals. In contrast, the recently introduced SA-V dataset focuses not only on whole objects but also extensively on object parts, and it contains over an order of magnitude more masks. In total, the SA-V dataset comprises 50.9K videos with 642.6K masklets.

Example videos from the SA-V dataset with masklets (Image Source)

Model Architecture

The model extends SAM to work with both videos and images. SAM 2 can use point, box, and mask prompts on individual frames to define the spatial extent of the object to be segmented throughout the video. When processing images, the model operates similarly to SAM. A lightweight, promptable mask decoder takes a frame's embedding and any prompts to generate a segmentation mask. Prompts can be added iteratively to refine the masks.

Unlike SAM, the frame embedding used by the SAM 2 decoder isn't taken directly from the image encoder. Instead, it's conditioned on memories of past predictions and prompts from previous frames, including those from "future" frames relative to the current one. The memory encoder creates these memories based on the current prediction and stores them in a memory bank for future use. The memory attention operation uses the per-frame embedding from the image encoder and conditions it on the memory bank to produce an embedding that is passed to the mask decoder.

SAM 2 Architecture. In each frame, the segmentation prediction is based on the current prompt and any previously observed memories. Videos are processed in a streaming manner, with frames being analyzed one at a time by the image encoder, which cross-references memories of the target object from earlier frames. The mask decoder, which can also use input prompts, predicts the segmentation mask for the frame. Finally, a memory encoder transforms the prediction and image encoder embeddings (not shown in the figure) for use in future frames. (Image Source)

Here’s a simplified explanation of the different components and processes shown in the figure:

Image Encoder

The image encoder runs once per frame as the video streams in, producing an embedding for each frame; for a single image it behaves just like SAM's image encoder.

Memory Attention

Memory attention takes the current frame's embedding and conditions it on the memory bank, so the prediction can draw on past predictions and prompts from other frames.

Prompt Encoder and Mask Decoder

The prompt encoder handles point, box, and mask prompts, and the lightweight mask decoder combines the conditioned frame embedding with these prompts to predict the segmentation mask. Prompts can be added iteratively to refine the mask.

Memory Encoder and Memory Bank

The memory encoder transforms each prediction (together with the image encoder embeddings) into a memory, which is stored in the memory bank and retrieved when processing future frames.

Training

SAM 2 is trained jointly on image and video data, with interactive prompting simulated during training, so the model learns to respond to prompts on any frame and to propagate masks through the video.

Overall, the model is designed to efficiently handle long videos, remember information from past frames, and accurately segment objects based on interactive prompts.
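
To make this streaming data flow concrete, here is a toy sketch of the per-frame loop in Python. The stub functions are fake stand-ins so the snippet runs on its own; only the order of operations (encode the frame, condition it on memories, decode a mask from prompts, then write a new memory) reflects the architecture described above, not SAM 2's actual code.

import numpy as np

# Fake stand-ins for SAM 2's neural modules, just to make the data flow runnable
def image_encoder(frame): return frame.mean(axis=-1)                 # per-frame embedding
def memory_attention(emb, memories):                                 # condition on past memories
    return emb if not memories else emb + np.mean(memories, axis=0)
def prompt_encoder(prompts): return prompts                          # point/box/mask prompts
def mask_decoder(emb, prompts): return emb > emb.mean()              # toy "segmentation mask"
def memory_encoder(mask, emb): return emb * mask                     # memory of this prediction
def user_prompts_for(frame): return None                             # e.g. clicks on some frames

video_frames = [np.random.rand(8, 8, 3) for _ in range(4)]           # a tiny fake "video"

memory_bank = []                                                     # memories of past frames
for frame in video_frames:
    frame_embedding = image_encoder(frame)                           # 1. encode current frame
    conditioned = memory_attention(frame_embedding, memory_bank)     # 2. attend to memory bank
    mask = mask_decoder(conditioned, prompt_encoder(user_prompts_for(frame)))  # 3. predict mask
    memory_bank.append(memory_encoder(mask, frame_embedding))        # 4. store a new memory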

SAM 2 Performance

SAM 2 Object segmentation
SAM Comparison with SAM 2

SAM 2 significantly outperforms previous methods in interactive video segmentation, achieving superior results across 17 zero-shot video datasets while requiring about three times fewer human interactions. It also outperforms SAM on SAM's zero-shot benchmark suite while being six times faster, and it excels on established video object segmentation benchmarks such as DAVIS, MOSE, LVOS, and YouTube-VOS. With real-time inference at approximately 44 frames per second, SAM 2 is 8.4 times faster than manual per-frame annotation with SAM.

How to install SAM 2?


To start the installation, open up a Paperspace Notebook and start the GPU machine of your choice.

# Clone the repo
!git clone https://github.com/facebookresearch/segment-anything-2.git
# Move to the folder
cd segment-anything-2
# Install the necessary requirements
!pip install -e .

To use the SAM 2 predictor and run the example notebooks, jupyter and matplotlib are required and can be installed by:

pip install -e ".[demo]"

Download the checkpoints

cd checkpoints
./download_ckpts.sh
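
Optionally, a quick smoke test confirms that the package imports and a checkpoint loads. This is a minimal sketch that assumes the large checkpoint downloaded above and that you are back in the repository root; the device fallback and parameter count are purely illustrative.

import torch
from sam2.build_sam import build_sam2

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"

# Build the model on the GPU if one is available; this should complete without errors
device = "cuda" if torch.cuda.is_available() else "cpu"
model = build_sam2(model_cfg, checkpoint, device=device)
print(f"SAM 2 loaded with {sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")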

How to use SAM 2?

Image prediction

SAM 2 can also be used to segment objects in static images. For these use cases, SAM 2 offers image prediction APIs similar to SAM's, and the SAM2ImagePredictor class provides a user-friendly interface for image prompting.

import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(<your_image>)
    masks, _, _ = predictor.predict(<input_prompts>)
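
Continuing from the snippet above, here is one way the placeholders might be filled in with a real image and a single point prompt. The file name example.jpg and the click coordinates are illustrative assumptions, not values from the article.

import numpy as np
import torch
from PIL import Image

# Load an RGB image as an HxWx3 uint8 array
image = np.array(Image.open("example.jpg").convert("RGB"))

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # A single positive click at (x, y); label 1 = foreground, 0 = background
    masks, scores, logits = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,  # return several candidate masks with quality scores
    )

print(masks.shape, scores)  # (num_masks, H, W) and the predicted mask quality scores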

Video prediction

SAM 2 also provides a video predictor that supports multiple objects and uses an inference state to keep track of the interactions in each video.

import torch
from sam2.build_sam import build_sam2_video_predictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"
model_cfg = "sam2_hiera_l.yaml"
predictor = build_sam2_video_predictor(model_cfg, checkpoint)

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    state = predictor.init_state(<your_video>)

    # add new prompts and instantly get the output on the same frame
    frame_idx, object_ids, masks = predictor.add_new_points(state, <your prompts>)

    # propagate the prompts to get masklets throughout the video
    for frame_idx, object_ids, masks in predictor.propagate_in_video(state):
        ...
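
As a hedged sketch of how those placeholders might be filled in: init_state can take a directory of JPEG frames extracted from the video, and prompts are added per frame and per object id. The directory name, frame index, object id, and click coordinates below are illustrative assumptions.

import numpy as np
import torch

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # A directory of JPEG frames extracted from the video (e.g. with ffmpeg)
    state = predictor.init_state(video_path="./video_frames")

    # Prompt object 1 on frame 0 with a single positive click at (x, y)
    frame_idx, object_ids, mask_logits = predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = foreground, 0 = background
    )

    # Propagate through the whole video and keep a binary mask per object per frame
    video_segments = {}
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        video_segments[frame_idx] = {
            obj_id: (mask_logits[i] > 0.0).cpu().numpy()
            for i, obj_id in enumerate(object_ids)
        }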

In the video, we used SAM 2 to segment a coffee mug.

Summary

SAM 2 extends the Segment Anything Model into a single promptable model for real-time object segmentation in both images and videos. Its streaming architecture, built around an image encoder, memory attention, a prompt encoder and mask decoder, and a memory encoder with a memory bank, lets it process long videos frame by frame while remembering past predictions and prompts. Trained with the large-scale SA-V dataset, SAM 2 outperforms prior interactive video segmentation approaches with roughly three times fewer interactions, runs about six times faster than SAM, and exposes simple image and video prediction APIs.

SAM 2 Limitations

