MarkTechPost@AI · Day before yesterday, 01:00
NVIDIA AI Releases Describe Anything 3B: A Multimodal LLM for Fine-Grained Image and Video Captioning

Describe Anything 3B (DAM-3B), released by NVIDIA, is a multimodal large language model focused on generating detailed, region-specific descriptions for images and videos. The model accepts regions specified via points, bounding boxes, scribbles, or masks and produces contextually grounded descriptive text. DAM-3B and its video variant DAM-3B-Video handle both static images and dynamic video inputs, and the models are publicly available on Hugging Face. To overcome data scarcity, NVIDIA developed the DLC-SDP pipeline, a semi-supervised data generation strategy used to build a training corpus of 1.5 million localized examples. On the DLC-Bench evaluation, DAM-3B achieves an average accuracy of 67.3%, outperforming other models in both detail and precision.

🖼️ The core innovations of DAM-3B are a focal prompt and a localized vision backbone with gated cross-attention. The focal prompt fuses the full image with a high-resolution crop of the target region, retaining both regional detail and broader context to improve description accuracy.

🎥 DAM-3B-Video extends this architecture to temporal sequences by encoding frame-wise region masks and integrating them across time, enabling descriptions of specific regions in video that remain effective even under occlusion or motion.

💡 To overcome data scarcity, NVIDIA developed the DLC-SDP pipeline, a semi-supervised data generation strategy. The pipeline uses segmentation datasets and unlabeled web-scale images to curate a training corpus of 1.5 million localized examples, refining region descriptions through self-training to produce high-quality captions.

🏆 For evaluation, the team introduces DLC-Bench, a benchmark that judges description quality by attribute-level correctness. DAM-3B leads across seven benchmarks, surpassing baselines such as GPT-4o and VideoRefer, with strong results in keyword-level, phrase-level, and multi-sentence localized captioning.

🚀 DAM-3B has broad applicability for describing localized content in images and videos, spanning accessibility tools, robotics, and video content analysis. NVIDIA offers the model as a robust, reproducible baseline for future research and a refined technical direction for the next generation of multimodal AI systems.

Challenges in Localized Captioning for Vision-Language Models

Describing specific regions within images or videos remains a persistent challenge in vision-language modeling. While general-purpose vision-language models (VLMs) perform well at generating global captions, they often fall short in producing detailed, region-specific descriptions. These limitations are amplified in video data, where models must account for temporal dynamics. Primary obstacles include a loss of fine-grained detail during visual feature extraction, insufficient annotated datasets tailored for regional description, and evaluation benchmarks that penalize accurate outputs due to incomplete reference captions.

Describe Anything 3B—A Model Tailored for Localized Descriptions

This AI work from NVIDIA presents Describe Anything 3B (DAM-3B), a multimodal large language model purpose-built for detailed, localized captioning across images and videos. Accompanied by DAM-3B-Video, the system accepts inputs specifying regions via points, bounding boxes, scribbles, or masks and generates contextually grounded, descriptive text. It is compatible with both static imagery and dynamic video inputs, and the models are publicly available via Hugging Face.
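To make the region-prompt interface concrete, here is a minimal, illustrative sketch (not NVIDIA's released code) of how points, boxes, or scribbles could be normalized into the binary mask representation that a mask-conditioned captioner consumes. The `region_to_mask` helper and its spec format are assumptions for illustration only.

```python
# Minimal sketch (not NVIDIA's released code): normalizing the different region
# prompts the article mentions -- points, boxes, scribbles, masks -- into one
# HxW binary mask that a mask-conditioned captioner can consume.
import numpy as np

def region_to_mask(spec, height, width):
    """Convert a hypothetical region spec into an HxW boolean mask.

    spec examples (illustrative only):
      {"type": "mask",     "data": <HxW bool array>}
      {"type": "box",      "data": (x0, y0, x1, y1)}
      {"type": "point",    "data": (x, y), "radius": 8}
      {"type": "scribble", "data": [(x, y), ...]}
    """
    mask = np.zeros((height, width), dtype=bool)
    kind, data = spec["type"], spec["data"]
    if kind == "mask":
        mask |= np.asarray(data, dtype=bool)
    elif kind == "box":
        x0, y0, x1, y1 = map(int, data)
        mask[y0:y1, x0:x1] = True
    elif kind in ("point", "scribble"):
        pts = [data] if kind == "point" else data
        r = spec.get("radius", 8)  # dilate clicks/strokes into small disks
        yy, xx = np.ogrid[:height, :width]
        for x, y in pts:
            mask |= (xx - x) ** 2 + (yy - y) ** 2 <= r * r
    else:
        raise ValueError(f"unknown region type: {kind}")
    return mask
```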

Core Architectural Components and Model Design

DAM-3B incorporates two principal innovations: a focal prompt and a localized vision backbone enhanced with gated cross-attention. The focal prompt fuses a full image with a high-resolution crop of the target region, retaining both regional detail and broader context. This dual-view input is processed by the localized vision backbone, which embeds the image and mask inputs and applies cross-attention to blend global and focal features before passing them to a large language model. These mechanisms are integrated without inflating token length, preserving computational efficiency.
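The snippet below is a rough PyTorch sketch of the gated cross-attention idea described above: tokens from the high-resolution focal crop attend to tokens from the full image, and a zero-initialized gate controls how much blended context is injected. Module names, dimensions, and the gating form are assumptions, not details of NVIDIA's implementation.

```python
# Illustrative-only sketch of gated cross-attention between focal and global views.
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    def __init__(self, dim=1024, num_heads=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # tanh(0) = 0, so the block starts as an identity mapping and the model
        # learns how much global context to mix into the focal tokens.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, focal_tokens, global_tokens):
        q = self.norm_q(focal_tokens)       # (B, N_focal, dim) from the region crop
        kv = self.norm_kv(global_tokens)    # (B, N_global, dim) from the full image
        blended, _ = self.attn(q, kv, kv)
        return focal_tokens + torch.tanh(self.gate) * blended

# Usage: fuse a full-image view with a high-resolution crop of the target region.
block = GatedCrossAttentionBlock()
global_tokens = torch.randn(1, 256, 1024)   # full image, downsampled
focal_tokens = torch.randn(1, 256, 1024)    # high-res crop of the region
fused = block(focal_tokens, global_tokens)  # token count stays that of the focal view
```

Because the output keeps the focal view's token count, the fusion adds context without inflating the sequence passed to the language model, consistent with the efficiency point above.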

DAM-3B-Video extends this architecture to temporal sequences by encoding frame-wise region masks and integrating them across time. This allows region-specific descriptions to be generated for videos, even in the presence of occlusion or motion.
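As a hedged illustration of that temporal integration step, the sketch below assumes each frame's masked region has already been reduced to a single feature token by the image backbone and simply applies self-attention across frames; the actual DAM-3B-Video design may differ.

```python
# Rough sketch of temporal integration over per-frame region features.
import torch.nn as nn

class TemporalRegionEncoder(nn.Module):
    def __init__(self, dim=1024, num_heads=16, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_region_tokens):
        """frame_region_tokens: (B, T, dim) -- one pooled region feature per frame,
        e.g. produced from each frame plus its region mask."""
        # Attention across the T frames lets frames where the region is visible
        # inform its description in frames where it is occluded or in motion.
        return self.temporal(frame_region_tokens)
```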

Training Data Strategy and Evaluation Benchmarks

To overcome data scarcity, NVIDIA develops the DLC-SDP pipeline—a semi-supervised data generation strategy. This two-stage process utilizes segmentation datasets and unlabeled web-scale images to curate a training corpus of 1.5 million localized examples. Region descriptions are refined using a self-training approach, producing high-quality captions.
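The following schematic shows the general shape of such a two-stage, self-training pipeline. The `describe`, `propose_regions`, and `describe_with_score` methods are hypothetical stand-ins; this is not the actual DLC-SDP implementation.

```python
# Schematic self-training loop in the spirit of DLC-SDP; all model methods are
# hypothetical placeholders, and the real pipeline details are in NVIDIA's paper.
def build_localized_corpus(seg_dataset, web_images, model, rounds=2, threshold=0.7):
    # Stage 1: seed the corpus from segmentation datasets, expanding short class
    # labels into detailed region descriptions with the current model.
    corpus = [
        (img, mask, model.describe(img, mask, hint=label))
        for img, mask, label in seg_dataset
    ]
    # Stage 2: self-training on unlabeled web-scale images.
    for _ in range(rounds):
        model.train_on(corpus)
        for img in web_images:
            for mask in model.propose_regions(img):
                caption, confidence = model.describe_with_score(img, mask)
                if confidence >= threshold:   # keep only high-quality captions
                    corpus.append((img, mask, caption))
    return corpus
```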

For evaluation, the team introduces DLC-Bench, which assesses description quality based on attribute-level correctness rather than rigid comparisons with reference captions. DAM-3B achieves leading performance across seven benchmarks, surpassing baselines like GPT-4o and VideoRefer. It demonstrates strong results in keyword-level (LVIS, PACO), phrase-level (Flickr30k Entities), and multi-sentence localized captioning (Ref-L4, HC-STVG). On DLC-Bench, DAM-3B achieves an average accuracy of 67.3%, outperforming other models in both detail and precision.
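To illustrate what attribute-level correctness means in contrast to matching a single reference caption, here is a deliberately simplified scoring sketch: a caption is rewarded for covering attributes that do belong to the region and penalized for asserting ones that do not. The real DLC-Bench protocol is more careful than this keyword check.

```python
# Simplified, illustrative attribute-level scorer (not the DLC-Bench judge).
def attribute_level_score(caption, positive_attrs, negative_attrs):
    text = caption.lower()
    covered = sum(attr.lower() in text for attr in positive_attrs)        # correct details
    hallucinated = sum(attr.lower() in text for attr in negative_attrs)   # wrong details
    total = len(positive_attrs) + len(negative_attrs)
    return (covered + (len(negative_attrs) - hallucinated)) / total

# Example: all true attributes mentioned, no false ones asserted -> score 1.0.
score = attribute_level_score(
    "A red pickup truck with a dented tailgate parked on gravel.",
    positive_attrs=["red", "pickup truck", "dented tailgate"],
    negative_attrs=["trailer", "open hood"],
)
print(round(score, 2))  # 1.0
```

Scoring against attributes rather than a reference caption avoids penalizing accurate descriptions simply because the reference text was incomplete, which is one of the evaluation obstacles noted earlier.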

Conclusion

Describe Anything 3B addresses longstanding limitations in region-specific captioning by combining a context-aware architecture with a scalable, high-quality data pipeline. The model’s ability to describe localized content in both images and videos has broad applicability across domains such as accessibility tools, robotics, and video content analysis. With this release, NVIDIA provides a robust and reproducible benchmark for future research and sets a refined technical direction for the next generation of multimodal AI systems.


Check out the Paper, Model on Hugging Face, and Project Page.


