MarkTechPost@AI, November 23, 2024
Apple Releases AIMv2: A Family of State-of-the-Art Open-Set Vision Encoders

Apple has released AIMv2, a family of open-set vision encoders designed to improve multimodal understanding and object recognition. Drawing on models such as CLIP, AIMv2 adds an autoregressive decoder that generates image patches and text tokens. The family comprises 19 models at different parameter scales and supports multiple input resolutions, making it suitable for a wide range of applications. AIMv2 uses a multimodal autoregressive pre-training framework that pairs a Vision Transformer encoder with a causal multimodal decoder, performs strongly across many benchmarks, and scales well. Its availability on platforms such as Hugging Face also makes it easy for developers and researchers to experiment with and build upon.

🍎 AIMv2 is a family of open-set vision encoders released by Apple, aimed at improving multimodal understanding and object recognition. It includes 19 models across four parameter scales (300M, 600M, 1.2B, and 2.7B) and supports resolutions of 224, 336, and 448 pixels to suit different use cases.

💻 At its core, AIMv2 combines a Vision Transformer (ViT) encoder with a causal multimodal decoder under a multimodal autoregressive pre-training framework. During pre-training, the encoder processes image patches and pairs them with the corresponding text embeddings, while the decoder autoregressively generates image patches and text tokens to reconstruct the original multimodal input.

🏆 In terms of performance, AIMv2 outperforms existing models such as OAI CLIP and SigLIP on most multimodal understanding benchmarks; for example, AIMv2-3B reaches 89.5% top-1 accuracy on ImageNet and also performs strongly on open-vocabulary object detection and referring expression comprehension.

📈 AIMv2's scalability has also been validated: performance keeps improving as data and model size grow, and the models integrate easily with modern tooling such as the Hugging Face Transformers library.

🚀 By simplifying training, scaling effectively, and generalizing across multimodal tasks, AIMv2 advances the development of vision encoders and sets a new standard for open-set vision encoders capable of handling increasingly complex multimodal understanding tasks.

Vision models have evolved significantly over the years, with each innovation addressing the limitations of previous approaches. In the field of computer vision, researchers have often faced challenges in balancing complexity, generalizability, and scalability. Many current models struggle to effectively handle diverse visual tasks or adapt efficiently to new datasets. Traditionally, large-scale pre-trained vision encoders have used contrastive learning, which, despite its success, presents challenges in scaling and parameter efficiency. There remains a need for a robust, versatile model that can handle multiple modalities—such as images and text—without sacrificing performance or requiring extensive data filtering.

AIMv2: A New Approach

Apple has taken on this challenge with the release of AIMv2, a family of open-set vision encoders designed to improve upon existing models in multimodal understanding and object recognition tasks. Inspired by models like CLIP, AIMv2 adds an autoregressive decoder, allowing it to generate image patches and text tokens. The AIMv2 family includes 19 models with varying parameter sizes—300M, 600M, 1.2B, and 2.7B—and supports resolutions of 224, 336, and 448 pixels. This range in model size and resolution makes AIMv2 suitable for different use cases, from smaller-scale applications to tasks requiring larger models.
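To make the model-family description concrete, here is a minimal sketch of loading one AIMv2 checkpoint from Hugging Face and running a single image through it. The checkpoint name, the need for trust_remote_code, and the output field names are assumptions based on typical Hugging Face usage rather than details confirmed by the article; check the model cards for the exact IDs covering the 300M-2.7B sizes and 224/336/448 resolutions.

```python
# Minimal sketch: load an AIMv2 checkpoint and extract per-patch image features.
# "apple/aimv2-large-patch14-224" is an assumed model ID; other sizes and
# resolutions in the family follow the same pattern on Hugging Face.
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

MODEL_ID = "apple/aimv2-large-patch14-224"  # assumed ID

processor = AutoImageProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Dense encoder features, roughly (1, num_patches, hidden_dim).
print(outputs.last_hidden_state.shape)
```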

Technical Overview

AIMv2 incorporates a multimodal autoregressive pre-training framework, which builds on the conventional contrastive learning approach used in similar models. The key feature of AIMv2 is its combination of a Vision Transformer (ViT) encoder with a causal multimodal decoder. During pre-training, the encoder processes image patches, which are subsequently paired with corresponding text embeddings. The causal decoder then autoregressively generates both image patches and text tokens, reconstructing the original multimodal inputs. This setup simplifies training and facilitates model scaling without requiring specialized inter-batch communication or extremely large batch sizes. Additionally, the multimodal objective allows AIMv2 to achieve denser supervision compared to other methods, enhancing its ability to learn from both image and text inputs.
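As a rough illustration of that setup, the toy PyTorch sketch below pairs a small Transformer "encoder" with a causally masked decoder that predicts the next image patch and the next text token from the concatenated sequence. All module sizes, the patch-regression loss, and the loss weighting are placeholders for exposition, not the paper's actual architecture or recipe.

```python
# Toy sketch of a multimodal autoregressive pre-training step in the spirit of
# AIMv2: a ViT-style encoder embeds image patches, a causal decoder predicts the
# next patch / next text token. Dimensions and losses are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, N_PATCH, PATCH_PIX, VOCAB, N_TXT = 256, 196, 3 * 16 * 16, 1000, 16

class ToyViTEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH_PIX, DIM)       # patch pixels -> embeddings
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, patches):                      # (B, N_PATCH, PATCH_PIX)
        return self.blocks(self.embed(patches))      # (B, N_PATCH, DIM)

class ToyCausalDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.pixel_head = nn.Linear(DIM, PATCH_PIX)  # regress the next image patch
        self.token_head = nn.Linear(DIM, VOCAB)      # predict the next text token

    def forward(self, seq):
        T = seq.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(seq, mask=causal)            # causal attention over the sequence
        return self.pixel_head(h[:, :N_PATCH]), self.token_head(h[:, N_PATCH:])

encoder, decoder = ToyViTEncoder(), ToyCausalDecoder()
text_embed = nn.Embedding(VOCAB, DIM)

patches = torch.randn(2, N_PATCH, PATCH_PIX)         # stand-in image patches
tokens = torch.randint(0, VOCAB, (2, N_TXT))         # stand-in caption tokens

vision_feats = encoder(patches)                              # dense image features
seq = torch.cat([vision_feats, text_embed(tokens)], dim=1)   # image prefix + text
pixel_pred, token_logits = decoder(seq)

# Next-patch regression and next-token prediction give dense supervision on both
# modalities: each position predicts the element that follows it.
loss_img = F.mse_loss(pixel_pred[:, :-1], patches[:, 1:])
loss_txt = F.cross_entropy(token_logits[:, :-1].reshape(-1, VOCAB),
                           tokens[:, 1:].reshape(-1))
(loss_img + loss_txt).backward()
```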

Performance and Scalability

AIMv2 outperforms major existing models like OAI CLIP and SigLIP on most multimodal understanding benchmarks. Specifically, AIMv2-3B achieved 89.5% top-1 accuracy on the ImageNet dataset with a frozen trunk, demonstrating notable robustness in frozen encoder models. Compared to DINOv2, AIMv2 also performed well in open-vocabulary object detection and referring expression comprehension. Moreover, AIMv2’s scalability was evident, as its performance consistently improved with increasing data and model size. The model’s flexibility and integration with modern tools, such as the Hugging Face Transformers library, make it practical and straightforward to implement across various applications.
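The frozen-trunk result above refers to evaluating the encoder with its weights fixed and training only a lightweight head on top. The sketch below shows one hedged way to set that up with the Hugging Face Transformers library; the checkpoint name, the mean-pooling choice, and the config attribute are assumptions for illustration and not necessarily the paper's exact evaluation protocol.

```python
# Hedged sketch of frozen-trunk evaluation: the AIMv2 encoder stays frozen and
# only a linear classifier over pooled features is trained.
import torch
import torch.nn as nn
from transformers import AutoModel

MODEL_ID = "apple/aimv2-large-patch14-224"   # assumed checkpoint name
encoder = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)                  # freeze the trunk

NUM_CLASSES = 1000                           # e.g. ImageNet-1k
head = nn.Linear(encoder.config.hidden_size, NUM_CLASSES)  # assumed attribute
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def train_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    """One probing step: frozen features in, classifier gradients only."""
    with torch.no_grad():
        feats = encoder(pixel_values=pixel_values).last_hidden_state.mean(dim=1)
    loss = nn.functional.cross_entropy(head(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```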

Conclusion

AIMv2 represents a meaningful advancement in the development of vision encoders, emphasizing simplicity in training, effective scaling, and versatility in multimodal tasks. Apple's release of AIMv2 offers improvements over previous models, with strong performance on numerous benchmarks, including open-vocabulary recognition and multimodal tasks. The integration of autoregressive techniques enables dense supervision, resulting in robust and flexible model capabilities. AIMv2's availability on platforms like Hugging Face allows developers and researchers to experiment with advanced vision models more easily. AIMv2 sets a new standard for open-set vision encoders, capable of addressing the increasing complexity of real-world multimodal understanding.


Check out the Paper and the AIMv2 family of models on Hugging Face. All credit for this research goes to the researchers of this project.

