MarkTechPost@AI · 23:30, 2 days ago
Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video

Meta AI has introduced the Perception Encoder (PE), a general-purpose vision model trained with a single contrastive vision-language objective and refined with alignment techniques tailored to downstream tasks. PE breaks with the traditional multi-objective pretraining paradigm, showing that a carefully tuned training recipe and appropriate alignment methods allow contrastive learning alone to produce highly generalizable visual representations. PE models excel at image classification, retrieval, and multimodal reasoning, and achieve state-of-the-art zero-shot classification and retrieval performance on video tasks. The release of PE offers a reproducible and efficient foundation for building multimodal AI systems.

🖼️ The PE model family spans three scales (PEcoreB, PEcoreL, and PEcoreG), with the largest G-scale model containing 2B parameters. The models are designed as general-purpose encoders for both image and video inputs, performing strongly on classification, retrieval, and multimodal reasoning.

⚙️ PE's pretraining follows a two-stage process. The first stage is robust contrastive learning on a large-scale image-text dataset, with enhancements including progressive resolution scaling, large batch sizes, the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.

🎬 The second stage introduces video understanding via a video data engine that synthesizes high-quality video-text pairs. The pipeline combines captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, then summarizes them with Llama 3.3, allowing the same image encoder to be fine-tuned for video tasks via frame averaging.

🤝 PE uses two alignment strategies: language alignment for tasks such as visual question answering and captioning, and spatial alignment for detection, tracking, and depth estimation, using self-distillation and spatial correspondence distillation via SAM2.

🏆 On image classification, PEcoreG reaches 86.6% on ImageNet-val, 92.6% on ImageNet-Adversarial, and 88.2% on the full ObjectNet set, with competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers.

The Challenge of Designing General-Purpose Vision Encoders

As AI systems grow increasingly multimodal, the role of visual perception models becomes more complex. Vision encoders are expected not only to recognize objects and scenes, but also to support tasks like captioning, question answering, fine-grained recognition, document parsing, and spatial reasoning across both images and videos. Existing models typically rely on diverse pretraining objectives—contrastive learning for retrieval, captioning for language tasks, and self-supervised methods for spatial understanding. This fragmentation complicates scalability and model deployment, and introduces trade-offs in performance across tasks.

What remains a key challenge is the design of a unified vision encoder that can match or exceed task-specific methods, operate robustly in open-world scenarios, and scale efficiently across modalities.

A Unified Solution: Meta AI’s Perception Encoder

Meta AI introduces Perception Encoder (PE), a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks. PE departs from the traditional multi-objective pretraining paradigm. Instead, it demonstrates that with a carefully tuned training recipe and appropriate alignment methods, contrastive learning alone can yield highly generalizable visual representations.

The Perception Encoder operates across three scales—PEcoreB, PEcoreL, and PEcoreG—with the largest (G-scale) model containing 2B parameters. These models are designed to function as general-purpose encoders for both image and video inputs, offering strong performance in classification, retrieval, and multimodal reasoning.
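The single training objective behind this family is a standard CLIP-style contrastive loss over matched image-text pairs. A minimal sketch of that objective (illustrative, not Meta's implementation; the function and argument names are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    image_feats, text_feats: (B, D) embeddings; row i of each is a matched pair.
    """
    # L2-normalize so the dot product becomes cosine similarity
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (B, B) similarity matrix; the diagonal holds the matched pairs
    logits = image_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Contrast in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Each image is pushed toward its own caption and away from every other caption in the batch, which is why large batch sizes (discussed below) matter so much for this objective.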

Training Approach and Architecture

The pretraining of PE follows a two-stage process. The first stage involves robust contrastive learning on a large-scale curated image-text dataset (5.4B pairs), where several architectural and training enhancements improve both accuracy and robustness. These include progressive resolution scaling, large batch sizes (up to 131K), use of the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.
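Of these enhancements, 2D RoPE extends rotary position embeddings to the image patch grid. One common axial formulation, sketched below, rotates half of each token's channels by its row coordinate and the other half by its column coordinate (a sketch of the general technique; PE's exact variant may differ):

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding over the last dim of x.

    x: (..., n, d) with d even; pos: (n,) integer positions.
    """
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                     # (n, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # rotate each (x1, x2) channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """Axial 2D RoPE for image patches: half the channels encode the row
    coordinate, the other half the column coordinate."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows),
                      rope_1d(x[..., half:], cols)], dim=-1)
```

For an H x W patch grid, `rows` and `cols` are simply each token's grid coordinates; because the rotation is applied per position, relative offsets in both axes are preserved in the attention dot products.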

The second stage introduces video understanding by leveraging a video data engine that synthesizes high-quality video-text pairs. This pipeline incorporates captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, which are then summarized using Llama 3.3. These synthetic annotations allow the same image encoder to be fine-tuned for video tasks via frame averaging.

Despite using a single contrastive objective, PE features general-purpose representations distributed across intermediate layers. To access these, Meta introduces two alignment strategies: language alignment, which tunes the encoder for tasks such as visual question answering and captioning, and spatial alignment, which targets detection, tracking, and depth estimation using self-distillation and spatial correspondence distillation via SAM2.
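For intuition on what language alignment involves, a common pattern (LLaVA-style, used here purely as an illustration; this is not PE's code, and all names and dimensions are hypothetical) is a small projector that maps the vision encoder's intermediate features into an LLM's token-embedding space:

```python
import torch
import torch.nn as nn

class LanguageAligner(nn.Module):
    """Hypothetical sketch: project frozen vision-encoder patch features
    into an LLM's embedding space so the pair can be tuned on
    captioning / VQA data."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):
        # patch_feats: (B, N, vision_dim) intermediate-layer features
        # returns:     (B, N, llm_dim) pseudo text tokens for the LLM
        return self.proj(patch_feats)
```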

Empirical Performance Across Modalities

PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks. On image classification, PEcoreG matches or exceeds proprietary models trained on large private datasets such as JFT-3B. It achieves:

- 86.6% on ImageNet-val
- 92.6% on ImageNet-Adversarial
- 88.2% on the full ObjectNet set
- competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers

In video tasks, PE achieves state-of-the-art performance on zero-shot classification and retrieval benchmarks, outperforming InternVideo2 and SigLIP2-g-opt, while being trained on just 22M synthetic video-caption pairs. The use of simple average pooling across frames—rather than temporal attention—demonstrates that architectural simplicity, when paired with well-aligned training data, can still yield high-quality video representations.
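That frame-averaging scheme can be sketched in a few lines: encode each frame independently with the image encoder, then mean-pool over time (illustrative; `image_encoder` is a stand-in for any frame-level encoder):

```python
import torch

def encode_video(frames, image_encoder):
    """frames: (T, C, H, W) clip. Encode each frame with the image encoder,
    then average-pool over time to get a single clip embedding."""
    with torch.no_grad():
        frame_embeds = image_encoder(frames)   # (T, D), one row per frame
    clip_embed = frame_embeds.mean(dim=0)      # (D,) temporal average
    # Normalize so the clip embedding is comparable to text embeddings
    return clip_embed / clip_embed.norm()
```

No temporal attention or motion modeling is involved; the clip embedding is just the unit-normalized mean of per-frame embeddings, which is what makes the reported video results notable.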

An ablation study shows that each component of the video data engine contributes meaningfully to performance. Improvements of +3.9% in classification and +11.1% in retrieval over image-only baselines highlight the utility of synthetic video data, even at modest scale.

Conclusion

Perception Encoder provides a technically compelling demonstration that a single contrastive objective, if implemented with care and paired with thoughtful alignment strategies, is sufficient to build general-purpose vision encoders. PE not only matches specialized models in their respective domains but does so with a unified and scalable approach.

The release of PE, along with its codebase and the PE Video Dataset, offers the research community a reproducible and efficient foundation for building multimodal AI systems. As visual reasoning tasks grow in complexity and scope, PE provides a path forward toward more integrated and robust visual understanding.


Check out the Paper, Model, Code and Dataset.


