MarkTechPost@AI · 2 days ago, 16:16
Meta AI Releases V-JEPA 2: Open-Source Self-Supervised World Models for Understanding, Prediction, and Planning

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from internet-scale video and deliver strong visual understanding, future-state prediction, and zero-shot planning. Built on the Joint Embedding Predictive Architecture (JEPA), V-JEPA 2 shows how self-supervised learning from passive internet video, combined with minimal robot interaction data, can provide a modular foundation for intelligent physical agents. The model is pretrained on more than 1 million hours of internet-scale video and 1 million images with a visual mask-denoising objective, focusing on predictable scene dynamics and thereby avoiding the inefficiency of pixel-level prediction. The V-JEPA 2-AC variant, fine-tuned on 62 hours of unlabeled robot video, enables zero-shot planning and performs strongly on robotic tasks.

💡 **Large-scale self-supervised pretraining:** V-JEPA 2 is pretrained on more than 1 million hours of internet-scale video and 1 million images using a visual mask-denoising objective, learning to reconstruct masked spatiotemporal patches in a latent representation space and thereby avoiding the inefficiency of pixel-level prediction.

📈 **Key technical innovations:** To scale JEPA pretraining, the researchers introduced four key techniques: data scaling (the VideoMix22M dataset), model scaling (a ViT-g encoder), a progressive-resolution training strategy, and spatiotemporal augmentation (64-frame clips at 384×384 resolution).

👁️ **Strong visual understanding:** V-JEPA 2 reaches 77.3% top-1 accuracy on the Something-Something v2 benchmark and is competitive with image-text pretrained models such as DINOv2 and PEcoreG on appearance understanding; attentive-probe evaluations verify that self-supervised learning can yield transferable, domain-agnostic visual features.

🤖 **Zero-shot robot planning:** V-JEPA 2-AC, an action-conditioned variant fine-tuned on 62 hours of unlabeled robot video, learns to predict future video embeddings conditioned on robot actions and poses. It enables zero-shot planning via model-predictive control and performs well on unseen robot tasks.

🚀 **Strong performance and efficiency:** V-JEPA 2-AC outperforms other models on grasping and manipulation tasks, plans each step in roughly 16 seconds versus about 4 minutes for Cosmos, and requires no calibration or environment-specific fine-tuning.

Meta AI has introduced V-JEPA 2, a scalable open-source world model designed to learn from video at internet scale and enable robust visual understanding, future state prediction, and zero-shot planning. Building upon the joint-embedding predictive architecture (JEPA), V-JEPA 2 demonstrates how self-supervised learning from passive internet video, combined with minimal robot interaction data, can yield a modular foundation for intelligent physical agents.

Scalable Self-Supervised Pretraining from 1M Hours of Video

V-JEPA 2 is pretrained on over 1 million hours of internet-scale video combined with 1 million images. Using a visual mask denoising objective, the model learns to reconstruct masked spatiotemporal patches in a latent representation space. This approach avoids the inefficiencies of pixel-level prediction by focusing on predictable scene dynamics while disregarding irrelevant noise.
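To make the objective concrete, here is a minimal sketch of masked prediction in latent space, assuming a context encoder, a frozen target encoder, and a predictor module; the names and signatures are illustrative assumptions, not V-JEPA 2's actual API.

```python
# Sketch of a mask-denoising step in latent space (illustrative only; the
# module interfaces and the frozen/EMA target encoder are assumptions).
import torch
import torch.nn.functional as F

def masked_latent_loss(context_encoder, target_encoder, predictor, clip, mask):
    """clip: (B, T, C, H, W) video; mask: (B, N) boolean over spatiotemporal patches."""
    with torch.no_grad():
        # Targets come from a frozen (e.g. EMA) encoder: the model regresses
        # latent features of masked patches instead of reconstructing pixels.
        targets = target_encoder(clip)                 # (B, N, D)

    context = context_encoder(clip, keep=~mask)        # encode only visible patches
    preds = predictor(context, mask)                   # predict latents at masked positions

    # Regression between predicted and target latents, restricted to masked patches.
    return F.l1_loss(preds[mask], targets[mask])
```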

To scale JEPA pretraining to this level, Meta researchers introduced four key techniques: data scaling (the VideoMix22M dataset), model scaling (a ViT-g encoder), a progressive-resolution training strategy, and spatiotemporal augmentation (training on 64-frame clips at 384×384 resolution).

These design choices led to an 88.2% average accuracy across six benchmark tasks—including SSv2, Diving-48, Jester, Kinetics, COIN, and ImageNet—surpassing previous baselines.

Understanding via Masked Representation Learning

V-JEPA 2 exhibits strong motion understanding capabilities. On the Something-Something v2 benchmark, it achieves 77.3% top-1 accuracy, outperforming models like InternVideo and VideoMAEv2. For appearance understanding, it remains competitive with state-of-the-art image-text pretraining models like DINOv2 and PEcoreG.

The encoder’s representations were evaluated using attentive probes, verifying that self-supervised learning alone can yield transferable and domain-agnostic visual features applicable across diverse classification tasks.
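For reference, an attentive probe can be sketched as a learned query that cross-attends over the frozen encoder's tokens, followed by a linear classification head; the exact probe configuration used in the paper may differ.

```python
# Minimal attentive-probe sketch: attention pooling over frozen features plus a
# linear head. Hyperparameters and naming are illustrative assumptions.
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim: int, num_classes: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, D) frozen features from the pretrained video encoder."""
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)        # cross-attention pooling
        return self.head(pooled.squeeze(1))             # (B, num_classes) logits
```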

Temporal Reasoning via Video Question Answering

To assess temporal reasoning, the V-JEPA 2 encoder is aligned with a multimodal large language model and evaluated on multiple video question-answering tasks. Despite lacking language supervision during pretraining, the model achieves competitive results across these benchmarks.

These results challenge the assumption that visual-language alignment requires co-training from the start, demonstrating that a pretrained video encoder can be aligned post hoc with strong generalization.

V-JEPA 2-AC: Learning Latent World Models for Robotic Planning

A key innovation in this release is V-JEPA 2-AC, an action-conditioned variant of the pretrained encoder. Fine-tuned using only 62 hours of unlabeled robot video from the Droid dataset, V-JEPA 2-AC learns to predict future video embeddings conditioned on robot actions and poses. The architecture is a 300M parameter transformer with block-causal attention, trained using a teacher-forcing and rollout objective.
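As a rough illustration of the block-causal constraint, the mask below lets tokens attend freely within their own frame block and only to earlier blocks otherwise; how frame tokens are actually grouped in V-JEPA 2-AC is an assumption here.

```python
# Block-causal attention mask sketch: full attention within a block (frame),
# causal attention across blocks. Block layout is an illustrative assumption.
import torch

def block_causal_mask(num_blocks: int, tokens_per_block: int) -> torch.Tensor:
    """Returns a boolean (L, L) mask where True marks an allowed attention edge."""
    block_ids = torch.arange(num_blocks).repeat_interleave(tokens_per_block)
    # Token i may attend to token j iff j's block index <= i's block index.
    return block_ids.unsqueeze(1) >= block_ids.unsqueeze(0)

# Example: 3 frames with 2 tokens each -> a 6x6 mask of 2x2 blocks on/below the diagonal.
mask = block_causal_mask(num_blocks=3, tokens_per_block=2)
```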

This allows zero-shot planning through model-predictive control. The model infers action sequences by minimizing the distance between imagined future states and visual goals using the Cross-Entropy Method (CEM). It achieves high success in tasks such as reaching, grasping, and pick-and-place on unseen robot arms in different labs—without any reward supervision or additional data collection.
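The planning loop can be sketched as a standard CEM search over a learned latent model: sample candidate action sequences, roll them out in imagination, score them by distance to the goal embedding, and refit the sampling distribution to the best candidates. The `world_model` and `encoder` interfaces below are hypothetical stand-ins, not the released API.

```python
# Schematic CEM-based model-predictive control against a latent world model.
import torch

def cem_plan(world_model, encoder, obs, goal_image,
             horizon=5, pop=256, elites=32, iters=4, act_dim=7):
    z = encoder(obs)                       # current latent state
    z_goal = encoder(goal_image)           # latent embedding of the visual goal
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)

    for _ in range(iters):
        actions = mean + std * torch.randn(pop, horizon, act_dim)    # sample candidates
        costs = []
        for a in actions:
            z_t = z
            for t in range(horizon):
                z_t = world_model(z_t, a[t])                         # imagined rollout
            costs.append(torch.norm(z_t - z_goal))                   # distance to goal latent
        costs = torch.stack(costs)
        elite = actions[costs.topk(elites, largest=False).indices]   # lowest-cost plans
        mean, std = elite.mean(0), elite.std(0)                      # refit distribution

    return mean[0]                         # execute the first action, then replan (MPC)
```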

Benchmarks: Robust Performance and Planning Efficiency

Compared to baselines like Octo (behavior cloning) and Cosmos (latent diffusion world models), V-JEPA 2-AC achieves higher success rates on grasping and manipulation tasks while planning each action in roughly 16 seconds per step, versus about 4 minutes per step for Cosmos.

Notably, it operates using a monocular RGB camera without calibration or environment-specific fine-tuning, reinforcing the generalization capability of the learned world model.

Conclusion

Meta’s V-JEPA 2 represents a significant advancement in scalable self-supervised learning for physical intelligence. By decoupling observation learning from action conditioning and leveraging large-scale passive video, V-JEPA 2 demonstrates that general-purpose visual representations can be harnessed for both perception and control in the real world.


Check out the Paper, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project.

