MarkTechPost@AI · Feb 23
Meta AI Releases the Video Joint Embedding Predictive Architecture (V-JEPA) Model: A Crucial Step in Advancing Machine Intelligence

V-JEPA is a video learning model from Meta and collaborating institutions. It does not depend on pretrained encoders or similar crutches, is trained on two million public videos, and performs strongly across a range of tasks, demonstrating the advantages of feature prediction for learning effective video representations.

🧠 V-JEPA forgoes several traditional techniques and is trained through feature prediction alone

💪 Trained on two million public videos, it performs strongly across many tasks

👍 It outperforms comparable methods on several benchmarks

🌟 The results demonstrate the advantages of feature prediction for video learning

Humans have an innate ability to process raw visual signals from the retina and develop a structured understanding of their surroundings, identifying objects and motion patterns. A major goal of machine learning is to uncover the underlying principles that enable such unsupervised human learning. One key hypothesis, the predictive feature principle, suggests that representations of consecutive sensory inputs should be predictive of one another. Early methods, including slow feature analysis and spectral techniques, aimed to maintain temporal consistency while preventing representation collapse. More recent approaches incorporate siamese networks, contrastive learning, and masked modeling to ensure meaningful representation evolution over time. Instead of focusing solely on temporal invariance, modern techniques train predictor networks to map feature relationships across different time steps, using frozen encoders or training both the encoder and predictor simultaneously. This predictive framework has been successfully applied across modalities like images and audio, with models such as JEPA leveraging joint-embedding architectures to predict missing feature-space information effectively.
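The predictive feature principle can be stated compactly in code. Below is a toy sketch, not code from any of the cited works: a predictor is trained to map the representation of one time step to that of the next, with a stop-gradient on the target branch as one common guard against collapse. The two-layer modules, 32×32 frames, and embedding size are placeholder choices.

```python
import torch
import torch.nn as nn

# Toy sketch of the predictive feature principle: the representation of the
# current input should be predictive of the representation of the next one.
# The encoder, predictor, and all shapes are illustrative placeholders.

embed_dim = 256
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, embed_dim))
predictor = nn.Sequential(
    nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
)

frame_t = torch.randn(8, 3, 32, 32)    # batch of frames at time t
frame_t1 = torch.randn(8, 3, 32, 32)   # the frames one step later

z_t = encoder(frame_t)                 # representation of the present
with torch.no_grad():                  # stop-gradient on the target branch
    z_t1 = encoder(frame_t1)           # target representation of the future

loss = nn.functional.l1_loss(predictor(z_t), z_t1)
loss.backward()                        # trains encoder and predictor jointly
```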

Advancements in self-supervised learning, particularly through vision transformers and joint-embedding architectures, have significantly improved masked modeling and representation learning. Spatiotemporal masking has extended these improvements to video data, enhancing the quality of learned representations. Additionally, cross-attention-based pooling mechanisms have refined masked autoencoders, while methods like BYOL mitigate representation collapse without relying on handcrafted augmentations. Compared to pixel-space reconstruction, predicting in feature space allows models to filter out irrelevant details, leading to efficient, adaptable representations that generalize well across tasks. Recent research highlights that this strategy is computationally efficient and effective across domains like images, audio, and text. This work extends these insights to video, showcasing how predictive feature learning enhances spatiotemporal representation quality.
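To make the collapse-avoidance point concrete, here is a minimal sketch (an illustration, not code from BYOL or the paper) of the exponential-moving-average target trick that BYOL popularized: prediction targets come from a slowly updated copy of the encoder that is never trained by backpropagation, which in practice keeps the encoder from collapsing to a constant output. The momentum value and the linear stand-in for the encoder are assumptions for brevity.

```python
import copy
import torch
import torch.nn as nn

# Sketch of a BYOL-style collapse guard: prediction targets come from an
# exponential-moving-average (EMA) copy of the encoder that is never
# updated by backprop. The 0.998 momentum and the linear stand-in for a
# real encoder are illustrative choices.

online_encoder = nn.Linear(128, 64)
target_encoder = copy.deepcopy(online_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, momentum: float = 0.998):
    """Move each target weight a small step toward its online counterpart."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)

# Called once per training step, after the optimizer has updated the
# online encoder; the slowly moving target provides stable features.
ema_update(online_encoder, target_encoder)
```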

Researchers from FAIR at Meta, Inria, École normale supérieure, CNRS, PSL Research University, Univ. Gustave Eiffel, Courant Institute, and New York University introduced V-JEPA, a vision model trained exclusively on feature prediction for unsupervised video learning. Unlike traditional approaches, V-JEPA does not rely on pretrained encoders, negative samples, reconstruction, or textual supervision. Trained on two million public videos, it achieves strong performance on motion and appearance-based tasks without fine-tuning. Notably, V-JEPA outperforms other methods on Something-Something-v2 and remains competitive on Kinetics-400, demonstrating that feature prediction alone can produce efficient and adaptable visual representations with shorter training durations.

The methodology centers on masked prediction in feature space rather than pixel space. Each video clip is split into spatiotemporal patches, and large contiguous blocks of those patches are hidden from the encoder. A vision transformer encodes the visible patches, and a narrower transformer predictor, conditioned on the positions of the hidden regions, regresses their representations. The regression targets come from an exponential-moving-average copy of the encoder with a stop-gradient, which avoids the trivial collapsed solution without negative samples or pixel reconstruction. A simplified sketch of this objective appears below.
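Under that reading, one training step reduces to a few operations, sketched here. The token count, 75% mask ratio, and miniature transformers are illustrative stand-ins, not the paper's configuration, and the real model drops masked tokens from the encoder input rather than zeroing them.

```python
import copy
import torch
import torch.nn as nn

# Simplified sketch of masked spatiotemporal feature prediction, in the
# spirit of the objective described above. Dimensions, mask ratio, and the
# tiny transformers are placeholders, not V-JEPA's actual configuration.

B, N, D = 2, 128, 192                    # batch, space-time tokens, embed dim

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2
)
predictor = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=1
)
target_encoder = copy.deepcopy(encoder)   # EMA copy; frozen for this sketch
mask_token = nn.Parameter(torch.zeros(1, 1, D))

tokens = torch.randn(B, N, D)             # patch embeddings of one video clip
mask = torch.rand(B, N) < 0.75            # hide ~75% of space-time patches

# Zero out masked tokens for the context encoder (the real model drops
# them entirely), then splice in a learned mask token at hidden positions
# before asking the predictor to fill in their features.
context = encoder(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
context = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), context)
predicted = predictor(context)

with torch.no_grad():                     # targets come from the EMA encoder
    targets = target_encoder(tokens)

loss = (predicted - targets).abs()[mask].mean()  # L1 loss on masked positions
loss.backward()
```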

V-JEPA is compared to pixel prediction methods using similar model architectures and shows superior performance across video and image tasks in frozen evaluation, except for ImageNet classification. With fine-tuning, it outperforms ViT-L/16-based models and matches Hiera-L while requiring fewer training samples. Compared to state-of-the-art models, V-JEPA excels in motion understanding and video tasks, training more efficiently. It also demonstrates strong label efficiency, outperforming competitors in low-shot settings by maintaining accuracy with fewer labeled examples. These results highlight the advantages of feature prediction in learning effective video representations with reduced computational and data requirements.
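For context on what frozen evaluation means here: the pretrained backbone is never updated, and only a lightweight classifier on top is trained per task. A minimal sketch follows; a linear probe is shown for brevity (the paper's protocol uses an attentive probe), and the feature size and placeholder encoder are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the frozen-evaluation protocol behind the comparisons above:
# the pretrained encoder's weights stay fixed, and only a small probe is
# trained on downstream labels. The 768-d inputs and the linear stand-in
# for a pretrained video transformer are illustrative.

D, num_classes = 256, 174                 # Something-Something-v2 has 174 classes

encoder = nn.Linear(768, D)               # stands in for a pretrained backbone
encoder.eval()
for p in encoder.parameters():
    p.requires_grad_(False)

probe = nn.Linear(D, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

clips = torch.randn(8, 768)               # stand-in for pooled clip features
labels = torch.randint(0, num_classes, (8,))

with torch.no_grad():
    features = encoder(clips)             # frozen features, no gradient

loss = nn.functional.cross_entropy(probe(features), labels)
loss.backward()                           # gradients reach the probe only
optimizer.step()
```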

In conclusion, the study examined the effectiveness of feature prediction as an independent objective for unsupervised video learning. It introduced V-JEPA, a set of vision models trained purely through self-supervised feature prediction. V-JEPA performs well across various image and video tasks without requiring parameter adaptation, surpassing previous video representation methods in frozen evaluations for action recognition, spatiotemporal action detection, and image classification. Pretraining on videos enhances its ability to capture fine-grained motion details, where large-scale image models struggle. Additionally, V-JEPA demonstrates strong label efficiency, maintaining high performance even when limited labeled data is available for downstream tasks.


    Check out the Paper and Blog. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


