MarkTechPost@AI · April 25, 04:05
Meta AI Releases Web-SSL: A Scalable and Language-Free Approach to Visual Representation Learning

Meta has released Web-SSL, a family of DINO and Vision Transformer models ranging from 300 million to 7 billion parameters. The models are trained exclusively on the image subset of the MetaCLIP dataset and are intended to evaluate the capabilities of purely visual self-supervised learning. Experimental results show that by scaling model size and refining data composition, Web-SSL performs strongly on visual question answering tasks, particularly OCR and chart-related tasks. The study challenges the assumption that language supervision is necessary for multimodal understanding and lays the groundwork for future language-free vision learning systems.

🚀 Meta has released the Web-SSL model family, with parameter counts ranging from 300 million to 7 billion, publicly available on Hugging Face. The models are trained entirely on the image portion of the MetaCLIP dataset and are used to evaluate the potential of large-scale, language-free visual learning.

📊 Web-SSL models show near log-linear performance gains on visual question answering (VQA) tasks, especially on vision-centric and OCR & chart tasks. When the training data is filtered down to the 1.3% of images that are text-rich, Web-SSL even outperforms CLIP on OCR and chart tasks.

🔬 The Web-SSL study shows that dataset composition, model scale, and careful evaluation across diverse benchmarks are all critical. Even without language supervision, larger vision models implicitly learn features related to textual semantics.

🖼️ Web-SSL models maintain strong performance on traditional benchmarks (such as ImageNet-1k classification, ADE20K segmentation, and NYUv2 depth estimation), and under equivalent settings they often outperform MetaCLIP and even DINOv2.

In recent years, contrastive language-image models such as CLIP have established themselves as a default choice for learning vision representations, particularly in multimodal applications like Visual Question Answering (VQA) and document understanding. These models leverage large-scale image-text pairs to incorporate semantic grounding via language supervision. However, this reliance on text introduces both conceptual and practical challenges: the assumption that language is essential for multimodal performance, the complexity of acquiring aligned datasets, and the scalability limits imposed by data availability. In contrast, visual self-supervised learning (SSL)—which operates without language—has historically demonstrated competitive results on classification and segmentation tasks, yet has been underutilized for multimodal reasoning due to performance gaps, especially in OCR and chart-based tasks.

Meta Releases WebSSL Models on Hugging Face (300M–7B Parameters)

To explore the capabilities of language-free visual learning at scale, Meta has released the Web-SSL family of DINO and Vision Transformer (ViT) models, ranging from 300 million to 7 billion parameters, now publicly available via Hugging Face. These models are trained exclusively on the image subset of the MetaCLIP dataset (MC-2B)—a web-scale dataset comprising two billion images. This controlled setup enables a direct comparison between WebSSL and CLIP, both trained on identical data, isolating the effect of language supervision.

The objective is not to replace CLIP, but to rigorously evaluate how far pure visual self-supervision can go when model and data scale are no longer limiting factors. This release represents a significant step toward understanding whether language supervision is necessary—or merely beneficial—for training high-capacity vision encoders.

Technical Architecture and Training Methodology

WebSSL encompasses two visual SSL paradigms: joint-embedding learning (via DINOv2) and masked modeling (via MAE). Each model follows a standardized training protocol using 224×224 resolution images and maintains a frozen vision encoder during downstream evaluation to ensure that observed differences are attributable solely to pretraining.
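
To make the frozen-encoder protocol concrete, here is a minimal sketch of a linear probe on top of a frozen vision encoder at 224×224 resolution. The encoder interface, feature dimension, and class count are illustrative assumptions, not the exact configuration used in the study.

```python
# Minimal sketch of frozen-encoder evaluation (assumption: the encoder maps
# a batch of 224x224 images to pooled features of size `feat_dim`).
import torch
import torch.nn as nn


class FrozenEncoderProbe(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the pretrained vision encoder so only the probe head is
        # trained, attributing downstream differences to pretraining alone.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(images)  # pooled features from 224x224 inputs
        return self.head(feats)


# Usage with a hypothetical encoder mapping (B, 3, 224, 224) -> (B, 1024):
# probe = FrozenEncoderProbe(encoder, feat_dim=1024, num_classes=1000)
# logits = probe(torch.randn(8, 3, 224, 224))
```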

Models are trained across five capacity tiers (ViT-1B to ViT-7B), using only unlabeled image data from MC-2B. Evaluation is conducted using Cambrian-1, a comprehensive 16-task VQA benchmark suite encompassing general vision understanding, knowledge-based reasoning, OCR, and chart-based interpretation.

In addition, the models are natively supported in Hugging Face’s transformers library, providing accessible checkpoints and seamless integration into research workflows.
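
As a reference, the sketch below shows how such a checkpoint could be loaded through the standard transformers AutoModel and AutoImageProcessor APIs; the checkpoint identifier is an assumption and should be checked against Meta's official Hugging Face collection.

```python
# Minimal loading sketch via Hugging Face transformers. The checkpoint id
# below is an assumption -- verify the exact name on Hugging Face.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

ckpt = "facebook/webssl-dino1b-full2b-224"  # assumed checkpoint id

processor = AutoImageProcessor.from_pretrained(ckpt)
model = AutoModel.from_pretrained(ckpt)
model.eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Patch-token embeddings from the frozen encoder.
features = outputs.last_hidden_state
print(features.shape)
```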

Performance Insights and Scaling Behavior

Experimental results reveal several key findings:

- WebSSL scales nearly log-linearly on VQA benchmarks as model capacity grows, with the largest gains on vision-centric and OCR & chart tasks.
- Filtering the pretraining data to the roughly 1.3% of text-rich images allows WebSSL to match or surpass CLIP on OCR and chart tasks.
- Even without language supervision, larger vision models implicitly learn features aligned with textual semantics.

Importantly, WebSSL maintains strong performance on traditional benchmarks (ImageNet-1k classification, ADE20K segmentation, NYUv2 depth estimation), and often outperforms MetaCLIP and even DINOv2 under equivalent settings.

Concluding Observations

Meta’s Web-SSL study provides strong evidence that visual self-supervised learning, when scaled appropriately, is a viable alternative to language-supervised pretraining. These findings challenge the prevailing assumption that language supervision is essential for multimodal understanding. Instead, they highlight the importance of dataset composition, model scale, and careful evaluation across diverse benchmarks.

The release of models ranging from 300M to 7B parameters enables broader research and downstream experimentation without the constraints of paired data or proprietary pipelines. As open-source foundations for future multimodal systems, WebSSL models represent a meaningful advancement in scalable, language-free vision learning.


Check out the Models on Hugging Face, GitHub Page and Paper.
