MarkTechPost@AI August 24, 2024
Meta Presents Sapiens: Foundation for Human Vision Models

Meta has released Sapiens, a family of vision transformer models pretrained at scale on human images. Trained on the Humans-300M dataset and paired with high-quality annotations, Sapiens achieves robust generalization across a range of human-centric vision tasks. It performs strongly on pose estimation, body-part segmentation, depth estimation, and surface normal estimation, demonstrating the effectiveness of domain-specific pretraining in computer vision.

👨‍💻 **Large-scale pretraining and high-quality annotations:** Sapiens is pretrained at scale on the Humans-300M dataset of 300 million human images, paired with high-quality annotations including 308 keypoints for pose estimation and 28 segmentation classes. This pretraining strategy enables robust generalization across a range of human-centric vision tasks.

📈 **Architectural innovations:** Sapiens scales model width rather than depth, improving performance without a significant increase in computational cost. The models also employ layer-wise learning rate decay and weight decay optimization to further boost performance.

🎯 **Effectiveness of domain-specific pretraining:** Sapiens excels at pose estimation, body-part segmentation, depth estimation, and surface normal estimation, demonstrating the effectiveness of domain-specific pretraining in computer vision. It generalizes strongly across real-world scenarios and supports high-fidelity inference at 1K resolution.

🚀 **Outlook:** As a foundation model for future downstream tasks, Sapiens could be extended to 3D and multi-modal datasets, further advancing human-centric vision models.

🌟 **Conclusion:** Sapiens represents a major advance in human-centric vision models, demonstrating strong generalization across a range of tasks. Its performance stems from large-scale pretraining on a carefully curated dataset, high-resolution vision transformers, and high-quality annotations. Sapiens makes high-quality vision backbones more accessible and offers significant potential for future research and applications.

Large-scale pretraining followed by task-specific fine-tuning has revolutionized language modeling and is now transforming computer vision. Extensive datasets like LAION-5B and JFT-300M enable pre-training beyond traditional benchmarks, expanding visual learning capabilities. Notable models such as DINOv2, MAWS, and AIM have made significant strides in self-supervised feature generation and masked autoencoder scaling. However, existing methods often overlook human-centric approaches, focusing primarily on general image pretraining or zero-shot classification.

This paper introduces Sapiens, a collection of high-resolution vision transformer models pretrained on millions of human images. Previous work has not scaled vision transformers to the extent of large language models; Sapiens addresses this gap by leveraging Humans-300M, a diverse collection of 300 million human images that allows the authors to study how the pretraining data distribution affects downstream human-specific tasks. By emphasizing human-centric pretraining, Sapiens aims to advance computer vision in areas such as 3D human digitization, keypoint estimation, and body-part segmentation, which are crucial for real-world applications.

Sapiens takes a new approach to human-centric computer vision, combining large-scale pretraining on human images with high-quality annotations to achieve robust generalization, broad applicability, and high fidelity in real-world scenarios. The methodology relies on straightforward data curation and pretraining, yet yields significant performance improvements. Sapiens supports high-fidelity inference at 1K resolution and achieves state-of-the-art results on various benchmarks. As a potential foundation model for downstream tasks, Sapiens demonstrates the effectiveness of domain-specific pretraining in computer vision, with future work potentially extending to 3D and multi-modal datasets.
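To make the shared-backbone, per-task design concrete, here is a minimal illustrative sketch of how lightweight decoder heads could sit on top of one high-resolution ViT encoder. The 308-keypoint and 28-class figures come from the paper, but the decoder structure and the feature channel width (1024) are assumptions for illustration, not Sapiens' published architecture.

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS = 308     # pose keypoints reported for Sapiens
NUM_PART_CLASSES = 28   # body-part segmentation classes

class DeconvHead(nn.Module):
    """Illustrative per-task decoder: upsamples ViT patch features of
    shape (B, C, H/16, W/16) toward dense per-pixel predictions. The
    real Sapiens decoders are not specified here."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 256, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(256, out_channels, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.decode(feats)

# One shared backbone, one head per downstream task (channel width assumed):
pose_head = DeconvHead(1024, NUM_KEYPOINTS)      # keypoint heatmaps
seg_head = DeconvHead(1024, NUM_PART_CLASSES)    # part-segmentation logits
depth_head = DeconvHead(1024, 1)                 # per-pixel depth
normal_head = DeconvHead(1024, 3)                # surface normals
```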

The Sapiens models employ a multifaceted methodology focusing on large-scale pretraining, high-quality annotations, and architectural innovations. The approach uses a curated dataset for human-centric tasks, with precise annotations covering 308 keypoints for pose estimation and 28 segmentation classes. The architectural design prioritizes width scaling over depth, enhancing performance without a significant increase in computational cost, and incorporates layer-wise learning rate decay and weight decay optimization. Training emphasizes generalization across varied environments and uses synthetic data for depth and normal estimation. This combination yields robust models that perform diverse human-centric tasks effectively in real-world scenarios, addressing shortcomings of existing public benchmarks and enhancing model adaptability.
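The layer-wise learning rate decay mentioned above is a standard trick for fine-tuning pretrained transformers (popularized by BEiT and MAE): earlier blocks receive geometrically smaller learning rates than later ones, so pretrained low-level features change conservatively. A minimal sketch, assuming a timm-style ViT whose transformer layers are named `blocks.<i>`; the naming scheme and hyperparameter values are assumptions, not Sapiens' published settings.

```python
import torch

def layerwise_lr_param_groups(model, base_lr=1e-4, decay=0.85, num_layers=24):
    """Build optimizer parameter groups where block i gets learning rate
    base_lr * decay ** (num_layers - i): later blocks and the task head
    update faster than early blocks and the patch embedding."""
    groups = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if name.startswith("blocks."):
            layer_id = int(name.split(".")[1]) + 1       # transformer block i
        elif name.startswith("patch_embed") or name in ("cls_token", "pos_embed"):
            layer_id = 0                                  # earliest parameters
        else:
            layer_id = num_layers                         # final norm / head
        scale = decay ** (num_layers - layer_id)
        groups.append({"params": [param], "lr": base_lr * scale})
    return groups

# Hypothetical usage with a 24-block ViT named `vit`:
# optimizer = torch.optim.AdamW(
#     layerwise_lr_param_groups(vit, base_lr=1e-4, decay=0.85, num_layers=24),
#     weight_decay=0.05,  # weight decay applied alongside the layer-wise LRs
# )
```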

The Sapiens models underwent comprehensive evaluation across four primary tasks: pose estimation, part segmentation, depth estimation, and normal estimation. Pretraining on the Humans-300M dataset led to superior performance across all metrics, quantified using mAP for pose estimation, mIoU for segmentation, RMSE for depth estimation, and mean angular error for normal estimation. Increasing the pretraining dataset size consistently improved performance, demonstrating a correlation between data diversity and model generalization. The models also exhibited robust generalization across varied in-the-wild scenarios. Overall, Sapiens performed strongly on all evaluated tasks, with improvements linked to the quality and quantity of pretraining data. These results affirm the efficacy of the Sapiens methodology in creating precise and generalizable human vision models.
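For reference, the two dense-prediction metrics above are straightforward to compute. A minimal NumPy sketch, where the array shapes and the optional valid-pixel mask are assumptions for illustration:

```python
import numpy as np

def depth_rmse(pred, gt, mask=None):
    """Root-mean-square error between predicted and ground-truth
    depth maps of shape (H, W)."""
    diff = pred - gt
    if mask is not None:
        diff = diff[mask]          # restrict to valid pixels
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_angular_error(pred, gt, mask=None):
    """Mean angle in degrees between predicted and ground-truth normal
    maps of shape (H, W, 3); vectors are re-normalized for safety."""
    pred = pred / np.clip(np.linalg.norm(pred, axis=-1, keepdims=True), 1e-8, None)
    gt = gt / np.clip(np.linalg.norm(gt, axis=-1, keepdims=True), 1e-8, None)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    if mask is not None:
        err = err[mask]
    return float(err.mean())
```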

In conclusion, Sapiens represents a significant advancement in human-centric vision models, demonstrating strong generalization across various tasks. Its exceptional performance stems from large-scale pretraining on a curated dataset, high-resolution vision transformers, and high-quality annotations. Positioned as a foundational element for downstream tasks, Sapiens makes high-quality vision backbones more accessible. Future work may extend to 3D and multi-modal datasets. The research emphasizes that combining domain-specific large-scale pretraining with limited high-quality annotations leads to robust real-world generalization, reducing the need for extensive annotation sets. Sapiens thus emerges as a transformative model in human-centric vision, offering significant potential for future research and applications.


Check out the Paper. All credit for this research goes to the researchers of this project.
