MarkTechPost@AI, October 8, 2024
Apple AI Releases Depth Pro: A Foundation Model for Zero-Shot Metric Monocular Depth Estimation

Apple has released Depth Pro, an advanced AI model for zero-shot metric monocular depth estimation that reshapes 3D vision by producing sharp, high-resolution depth maps in under a second. Depth Pro is designed to bridge the gap left by traditional methods, producing metric depth maps with absolute scale in zero-shot conditions: it can derive detailed depth information from an arbitrary image without additional training on domain-specific data. Its architecture is built around a multi-scale vision transformer (ViT) that balances global image context with the preservation of fine structures. To train the model, Apple used both real and synthetic datasets and applied a two-stage training curriculum. A notable feature of Depth Pro is its zero-shot focal-length estimation: unlike many earlier methods that rely on known camera intrinsics, it estimates the focal length directly from the depth network's features, making it more versatile across real-world applications. The model's contributions are validated through extensive experiments showing better performance than prior methods across multiple dimensions. Depth Pro excels particularly in boundary accuracy and latency; evaluations show it traces fine structures and boundaries with unmatched precision, significantly outperforming other state-of-the-art models such as Marigold, Depth Anything v2, and Metric3D v2. Compared with other methods, for example, Depth Pro produces sharper depth maps and traces occluding boundaries more accurately, yielding cleaner novel view synthesis.

🍎 Depth Pro is an advanced AI model that reshapes 3D vision by producing sharp, high-resolution depth maps in under a second.

🎯 Depth Pro is designed to bridge the gap left by traditional methods, producing metric depth maps with absolute scale in zero-shot conditions: it can derive detailed depth information from an arbitrary image without additional training on domain-specific data.

🚀 Depth Pro's architecture is built around a multi-scale vision transformer (ViT) that balances global image context with the preservation of fine structures. It uses a distinctive two-stage training curriculum: first training on diverse real-world and synthetic datasets for robust, cross-domain feature learning, then training on synthetic datasets with pixel-accurate ground truth to sharpen the depth maps, with a focus on high-quality boundary tracing.

💡 A notable feature of Depth Pro is its zero-shot focal-length estimation. Unlike many earlier methods that rely on known camera intrinsics, Depth Pro estimates the focal length directly from the depth network's features, making it more versatile across real-world applications.

📊 Depth Pro excels in boundary accuracy and latency: evaluations show it traces fine structures and boundaries with unmatched precision, significantly outperforming other state-of-the-art models such as Marigold, Depth Anything v2, and Metric3D v2.

⏱️ Depth Pro is also fast, running one to two orders of magnitude faster than models that focus on fine-grained boundary prediction, such as Marigold and PatchFusion.

⚠️ Despite its strong performance, Depth Pro has some limitations. The model struggles with translucent surfaces and volumetric scattering, where defining a single depth per pixel becomes ambiguous.

🌟 Even with these limitations, Depth Pro marks a significant advance in monocular depth estimation, providing a powerful foundation model that is both highly accurate and computationally efficient.

🌐 Depth Pro's combination of zero-shot metric depth estimation, high resolution, sharp boundary tracing, and real-time processing makes it a leading model for 3D vision applications ranging from image editing to virtual reality. By removing the need for metadata and producing sharp, detailed depth maps in under a second, Depth Pro sets a new standard for depth estimation technology and is a valuable tool for developers and researchers in computer vision.

Introduction

Traditional depth estimation methods often require metadata, such as camera intrinsics, or involve additional processing steps that limit their applicability in real-world scenarios. These limitations make it challenging to produce accurate depth maps efficiently, especially for diverse applications like augmented reality, virtual reality, and advanced image editing. To address these challenges, Apple introduced Depth Pro, an advanced AI model designed for zero-shot metric monocular depth estimation, reshaping the field of 3D vision by providing sharp, high-resolution depth maps in a fraction of a second.

Bridging the Gap in Depth Estimation

Depth Pro aims to bridge the gap in traditional methods by producing metric depth maps with absolute scale in zero-shot conditions, meaning it can create detailed depth information from an arbitrary image without additional training on domain-specific data. Inspired by previous work such as MiDaS, Depth Pro operates efficiently, generating a 2.25-megapixel depth map in just 0.3 seconds on a standard V100 GPU, demonstrating its practicality for real-time applications such as image editing, virtual reality, and augmented reality.
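
For readers who want to try this, the sketch below shows what a minimal invocation of the released model might look like. It assumes the Python package shipped with Apple's open-source release (`depth_pro`, with `create_model_and_transforms`, `load_rgb`, and `model.infer`); treat the exact names and return keys as assumptions and check the repository for the current API.

```python
# Minimal sketch of running Depth Pro on a single image.
# Assumes the open-source release's Python package (`depth_pro`); names and
# return keys may differ across versions.
import depth_pro

# Build the network and the matching preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the focal length in pixels if EXIF provides it,
# otherwise None, in which case the model falls back to its own estimate.
image, _, f_px = depth_pro.load_rgb("example.jpg")
image = transform(image)

# Inference returns metric depth (meters) plus the estimated focal length.
prediction = model.infer(image, f_px=f_px)
depth_m = prediction["depth"]            # HxW tensor, meters
focal_px = prediction["focallength_px"]  # scalar, pixels
```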

Architecture and Training

Depth Pro’s architecture is centered around a multi-scale vision transformer (ViT) designed to balance capturing global image context with preserving fine structures. Unlike conventional transformers, Depth Pro applies a plain ViT backbone at multiple scales and fuses predictions into a single high-resolution output, benefiting from ongoing advancements in ViT pretraining. This multi-scale approach ensures sharp boundary delineation even in complex scenarios involving thin structures such as hair and fur, which are typically challenging for monocular depth estimation models.
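
The multi-scale idea can be illustrated with a purely schematic sketch: one shared plain-ViT backbone encodes a downsampled copy of the whole image for global context and overlapping full-resolution tiles for detail, and a fusion head merges the two into a single high-resolution prediction. This is not Apple's implementation; the module names, tile size, and fusion interface below are placeholders.

```python
import torch
import torch.nn.functional as F
from torch import nn


class MultiScaleDepth(nn.Module):
    """Schematic multi-scale ViT fusion (illustrative only, not Depth Pro's code)."""

    def __init__(self, vit_encoder: nn.Module, fusion_head: nn.Module, patch: int = 384):
        super().__init__()
        self.encoder = vit_encoder  # a shared, plain ViT backbone
        self.fusion = fusion_head   # decoder that merges multi-scale features
        self.patch = patch

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Global pass: downsample the whole image to the ViT's native input size.
        global_in = F.interpolate(image, size=(self.patch, self.patch), mode="bilinear")
        global_feat = self.encoder(global_in)

        # Local passes: tile the full-resolution image into overlapping patches
        # and encode every tile with the *same* backbone.
        tiles = F.unfold(image, kernel_size=self.patch, stride=self.patch // 2)
        tiles = tiles.transpose(1, 2).reshape(-1, image.shape[1], self.patch, self.patch)
        local_feat = self.encoder(tiles)

        # Fuse coarse global context with fine local features into one
        # high-resolution depth prediction.
        return self.fusion(global_feat, local_feat, out_size=image.shape[-2:])
```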

To train the model, Apple used both real and synthetic datasets, implementing a two-stage training curriculum. Initially, Depth Pro was trained on a diverse mix of real-world and synthetic datasets to achieve robust feature learning that generalizes well across domains. In the second stage, synthetic datasets with pixel-accurate ground truth were used to sharpen the depth maps, focusing on high-quality boundary tracing. This unique curriculum helped Depth Pro achieve superior boundary accuracy, eliminating artifacts like “flying pixels” that degrade image quality in other models.
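
In outline, the curriculum described above can be sketched as two training loops: a first pass over a mixed real-and-synthetic loader with a plain depth loss, then a second pass over synthetic data that adds a gradient term to sharpen discontinuities. The loss definitions and loaders below are placeholders, not the losses used in the paper.

```python
import torch.nn.functional as F


def depth_loss(pred, gt):
    # Placeholder: mean absolute error on metric depth.
    return F.l1_loss(pred, gt)


def boundary_loss(pred, gt):
    # Placeholder: match horizontal/vertical depth gradients to sharpen edges.
    dx = lambda d: d[..., :, 1:] - d[..., :, :-1]
    dy = lambda d: d[..., 1:, :] - d[..., :-1, :]
    return F.l1_loss(dx(pred), dx(gt)) + F.l1_loss(dy(pred), dy(gt))


def train_two_stage(model, stage1_loader, stage2_loader, optimizer):
    # Stage 1: diverse real + synthetic mix for cross-domain generalization.
    for images, gt in stage1_loader:
        loss = depth_loss(model(images), gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: synthetic data with pixel-accurate ground truth, plus a
    # gradient term that penalizes blurry depth discontinuities.
    for images, gt in stage2_loader:
        pred = model(images)
        loss = depth_loss(pred, gt) + boundary_loss(pred, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```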

Zero-Shot Focal Length Estimation

One of Depth Pro’s notable features is its zero-shot focal length estimation capability. Unlike many previous methods that rely on known camera intrinsics, Depth Pro estimates the focal length directly from the depth network’s features, enhancing its versatility for diverse real-world applications. This allows the model to synthesize views from arbitrary images, such as specifying a desired distance for rendering, without requiring metadata.
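
Because the model returns metric depth together with an estimated focal length in pixels, an arbitrary image can be lifted into a metric 3D point cloud without any metadata, which is what enables rendering views at a chosen distance. The sketch below shows the standard pinhole unprojection this relies on; the principal point is assumed to sit at the image center, and the input names mirror the hypothetical outputs of the usage sketch above.

```python
import numpy as np


def unproject(depth_m: np.ndarray, focal_px: float) -> np.ndarray:
    """Lift an HxW metric depth map to an HxWx3 point cloud in the camera frame.

    Standard pinhole model: X = (u - cx) * Z / f, Y = (v - cy) * Z / f, Z = depth.
    The principal point is assumed to be the image center.
    """
    h, w = depth_m.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return np.stack([x, y, depth_m], axis=-1)
```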

Performance Evaluation

The model’s contributions are validated through extensive experiments, demonstrating superior performance in comparison to prior methods across multiple dimensions. Depth Pro excels particularly in boundary accuracy and latency, with evaluations showing that it offers unparalleled precision in tracing fine structures and boundaries, significantly outperforming other state-of-the-art models such as Marigold, Depth Anything v2, and Metric3D v2. For example, Depth Pro produced sharper depth maps and more accurately traced occluding boundaries, resulting in cleaner novel view synthesis compared to other methods.
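
To make the boundary comparison concrete, one simple way to score boundary quality is to extract depth-discontinuity edges from the predicted and ground-truth maps and compute an F1 score over them. The sketch below is illustrative only; the ratio threshold and edge definition are assumptions, not the exact evaluation protocol used in the paper.

```python
import numpy as np


def depth_edges(depth: np.ndarray, ratio_thresh: float = 1.05) -> np.ndarray:
    """Mark pixels whose depth ratio to a right/bottom neighbor exceeds a threshold."""
    edges = np.zeros_like(depth, dtype=bool)
    rx = np.maximum(depth[:, 1:], depth[:, :-1]) / np.maximum(
        np.minimum(depth[:, 1:], depth[:, :-1]), 1e-6)
    edges[:, 1:] |= rx > ratio_thresh
    ry = np.maximum(depth[1:, :], depth[:-1, :]) / np.maximum(
        np.minimum(depth[1:, :], depth[:-1, :]), 1e-6)
    edges[1:, :] |= ry > ratio_thresh
    return edges


def boundary_f1(pred_depth: np.ndarray, gt_depth: np.ndarray) -> float:
    pred_e, gt_e = depth_edges(pred_depth), depth_edges(gt_depth)
    tp = np.logical_and(pred_e, gt_e).sum()
    precision = tp / max(pred_e.sum(), 1)
    recall = tp / max(gt_e.sum(), 1)
    return float(2 * precision * recall / max(precision + recall, 1e-6))
```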

Efficiency and Limitations

The vision transformer’s efficiency is further highlighted in the speed comparison: Depth Pro is one to two orders of magnitude faster than models that focus on fine-grained boundary predictions, such as Marigold and PatchFusion. It manages this without compromising on accuracy, making it well-suited for real-time applications like interactive image generation and augmented reality experiences.

Despite its strong performance, Depth Pro has some limitations. The model struggles with translucent surfaces and volumetric scattering, where defining a single pixel depth becomes ambiguous. Nonetheless, its advancements mark a significant step forward in monocular depth estimation, providing a robust foundation model that is both highly accurate and computationally efficient.

Conclusion

Overall, Depth Pro’s combination of zero-shot metric depth estimation, high resolution, sharp boundary tracing, and real-time processing capability positions it as a leading model for a range of applications in 3D vision, from image editing to virtual reality. By removing the need for metadata and enabling sharp, detailed depth maps in less than a second, Depth Pro sets a new standard for depth estimation technology, making it a valuable tool for developers and researchers in the field of computer vision.


Check out the Paper and Model on HF. All credit for this research goes to the researchers of this project.


