MarkTechPost@AI · August 2, 2024
Theia: A Robot Vision Foundation Model that Simultaneously Distills Off-the-Shelf VFMs such as CLIP, DINOv2, and ViT

Theia is a robot vision foundation paradigm that consolidates multiple existing vision foundation models (VFMs) into a single, more compact model via knowledge distillation. The resulting model learns richer representations from a variety of visual sub-problems and therefore performs better on robot learning tasks. Compared with other VFMs, Theia is cheaper to train and delivers higher-quality pre-trained visual representations, improving the efficiency of robot learning.

🤖 Theia uses knowledge distillation to consolidate multiple existing vision foundation models (VFMs), such as CLIP, DINOv2, and ViT, into a single, more compact model. This allows Theia to learn richer representations across multiple visual sub-problems and achieve better performance on robot learning tasks.

🚀 Theia's training requires far less compute than previous approaches: only the ImageNet dataset and roughly 150 GPU hours. Model size, spatial token usage, and the entropy of representation norms are identified as key performance factors for robot learning, underscoring Theia's efficiency.

💪 Theia's architecture consists of an underlying visual encoder and a suite of feature translators. The Theia representation is a set of encoded tokens corresponding to patches of the input image. These tokens can be viewed as the "building blocks" of the visual representation, capturing spatial information in the image. The researchers chose spatial tokens because the strong per-patch features of DINOv2 suggest that spatially dense representations underpin diverse visual understanding.

📊 To evaluate the quality of the pre-trained visual representations, the researchers used the simulation tasks in CortexBench, which include tasks from Habitat (ImageNav, ObjectNav, and MobilePick), Trifinger, and MuJoCo (Adroit, the DeepMind Control Suite (DMC), and MetaWorld). Some tasks use imitation learning (IL), while others, such as ImageNav and MobilePick, use reinforcement learning (RL).

🏆 The results show that consolidating multiple VFMs into a single model significantly improves performance across a range of robot learning applications. By establishing a strong correlation between the entropy of feature norms and improved downstream performance, the researchers answer a key question about which kinds of visual representations lead to better robot learning.

📚 Theia's results point to new directions for future research on improving robot learning through visual representations.

🧠 Theia shows that consolidating multiple VFMs into one model via knowledge distillation can effectively improve a robot's visual understanding and yield better performance on robot learning tasks.

Visual understanding is the process of abstracting high-dimensional visual signals such as images and videos. It spans many problems, from depth prediction and vision-language correspondence to classification and object grounding, covering tasks defined along spatial and temporal axes as well as tasks defined at coarse to fine granularity. In light of this variety, the vision community has long sought to create models well-suited to a single visual comprehension task or a small number of them. Vision foundation models (VFMs) are a class of models that have recently achieved outstanding generalization to unseen domains and new tasks.

Learning action policies from visual inputs, as in vision-based robot policy learning, requires robust and varied visual perception. Although there is no universal model for vision tasks, such policies implicitly involve numerous vision tasks, including object identification and semantic grounding, for which off-the-shelf VFMs are well suited. Yet, according to the research, generic VFMs such as CLIP typically lag behind visual representation models developed specifically for robot learning tasks. This points to a gap between what robots need to learn and what any single VFM can visually perceive. Prior work on learning foundational visual representation models for robots has focused mainly on improving training data and defining objective functions, with far less emphasis on improving the ability to perform the many implicit visual comprehension tasks involved.

Proposing a unique approach, researchers from The AI Institute and Stony Brook University advocate consolidating multiple large VFMs into a single, more compact model for robot learning. This is achieved through knowledge distillation, which enhances visual representations for robot learning, a task VFMs are not typically trained for. Knowledge distillation transfers knowledge from a large, complex model (the ‘teacher’) to a smaller, simpler model (the ‘student’) by training the student to mimic the teacher’s output. Unlike the common practice of distilling from a larger to a smaller model on the same task, the researchers distill VFMs that are tailored to different vision tasks.
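To make the distillation setup concrete, here is a minimal PyTorch sketch of training one compact student whose lightweight heads mimic several frozen VFM teachers. The class names, dimensions, and the plain smooth-L1 matching loss are illustrative assumptions, not Theia’s actual implementation (the paper’s combined loss is described further below).

```python
import torch
import torch.nn as nn

class MultiTeacherStudent(nn.Module):
    """Compact student backbone plus one translation head per frozen teacher."""

    def __init__(self, student: nn.Module, student_dim: int, teacher_dims: dict):
        super().__init__()
        self.student = student  # compact backbone being trained
        # one lightweight head per teacher, e.g. {"clip": 768, "dinov2": 1024, "vit": 768}
        self.heads = nn.ModuleDict(
            {name: nn.Linear(student_dim, dim) for name, dim in teacher_dims.items()}
        )

    def forward(self, images: torch.Tensor) -> dict:
        feats = self.student(images)  # [B, N_tokens, student_dim]
        return {name: head(feats) for name, head in self.heads.items()}

def distillation_step(model, teachers: dict, images, optimizer):
    """One training step: each head learns to reproduce its frozen teacher's features."""
    preds = model(images)
    loss = 0.0
    for name, teacher in teachers.items():
        with torch.no_grad():                       # teachers stay frozen
            target = teacher(images)                # [B, N_tokens, teacher_dim]
        loss = loss + nn.functional.smooth_l1_loss(preds[name], target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```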

Their study presents Theia, a robot vision foundation paradigm that simultaneously distills off-the-shelf VFMs such as CLIP, DINOv2, and ViT. By absorbing knowledge from numerous spatial-level visual sub-problems, Theia produces detailed representations for downstream robot learning. Theia provides higher-quality pre-trained visual representations, and thus better downstream robot learning performance, at lower computational cost than off-the-shelf VFMs and prior work.

Furthermore, Theia is remarkably efficient. Previous approaches required significantly more computation for training; Theia needs only ImageNet and approximately 150 GPU hours. Theia’s model size, spatial token usage, and the entropy of representation norms are identified as critical performance factors for robot learning, underscoring the model’s efficiency. These findings pave the way for future studies aimed at improving robot learning through visual representations.

The proposed model comprises an underlying visual encoder and a suite of feature translators. The Theia representation is a collection of encoded tokens representing patches of the input image. These tokens, which can be thought of as ‘building blocks’ of the visual representation, capture spatial information in the image. The robust per-patch features in DINOv2 suggest that spatially dense representations form the basis for diverse visual understanding, which is why the team opted for spatial tokens: before distillation, all spatial tokens are kept while the [CLS] token is discarded. The pipeline then begins with a normalization step to ensure that the differing scales of the teacher representations are properly accounted for: each teacher representation is normalized over every latent dimension using the mean and variance computed from all ImageNet training examples.
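A rough sketch of that normalization step, assuming teacher features of shape [batch, tokens, dim]; the function names and the running-sum statistics are illustrative, not the authors’ code.

```python
import torch

def compute_teacher_stats(feature_batches):
    """Accumulate per-latent-dimension mean and variance of a teacher's features
    over the ImageNet training set (feature_batches yields [B, N_tokens, D] tensors)."""
    total, total_sq, count = 0.0, 0.0, 0
    for feats in feature_batches:
        flat = feats.reshape(-1, feats.shape[-1])   # [B * N_tokens, D]
        total = total + flat.sum(dim=0)
        total_sq = total_sq + (flat ** 2).sum(dim=0)
        count += flat.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2              # E[x^2] - E[x]^2
    return mean, var

def normalize_teacher(feats, mean, var, eps=1e-6):
    """Standardize each latent dimension so all teachers share a comparable scale."""
    return (feats - mean) / torch.sqrt(var + eps)
```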

During training, the researchers ensure that the feature translators’ outputs match the teacher VFM representations. This is done by combining a cosine loss and a smooth-L1 loss between the predicted and ground-truth (teacher) representations of the same image, taking a weighted average of the two.
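As a hedged illustration, that combined objective could look like the following; the weighting parameter alpha is a placeholder, since the article does not state the exact weights used.

```python
import torch.nn.functional as F

def distill_loss(pred, target, alpha=0.5):
    """Weighted combination of cosine distance and smooth-L1 between the predicted
    (translated student) and ground-truth (normalized teacher) features, both of
    shape [B, N_tokens, D]. alpha is an assumed hyperparameter."""
    cosine_dist = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    l1 = F.smooth_l1_loss(pred, target)
    return alpha * cosine_dist + (1.0 - alpha) * l1
```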

To assess the quality of pre-trained visual representations, they employed the simulation tasks found in CortexBench. These comprise tasks from Habitat (ImageNav, ObjectNav, and MobilePick), Trifinger, and MuJoCo (Adroit, the DeepMind Control Suite (DMC), and MetaWorld). Some tasks are trained with imitation learning (IL), while others, such as ImageNav and MobilePick, use reinforcement learning (RL). The baselines considered are R3M, VIP, MVP, and VC-1 (visual representations developed for robot learning); RADIO and E-RADIO (agglomerative models for vision tasks); and the off-the-shelf vision foundation models ViT, DINOv2, and CLIP. For this experiment, all pre-trained representations are frozen.
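The frozen-representation protocol can be sketched as follows: the pre-trained encoder is kept fixed and only a small policy head is trained on each downstream task. The head architecture and the mean-pooling over tokens are assumptions for illustration, not the CortexBench setup itself.

```python
import torch
import torch.nn as nn

class FrozenFeaturePolicy(nn.Module):
    """Downstream policy trained on top of a frozen pre-trained visual encoder."""

    def __init__(self, encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False                 # the representation stays frozen
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.encoder(obs)               # [B, N_tokens, D]
        return self.policy(feats.mean(dim=1))       # pool tokens, predict an action
```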

The findings of this study demonstrate that consolidating numerous VFMs into a single model significantly improves performance across various robot learning applications. By establishing a strong correlation between the entropy of feature norms and enhanced downstream performance, the researchers answer a key question about what kinds of visual representations lead to better robot learning. This not only validates the effectiveness of Theia but also provides valuable insights for future research on optimizing visual representations for robotics. 
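For readers who want a concrete handle on that diagnostic, one simple way to estimate the entropy of feature norms is to histogram the per-token norms and compute the entropy of the resulting distribution; the binning below is an assumption for illustration, not the paper’s exact procedure.

```python
import torch

def feature_norm_entropy(feats: torch.Tensor, num_bins: int = 64) -> float:
    """feats: [B, N_tokens, D] representation; returns the entropy (in nats)
    of the empirical distribution of per-token feature norms."""
    norms = feats.flatten(0, 1).norm(dim=-1)        # [B * N_tokens]
    hist = torch.histc(norms, bins=num_bins,
                       min=norms.min().item(), max=norms.max().item())
    probs = hist / hist.sum()
    probs = probs[probs > 0]                        # drop empty bins
    return float(-(probs * probs.log()).sum())
```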


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.



