MarkTechPost@AI, July 15, 2024
Google DeepMind Researchers Present Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs

Google DeepMind researchers have proposed Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment knowledge and intuitive reasoning ability of long-context VLMs with a robust low-level navigation policy built on topological graphs. Mobility VLA can carry out multimodal user instructions, even those requiring complex reasoning, and has achieved encouraging results in the real world.

🤔 Mobility VLA is a hierarchical Vision-Language-Action (VLA) navigation policy that combines the environment knowledge and reasoning ability of long-context VLMs with a robust low-level navigation policy built on topological graphs.

🎥 Mobility VLA uses a demonstration tour video together with multimodal user instructions: a high-level VLM localizes the goal frame within the tour video, and a conventional low-level policy then uses that goal frame at each time step, together with a topological graph constructed offline from the tour frames.

🗺️ Mobility VLA addresses the fidelity problem in environment understanding by using a long-context VLM, and uses the topological graph to bridge the VLM's training distribution and the robot actions needed to solve MINT.

🏢 Mobility VLA was tested in a realistic (836 m²) office environment and a more residential setting, achieving success rates of 86% and 90%, respectively, on complex MINT tasks requiring intricate reasoning, far higher than baseline methods.

⏳ One challenge for Mobility VLA is long VLM inference time, which forces users to wait for the robot's response. The team suggests caching the demonstration tour video to improve inference speed.

Technological advancements in sensors, AI, and processing power have propelled robot navigation to new heights over the last several decades. To take robotics to the next level and make robots a regular part of our lives, many studies suggest extending the natural-language space of ObjNav and VLN to a multimodal space, so that a robot can follow commands given in both text and images at the same time. Researchers call this type of navigation task Multimodal Instruction Navigation (MIN).

MIN encompasses a wide range of activities, including exploring the surroundings and following navigation instructions. However, using a demonstration tour video that covers the entire region often allows exploration to be avoided altogether.

A Google DeepMind study presents and investigates a class of tasks called Multimodal Instruction Navigation with Tours (MINT). MINT uses demonstration tours and is concerned with carrying out multimodal user instructions. The remarkable capabilities of large Vision-Language Models (VLMs) in language and image understanding and common-sense reasoning have recently shown considerable promise for addressing MINT. However, VLMs on their own are not up to the task of solving MINT, for the following reasons:

    1. Many VLMs accept only a limited number of input images because of context-length limits, which severely restricts accurate understanding of large environments.
    2. Solving MINT requires computing robot actions, and the queries used to request such actions from robots are typically far from the distribution VLMs are (pre)trained on. Consequently, zero-shot navigation performance falls short.

To address MINT, the team introduces Mobility VLA, a hierarchical Vision-Language-Action (VLA) navigation policy that integrates the environment knowledge and intuitive reasoning of long-context VLMs with a robust low-level navigation policy built on topological graphs. The high-level VLM takes the demonstration tour video and the multimodal user instruction and locates the goal frame within the tour video. A conventional low-level policy then uses this goal frame, together with a topological graph constructed offline from the tour frames, to produce robot actions (waypoints) at each time step. Employing long-context VLMs tackles the fidelity issue in environment understanding, while the topological graph bridges the VLM's training distribution and the robot actions needed to solve MINT.
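To make the division of labor concrete, here is a minimal sketch of such a two-level policy. It assumes hypothetical interfaces (`vlm.pick_frame`, an `is_traversable` predicate, per-frame `frame_poses`) that stand in for whatever the actual system uses; it illustrates the idea rather than the paper's implementation.

```python
# Sketch of a Mobility-VLA-style hierarchy (illustrative names and APIs).
# Offline: build a topological graph whose nodes are tour frames.
# Online: a long-context VLM picks the goal frame for the instruction;
# a classical low-level policy walks the graph toward it, emitting waypoints.
import networkx as nx


def build_topological_graph(tour_frames, is_traversable):
    """Offline: one node per tour frame; edges between mutually reachable frames."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(tour_frames)))
    for i in range(len(tour_frames)):
        for j in range(i + 1, len(tour_frames)):
            if is_traversable(tour_frames[i], tour_frames[j]):
                graph.add_edge(i, j)
    return graph


def select_goal_frame(vlm, tour_frames, instruction_text, instruction_image=None):
    """High level: the long-context VLM sees the whole tour plus the user
    instruction and returns the index of the tour frame that satisfies it."""
    return vlm.pick_frame(tour_frames, instruction_text, instruction_image)  # assumed API


def next_waypoint(graph, frame_poses, current_node, goal_node):
    """Low level: follow the shortest path over the topological graph and
    return the pose of the next node as the waypoint action."""
    path = nx.shortest_path(graph, source=current_node, target=goal_node)
    next_node = path[1] if len(path) > 1 else goal_node
    return frame_poses[next_node]
```

The key design point is that only the goal-frame selection needs the VLM; everything after that is classical graph search over the tour, which keeps the per-step control loop cheap and grounded in places the robot has already seen.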

The team's testing of Mobility VLA in a realistic (836 m²) office setting and a more residential one yielded promising results. On complex MINT tasks requiring intricate reasoning, Mobility VLA achieved success rates of 86% and 90%, respectively, significantly higher than the baseline techniques. These findings point to Mobility VLA's capabilities in real-world scenarios.

Rather than exploring its surroundings autonomously, the present version of Mobility VLA depends on a demonstration tour. On the other hand, the demonstration tour offers a natural place to incorporate existing exploration methods, such as frontier-based or diffusion-based exploration.

The researchers highlight that natural user interaction is hindered by long VLM inference times. Users have to endure uncomfortable waits for robot responses because high-level VLM inference takes roughly 10-30 seconds. Caching the demonstration tour, which accounts for around 99.9 percent of the input tokens, can greatly improve inference speed.
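The suggested speed-up is easy to picture: the tour is identical for every query, so whatever the VLM computes over it can be reused. The sketch below uses a hypothetical `encode_prefix` / `generate_from_prefix` interface purely to illustrate the caching pattern; the paper does not prescribe a specific implementation.

```python
# Illustrative prefix-caching pattern (hypothetical VLM API).
# The demonstration tour (~99.9% of the input tokens) is fixed across queries,
# so its encoded state is computed once and reused; only the short user
# instruction is processed per request.
class CachedTourVLM:
    def __init__(self, vlm, tour_frames):
        self.vlm = vlm
        self.tour_prefix = vlm.encode_prefix(tour_frames)  # one-time cost

    def pick_frame(self, instruction_text, instruction_image=None):
        # Reuse the cached tour state; encode only the new instruction tokens.
        return self.vlm.generate_from_prefix(
            self.tour_prefix, instruction_text, instruction_image
        )
```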

Given the light onboard compute demand (the VLM runs in the cloud) and the fact that only RGB camera observations are required, Mobility VLA can be deployed on many robot embodiments. This potential for widespread deployment is a cause for optimism and a step forward for robotics and AI.


Check out the Paper. All credit for this research goes to the researchers of this project.
