MarkTechPost@AI 2024年07月22日
NavGPT-2: Integrating LLMs and Navigation Policy Networks for Smarter Agents

 

NavGPT-2 is a new approach that integrates large language models (LLMs) with Vision-and-Language Navigation (VLN) tasks, aiming to exploit the linguistic abilities of LLMs to generate navigational reasoning and improve interaction during robotic navigation. By combining a large vision-language model (VLM) with a navigation policy network, NavGPT-2 effectively boosts VLN capability. The system uses a topological-graph navigation policy that keeps a memory of the agent's trajectory, enabling effective backtracking, and follows a multi-stage training pipeline: visual instruction tuning first, then integration of the VLM with the navigation policy network.

🤔 NavGPT-2 aims to bridge the gap between LLM-driven navigation and specialized VLN models. It pairs the linguistic interpretative abilities of large language models (LLMs) with a dedicated navigation policy network, addressing LLMs' shortcomings in VLN tasks, such as understanding the spatial relationships between objects and the agent's position, and resolving ambiguous references from visual context.

🚀 NavGPT-2 effectively improves VLN capability by combining a large vision-language model (VLM) with a navigation policy network. The VLM processes visual observations with a Q-former, extracting image tokens that are fed into a frozen LLM to generate navigational reasoning. This approach preserves the LLM's language interpretation while compensating for its limited understanding of spatial structure.

🧭 NavGPT-2 follows a multi-stage training pipeline: visual instruction tuning first, then integration of the VLM with the navigation policy network. Evaluated on the R2R dataset, NavGPT-2 clearly outperforms previous LLM-based and zero-shot methods in both success rate and data efficiency.

🎯 By overcoming challenges such as grounding language in vision, handling ambiguous commands, and adapting to dynamic environments, NavGPT-2 paves the way for more robust and intelligent autonomous systems.

🤖 NavGPT-2 resolves the limitations of integrating LLMs into VLN tasks by effectively combining their linguistic capabilities with a specialized navigation policy. The model excels at understanding and responding to complex language instructions, processing visual information, and planning efficient navigation paths.

LLMs excel at processing textual data, while VLN primarily involves visual information. Effectively combining these modalities requires sophisticated techniques for aligning and correlating visual and textual representations. Despite significant advances in LLMs, a performance gap remains when these models are applied to VLN tasks, compared with specialized models designed specifically for navigation. LLMs may struggle with aspects of the task such as understanding the spatial relationships between objects and the agent's position, or resolving ambiguous references based on visual context.

Researchers from Adobe Research, the University of Adelaide, Australia, the Shanghai AI Laboratory, China, and the University of California, US introduced NavGPT-2 to address the integration of Large Language Models (LLMs) with Vision-and-Language Navigation (VLN) tasks. The study focuses on the underutilization of LLMs' linguistic interpretative abilities, which are crucial for generating navigational reasoning and for effective interaction during robotic navigation.

Current approaches to leveraging LLMs in VLN tasks include zero-shot methods, where LLMs are prompted with textual descriptions of the navigation environment, and fine-tuning methods, where LLMs are trained on instruction-trajectory pairs. Zero-shot methods often suffer from prompt engineering complexities and noisy data due to image captioning and summarization. Fine-tuning methods, on the other hand, fall short of VLN-specialized models’ performance due to inadequate training data and a misalignment between LLM pretraining objectives and VLN tasks. The proposed solution, NavGPT-2, aims to bridge the gap between LLM-based navigation and specialized VLN models by incorporating both LLMs and navigation policy networks effectively. 

NavGPT-2 combines a Large Vision-Language Model (VLM) with a navigation policy network to enhance VLN capabilities. The VLM processes visual observations using the Q-former, which extracts image tokens that are fed into a frozen LLM to generate navigational reasoning. This approach preserves the interpretative language capabilities of LLMs while addressing their limited understanding of spatial structures. The system employs a topological graph-based navigation policy to maintain a memory of the agent’s trajectory and enable effective backtracking. NavGPT-2’s method includes a multi-stage learning process, starting with visual instruction tuning and followed by integrating the VLM with the navigation policy network. 
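As a rough illustration only (not the authors' code), the pipeline described above — a Q-former compressing visual observations into a fixed set of image tokens, a frozen LLM emitting reasoning plus latents, and a separate policy head scoring navigable directions — might be sketched as below. Every class and function name here is hypothetical, and the "models" are trivial stand-ins:

```python
import random

random.seed(0)

DIM = 8           # embedding size (illustrative)
NUM_QUERIES = 4   # the Q-former compresses each observation to this many tokens

def q_former(view_features):
    """Hypothetical Q-former: collapse a variable number of patch features
    into a fixed set of NUM_QUERIES image tokens (here by mean pooling)."""
    pooled = [sum(f[d] for f in view_features) / len(view_features)
              for d in range(DIM)]
    return [list(pooled) for _ in range(NUM_QUERIES)]  # NUM_QUERIES x DIM

def frozen_llm(instruction, image_tokens):
    """Stand-in for the frozen LLM: consumes the instruction and image
    tokens, returns (a) a reasoning string and (b) latent features for
    the policy head. No weights are updated, mirroring the frozen setup."""
    latent = [sum(tok) / DIM for tok in image_tokens]
    reasoning = f"Following '{instruction}' with {len(image_tokens)} image tokens."
    return reasoning, latent

def policy_head(latent, candidate_directions):
    """Hypothetical navigation policy: score each navigable direction
    from the LLM latents and pick the argmax."""
    base = sum(latent)
    scores = {name: base * weight for name, weight in candidate_directions.items()}
    return max(scores, key=scores.get), scores

# One simulated navigation step over 16 fake visual patch features
views = [[random.random() for _ in range(DIM)] for _ in range(16)]
tokens = q_former(views)
reasoning, latent = frozen_llm("walk past the sofa and stop at the door", tokens)
action, scores = policy_head(latent, {"forward": 1.0, "left": 0.4, "stop": 0.1})
print(action)  # → forward
```

In NavGPT-2 itself the policy additionally consults a topological graph of visited viewpoints, so backtracking is just selecting a previously seen node; that memory is omitted from this sketch.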

The proposed model is evaluated on the R2R dataset, where NavGPT-2 demonstrates significant gains over previous LLM-based methods and zero-shot approaches in success rate and data efficiency. For instance, it surpasses the performance of NaviLLM and NavGPT and shows competitive results against state-of-the-art VLN specialists such as DUET.
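For context on the evaluation, R2R agents are conventionally scored by Success Rate (an episode succeeds if the agent stops within 3 m of the goal) and SPL (success weighted by the ratio of shortest-path length to the path actually taken). A minimal computation of both metrics, over fabricated episode data purely for illustration:

```python
SUCCESS_RADIUS = 3.0  # metres — the standard R2R success threshold

def success_rate(episodes):
    """Fraction of episodes whose final position lies within
    SUCCESS_RADIUS of the goal."""
    wins = sum(1 for e in episodes if e["dist_to_goal"] <= SUCCESS_RADIUS)
    return wins / len(episodes)

def spl(episodes):
    """Success weighted by Path Length:
    mean over episodes of success_i * shortest_i / max(taken_i, shortest_i)."""
    total = 0.0
    for e in episodes:
        success = 1.0 if e["dist_to_goal"] <= SUCCESS_RADIUS else 0.0
        total += success * e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)

# Fabricated episodes: two successes (one with a detour), one failure
episodes = [
    {"dist_to_goal": 1.2, "shortest": 10.0, "taken": 12.5},
    {"dist_to_goal": 2.9, "shortest": 8.0,  "taken": 8.0},
    {"dist_to_goal": 6.0, "shortest": 9.0,  "taken": 15.0},
]
print(round(success_rate(episodes), 3))  # → 0.667
print(round(spl(episodes), 3))           # → 0.6
```

SPL penalises the first episode's detour (10.0 m optimal vs 12.5 m taken) even though it counts as a success, which is why SPL is the stricter of the two numbers.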

In conclusion, NavGPT-2 successfully addresses the limitations of integrating LLMs into VLN tasks by effectively combining LLMs' linguistic capabilities with specialized navigational policies. It excels at understanding and responding to complex language instructions, processing visual information, and planning efficient navigation paths. By overcoming challenges like grounding language in vision, handling ambiguous commands, and adapting to dynamic environments, NavGPT-2 paves the way for more robust and intelligent autonomous systems.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


The post NavGPT-2: Integrating LLMs and Navigation Policy Networks for Smarter Agents appeared first on MarkTechPost.


Tags: NavGPT-2, Large Language Models, Vision-and-Language Navigation, Robot Navigation, Artificial Intelligence