MarkTechPost@AI · July 31, 04:52
NVIDIA AI Presents ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

The ThinkAct framework, proposed by researchers from NVIDIA and National Taiwan University, brings a breakthrough to vision-language-action (VLA) reasoning. It adopts a dual-system architecture: a multimodal large language model (MLLM) performs structured reasoning and outputs a visual plan latent, which a Transformer-based action model then conditions on to execute robot actions. The core innovation is "reinforced visual latent planning," which optimizes plan accuracy and physical feasibility through a goal reward and a trajectory reward (e.g., based on DTW distance). ThinkAct shows superior performance on robot manipulation and embodied reasoning tasks, excelling at long-horizon, highly variable tasks and few-shot adaptation. It can also detect failures and self-correct, laying a foundation for generalist intelligent robots.

✨ **Dual-system architecture for efficient reasoning and control**: ThinkAct separates "thinking" from "acting." A multimodal large language model (MLLM) performs high-level structured reasoning and generates a visual plan latent, while an action model executes low-level robot actions conditioned on that latent. This asynchronous design lets the LLM think and plan at a slow cadence while the action module runs control at a higher frequency, addressing the limitations of traditional end-to-end models in long-horizon planning and adaptability.

🚀 **Reinforced visual latent planning drives precise action**: The framework's core innovation is a reinforcement learning (RL) scheme trained with action-aligned visual rewards. A "goal reward" pushes the model's predicted start and end points to align with the demonstration trajectory, while a "trajectory reward" uses dynamic time warping (DTW) distance to match the predicted trajectory to the distribution of expert demonstrations. Together these enable ThinkAct to generate physically feasible robot actions and significantly raise task success rates.

📈 **Multi-stage training and strong experimental results**: ThinkAct is trained in multiple stages, supervised fine-tuning (SFT) followed by reinforced fine-tuning (RL), with action adaptation via imitation learning. Experiments show that ThinkAct surpasses existing SOTA methods on robot manipulation benchmarks such as SimplerEnv and LIBERO, and achieves leading planning accuracy and semantic understanding on embodied reasoning tasks such as EgoPlan-Bench2 and RoboVQA. It also enables effective few-shot adaptation, learning new skills quickly from only a handful of demonstrations.

💡 **Intelligent self-reflection and correction**: Beyond completing tasks, ThinkAct exhibits "self-reflection" and "correction." It can recognize errors during execution (such as a dropped object) and automatically revise its plan based on the most recent sequence of visual inputs to recover and finish the task, thanks to its strong reasoning and its awareness of environmental changes.

Estimated reading time: 5 minutes

Introduction

Embodied AI agents are increasingly being called upon to interpret complex, multimodal instructions and act robustly in dynamic environments. ThinkAct, presented by researchers from NVIDIA and National Taiwan University, offers a breakthrough for vision-language-action (VLA) reasoning, introducing reinforced visual latent planning to bridge high-level multimodal reasoning and low-level robot control.

Typical VLA models map raw visual and language inputs directly to actions through end-to-end training, which limits reasoning, long-term planning, and adaptability. Recent methods began to incorporate intermediate chain-of-thought (CoT) reasoning or attempt RL-based optimization, but struggled with scalability, grounding, or generalization when confronted with highly variable and long-horizon robotic manipulation tasks.

The ThinkAct Framework

Dual-System Architecture

ThinkAct consists of two tightly integrated components:

- A multimodal LLM (MLLM) that performs structured, high-level reasoning over the visual scene and the language instruction, and emits a visual plan latent.
- A Transformer-based action model that conditions on this latent to execute low-level robot actions.

This design allows asynchronous operation: the LLM “thinks” and generates plans at a slow cadence, while the action module carries out fine-grained control at higher frequency.
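The asynchronous "think slow, act fast" pattern can be sketched in a few lines. This is a minimal illustration, not ThinkAct's implementation: `reason` and `act` are hypothetical stand-ins for the MLLM reasoner and the Transformer action policy, and the cadences are assumed values.

```python
import math
import random

def reason(observation):
    """Hypothetical stand-in for the MLLM reasoner: returns a visual plan latent."""
    rng = random.Random(0)
    return [rng.gauss(0.0, 1.0) for _ in range(64)]

def act(observation, plan_latent):
    """Hypothetical stand-in for the action policy: one low-level command."""
    return [math.tanh(x) for x in plan_latent[:7]]  # e.g. a 7-DoF arm action

def control_loop(get_observation, steps=100, replan_every=20):
    """The reasoner fires once every `replan_every` steps (slow cadence);
    the action policy fires every step (fast cadence)."""
    plan_latent = None
    actions = []
    for t in range(steps):
        obs = get_observation(t)
        if t % replan_every == 0:              # slow "thinking" loop
            plan_latent = reason(obs)
        actions.append(act(obs, plan_latent))  # fast control loop
    return actions
```

Decoupling the two loops is what lets an expensive LLM forward pass coexist with high-frequency control.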

Reinforced Visual Latent Planning

A core innovation is the reinforcement learning (RL) approach leveraging action-aligned visual rewards:

- A goal reward that encourages the predicted trajectory's start and end points to align with the demonstration.
- A trajectory reward based on dynamic time warping (DTW) distance, matching the predicted trajectory to the distribution of expert demonstrations.

The total reward r blends these visual rewards with a format-correctness score, pushing the LLM to produce not only accurate answers but also plans that translate into physically plausible robot actions.
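The reward shaping described above can be sketched as follows. This is a simplified illustration under assumed functional forms and weights (the exponential shaping and the 0.9/0.1 blend are not from the paper); only the structure, goal reward plus DTW-based trajectory reward plus format score, follows the description.

```python
import math

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance between trajectories."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def goal_reward(pred, demo):
    """Rewards predicted start/end points that align with the demonstration's."""
    err = math.dist(pred[0], demo[0]) + math.dist(pred[-1], demo[-1])
    return math.exp(-err)

def traj_reward(pred, demo):
    """Rewards whole-trajectory similarity via (length-normalized) DTW distance."""
    return math.exp(-dtw_distance(pred, demo) / len(demo))

def total_reward(pred, demo, format_ok, w_visual=0.9):
    """Blend visual rewards with a format-correctness score (weights assumed)."""
    r_visual = 0.5 * (goal_reward(pred, demo) + traj_reward(pred, demo))
    return w_visual * r_visual + (1.0 - w_visual) * (1.0 if format_ok else 0.0)
```

A prediction identical to the demonstration scores the maximum; deviations anywhere along the path, not just at the endpoints, are penalized through the DTW term.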

Training Pipeline

The multi-stage training procedure includes:

1. Supervised Fine-Tuning (SFT): Cold-start with manually annotated visual-trajectory and QA data to teach trajectory prediction, reasoning, and answer formatting.
2. Reinforced Fine-Tuning: RL optimization (using Group Relative Policy Optimization, GRPO) further incentivizes high-quality reasoning by maximizing the newly defined action-aligned rewards.
3. Action Adaptation: The downstream action policy is trained with imitation learning, leveraging the frozen LLM's latent plan output to guide control across varied environments.
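GRPO's distinguishing feature is that it needs no learned value critic: it samples a group of responses per prompt and normalizes each reward against the group's own statistics. A minimal sketch of that advantage computation (the full GRPO objective also includes a clipped policy ratio and a KL penalty, omitted here):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantage: standardize each sampled response's reward
    against the mean and std of its own group (no value network needed)."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Responses that beat their group's average get positive advantages and are reinforced; below-average ones are suppressed, which is how the action-aligned rewards above steer the reasoning LLM.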

Inference

At inference time, given an observed scene and a language instruction, the reasoning module generates a visual plan latent, which then conditions the action module to execute a full trajectory—enabling robust performance even in new, previously unseen settings.

Experimental Results

Robot Manipulation Benchmarks

Experiments on the SimplerEnv and LIBERO benchmarks demonstrate ThinkAct's superiority over prior state-of-the-art manipulation methods.

Embodied Reasoning Benchmarks

On EgoPlan-Bench2, RoboVQA, and OpenEQA, ThinkAct demonstrates leading planning accuracy and strong semantic understanding of complex, multi-step instructions.

Few-Shot Adaptation

ThinkAct enables effective few-shot adaptation: with as few as 10 demonstrations, it achieves substantial success rate gains over other methods, highlighting the power of reasoning-guided planning for quickly learning new skills or environments.

Self-Reflection and Correction

Beyond task success, ThinkAct exhibits emergent behaviors: it can detect failures during execution (such as a dropped object) and replan from the most recent visual inputs to recover and complete the task.

Conclusion

NVIDIA's ThinkAct sets a new standard for embodied AI agents, proving that reinforced visual latent planning, where agents "think before they act", delivers robust, scalable, and adaptive performance in complex, real-world reasoning and robot manipulation tasks. Its dual-system design, reward shaping, and strong empirical results pave the way for intelligent, generalist robots capable of long-horizon planning, few-shot adaptation, and self-correction in diverse environments.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.


