MarkTechPost@AI 前天 09:00
NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World Environments
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

英伟达发布了Cosmos-Reason1,这是一套旨在提升AI在物理世界中感知、理解和行动能力的模型。与传统AI不同,Cosmos-Reason1专注于处理视频等感官输入,并基于物理定律生成响应。该模型通过多模态大型语言模型,结合视觉编码器,处理大量视频-文本对数据进行训练,并引入了物理AI监督微调和强化学习阶段。其核心在于双重本体论系统,用于定义和评估物理常识和具身推理。Cosmos-Reason1在导航、机器人和自动驾驶等领域具有广泛的应用前景,标志着AI在真实世界应用中的重要进展。

🧠 Cosmos-Reason1是英伟达推出的一套多模态大型语言模型,旨在提升AI的物理推理能力,特别是在导航、机器人和自动驾驶等需要与真实物理环境交互的领域。

⚙️ 该模型的训练分为两个主要阶段:物理AI监督微调(SFT)和物理AI强化学习(RL)。这种两阶段的训练方法有助于模型更好地理解物理世界的规则。

📚 Cosmos-Reason1使用一个包含约400万个注释视频-文本对的大型数据集进行训练,这些数据包括动作描述、多项选择题和长链推理轨迹。这种大规模的训练数据为模型提供了丰富的知识基础。

💡 为了更好地组织和评估AI的物理推理能力,研究人员引入了双重本体论系统。一个本体论将物理常识分为空间、时间和基本物理学等类别,另一个本体论则映射了不同智能体(如人类、机器人手臂等)的推理能力。

📈 实验结果表明,经过强化学习(RL)阶段训练后,Cosmos-Reason1在各项基准测试中均表现出色,尤其是在预测下一步行动和验证任务完成方面。这证明了结构化本体论和多模态数据在增强AI物理推理能力方面的有效性。

AI has advanced in language processing, mathematics, and code generation, but extending these capabilities to physical environments remains challenging. Physical AI seeks to close this gap by developing systems that perceive, understand, and act in dynamic, real-world settings. Unlike conventional AI that processes text or symbols, Physical AI engages with sensory inputs, especially video, and generates responses grounded in real-world physics. These systems are designed for navigation, manipulation, and interaction, relying on common-sense reasoning and an embodied understanding of space, time, and physical laws. Applications span robotics, autonomous vehicles, and human-machine collaboration, where adaptability to real-time perception is crucial.

The current AI models’ weak connection to real-world physics is a major limitation. While they perform well on abstract tasks, they often fail to predict physical consequences or respond appropriately to sensory data. Concepts like gravity or spatial relationships are not intuitively understood, making them unreliable for embodied tasks. Training directly in the physical world is costly and risky, which hampers development and iteration. This lack of physical grounding and embodied understanding is a significant barrier to deploying AI effectively in real-world applications.

Previously, tools for physical reasoning in AI were fragmented. Vision-language models linked visual and textual data but lacked depth in reasoning. Rule-based systems were rigid and failed in novel scenarios. Simulations and synthetic data often miss the nuances of real-world physics. Critically, there was no standardized framework to define or evaluate physical common sense or embodied reasoning. Inconsistent methodologies and benchmarks made progress difficult to quantify. Reinforcement learning approaches lacked task-specific reward structures, leading to models that struggled with cause-and-effect reasoning and physical feasibility.

Researchers from NVIDIA introduced Cosmos-Reason1, a suite of multimodal large language models. These models, Cosmos-Reason1-7B and Cosmos-Reason1-56B, were designed specifically for physical reasoning tasks. Each model is trained in two major phases: Physical AI Supervised Fine-Tuning (SFT) and Physical AI Reinforcement Learning (RL). What differentiates this approach is the introduction of a dual-ontology system. One hierarchical ontology organizes physical common sense into three main categories, Space, Time, and Fundamental Physics, divided further into 16 subcategories. The second ontology is two-dimensional and maps reasoning capabilities across five embodied agents, including humans, robot arms, humanoid robots, and autonomous vehicles. These ontologies are training guides and evaluation tools for benchmarking AI’s physical reasoning.

The architecture of Cosmos-Reason1 uses a decoder-only LLM augmented with a vision encoder. Videos are processed to extract visual features, which are then projected into a shared space with language tokens. This integration enables the model to reason over textual and visual data simultaneously. The researchers curated a massive dataset comprising around 4 million annotated video-text pairs for training. These include action descriptions, multiple choice questions, and long chain-of-thought reasoning traces. The reinforcement learning stage is driven by rule-based, verifiable rewards derived from human-labeled multiple-choice questions and self-supervised video tasks. These tasks include predicting the temporal direction of videos and solving puzzles with spatiotemporal patches, making the training deeply tied to real-world physical logic.

The team constructed three benchmarks for physical common sense, Space, Time, and Fundamental Physics, containing 604 questions from 426 videos. Six benchmarks were built for embodied reasoning with 610 questions from 600 videos, covering a wide range of tasks. The Cosmos-Reason1 models outperformed previous baselines, especially after the RL phase. Notably, they improved in task completion verification, predicting next plausible actions, and assessing the physical feasibility of actions. These gains were observed in both model sizes, with Cosmos-Reason1-56B showing stronger performance across most metrics. This performance improvement underscores the effectiveness of using structured ontologies and multimodal data to enhance physical reasoning in AI.

Several Key Takeaways from the Research on Cosmos-Reason1:

In conclusion, the Cosmos-Reason1 initiative demonstrates how AI can be better equipped for the physical world. It addresses key limitations in perception, reasoning, and decision-making that have hindered progress in deploying AI in embodied scenarios. The structured training pipeline, grounded in real-world data and ontological frameworks, ensures that the models are accurate and adaptable. These advancements signal a major step forward in bridging the gap between abstract AI reasoning and the needs of systems that must operate in unpredictable, real-world environments.


Check out the Paper, Project Page, Models on Hugging Face and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and Subscribe to our Newsletter.

The post NVIDIA Releases Cosmos-Reason1: A Suite of AI Models Advancing Physical Common Sense and Embodied Reasoning in Real-World Environments appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Cosmos-Reason1 物理AI 具身智能 英伟达
相关文章