MarkTechPost@AI, 6 hours ago
VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning

 

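VL-Cogito is an advanced multimodal large language model (MLLM) proposed by DAMO Academy and its partners. It introduces a robust reinforcement learning pipeline that systematically improves the model's reasoning across mathematics, science, logic, charts, and general understanding. Its core innovation is the Progressive Curriculum Reinforcement Learning (PCuRL) framework, which uses Online Difficulty Soft Weighting (ODSW) and a Dynamic Length Reward (DyLR) to address the instability and domain gaps of multimodal reasoning. PCuRL trains in stages, from easy to hard, so the model masters complex reasoning tasks step by step. The model shows notable gains on multiple benchmarks, especially in mathematical and scientific reasoning, which indicates its potential for strengthening deep analytical ability.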

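🌟 **Core innovation: the Progressive Curriculum Reinforcement Learning (PCuRL) framework** VL-Cogito's central advance is its PCuRL framework, designed to systematically address the inherent instability and domain gaps of multimodal reasoning. It contains two key innovations: Online Difficulty Soft Weighting (ODSW) and Dynamic Length Reward (DyLR). ODSW dynamically adjusts the weight of each training sample according to its difficulty and the model's current capability, so the model moves gradually from clear-cut cases to complex, challenging tasks. DyLR computes an ideal target length for each prompt, encouraging quick reasoning on simple tasks and multi-step exploration on complex ones, which balances efficiency against correctness.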

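🚀 **Training pipeline and technical details** VL-Cogito begins reinforcement learning post-training directly from the Qwen2.5-VL-7B-Instruct backbone, with no initial supervised fine-tuning (SFT). The PCuRL process is explicitly divided into three sequential reinforcement learning stages: easy, medium, and hard. In each stage the model is exposed to a range of generalization challenges, and the ODSW weighting function favors the stage's target difficulty; in the hard stage, DyLR triggers adaptive expansion of the reasoning chain. The technical configuration includes the AdamW optimizer, a low learning rate (1e-6), DeepSpeed-ZeRO3, and a set of carefully chosen hyperparameters to support efficient training.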

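📚 **Curated datasets and challenging tasks** The model uses a carefully curated training set covering 23 open-source multimodal datasets across six task categories: mathematical reasoning, logical reasoning, counting, science reasoning, chart understanding, and general image understanding. To keep the model from exploiting superficial multiple-choice cues, all samples are reformulated as open-ended questions. Trial runs with Qwen2.5-VL-7B-Instruct were used to filter out samples on which its accuracy exceeded 50%, so the training set contains only genuinely challenging tasks that push the model toward deeper reasoning.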

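🏆 **Benchmark results and ablations** VL-Cogito was evaluated on ten benchmarks, including Geometry@3K, MathVista, ChartQA, and ScienceQA. Its absolute accuracy improves markedly over the backbone model, and it achieves state-of-the-art results on 6 of the benchmarks, especially on rigorous mathematics and science tasks. Ablation experiments show that curriculum reinforcement learning alone raises the average score, that the dynamic length reward improves performance further, and that ODSW outperforms binary hard-sample filtering when the data is imbalanced or skewed.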

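💡 **Key insights and future directions** The VL-Cogito work validates several key insights: prompts of intermediate difficulty are the most beneficial for learning progress; exposure to challenging tasks catalyzes deep reasoning; and fine-grained rewards that combine correctness, format, and length are essential for nuanced, context-sensitive reasoning. In addition, cold-start reinforcement learning without SFT is feasible and efficient, and PCuRL offers a general roadmap for building robust multimodal reasoning models.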

Multimodal reasoning, where models integrate and interpret information from multiple sources such as text, images, and diagrams, is a frontier challenge in AI. VL-Cogito is a state-of-the-art Multimodal Large Language Model (MLLM) proposed by DAMO Academy (Alibaba Group) and partners, introducing a robust reinforcement learning pipeline that fundamentally upgrades the reasoning skills of large models across mathematics, science, logic, charts, and general understanding.

Core Innovations

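VL-Cogito’s approach centers on the Progressive Curriculum Reinforcement Learning (PCuRL) framework, engineered to systematically overcome the instability and domain gaps endemic to multimodal reasoning. The framework includes two innovations. Online Difficulty Soft Weighting (ODSW) dynamically adjusts each training sample's weight according to its difficulty and the model's current capability, so the model moves gradually from clear-cut cases to complex, challenging ones. Dynamic Length Reward (DyLR) computes an ideal target length for each prompt, encouraging quick reasoning on simple tasks and multi-step exploration on complex ones, which balances efficiency against correctness.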
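To make the idea of Online Difficulty Soft Weighting concrete, here is a minimal Python sketch that contrasts a soft, difficulty-dependent weight with binary filtering. It is not the authors' formula: the Gaussian-shaped `soft_weight`, the `sharpness` value, the per-stage targets in `STAGE_TARGETS`, and the choice to scale per-response advantages are all assumptions made for illustration. Only the general principle (weight a prompt by how close its measured difficulty is to the stage's target) comes from the article.

```python
import math

# Assumed per-stage target accuracies (hypothetical values, not from the paper).
STAGE_TARGETS = {"easy": 0.75, "medium": 0.5, "hard": 0.25}


def rollout_accuracy(correct_flags):
    """Fraction of sampled responses to one prompt that were judged correct."""
    return sum(correct_flags) / len(correct_flags)


def soft_weight(acc, target_acc, sharpness=4.0):
    """Hypothetical soft weight: 1.0 at target_acc, decaying smoothly with distance."""
    return math.exp(-sharpness * (acc - target_acc) ** 2)


def hard_filter(acc, low=0.25, high=0.75):
    """Binary alternative for comparison: a prompt is either kept or dropped."""
    return 1.0 if low <= acc <= high else 0.0


def weighted_advantages(advantages, correct_flags, stage):
    """Scale one prompt's per-response advantages by its soft difficulty weight."""
    acc = rollout_accuracy(correct_flags)
    w = soft_weight(acc, STAGE_TARGETS[stage])
    return [w * a for a in advantages]


if __name__ == "__main__":
    for acc in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(acc, round(soft_weight(acc, 0.5), 3), hard_filter(acc))
```

Unlike the binary filter, the soft weight never discards a prompt outright, so prompts that fall just outside a hard cutoff still contribute a reduced signal.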

Training Pipeline

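VL-Cogito’s RL post-training starts directly from the Qwen2.5-VL-7B-Instruct backbone, with no initial supervised fine-tuning (SFT) cold start required. The PCuRL process is explicitly divided into three sequential RL stages: easy, medium, and hard. In each stage the model is exposed to a range of generalization challenges while the ODSW weighting function biases training toward that stage's target difficulty; in the hard stage, DyLR additionally triggers adaptive expansion of the reasoning chain.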
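The staged schedule and the hard-stage length reward can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the rule in `target_length` (mean length of the correct rollouts, falling back to a maximum length when none are correct), the linear fall-off in `length_reward`, the default `max_len`, and the choice to return zero length reward outside the hard stage are all hypothetical.

```python
STAGES = ("easy", "medium", "hard")  # the three sequential RL stages


def target_length(lengths, correct_flags, max_len):
    """Assumed rule: mean length of the correct rollouts for this prompt.

    Falls back to max_len when no rollout is correct, which encourages
    longer exploration on prompts the model cannot yet solve.
    """
    good = [n for n, ok in zip(lengths, correct_flags) if ok]
    return sum(good) / len(good) if good else float(max_len)


def length_reward(length, target):
    """Hypothetical shape: 1.0 at the target, linear fall-off, floored at 0."""
    return max(0.0, 1.0 - abs(length - target) / target)


def stage_length_rewards(stage, lengths, correct_flags, max_len=4096):
    """Per-response length rewards; in this sketch only the hard stage uses them."""
    if stage != "hard":
        return [0.0] * len(lengths)
    target = target_length(lengths, correct_flags, max_len)
    return [length_reward(n, target) for n in lengths]


if __name__ == "__main__":
    lengths = [100, 200, 300]
    flags = [True, True, False]
    for stage in STAGES:
        print(stage, stage_length_rewards(stage, lengths, flags))
```

Because the target is computed per prompt, an easy prompt with short correct answers gets a short target, while a prompt with no correct rollout gets the longest one.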

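Technical setup details: training uses the AdamW optimizer with a low learning rate of 1e-6 and DeepSpeed-ZeRO3, together with a set of carefully chosen hyperparameters that support efficient training.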
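For readers who want to see what the reported setup looks like in practice, the fragment below builds a minimal DeepSpeed-style configuration with ZeRO stage 3 and AdamW at a learning rate of 1e-6. Only those three settings come from the article. The helper name `build_ds_config` is hypothetical, whether the authors set the optimizer through a DeepSpeed JSON file or through their training framework is not stated, and all other hyperparameters (batch sizes, rollout counts, and so on) are deliberately left out rather than guessed.

```python
import json


def build_ds_config(lr=1e-6):
    """Minimal DeepSpeed config dict: ZeRO stage 3 plus AdamW at the given lr.

    Everything not reported in the article is omitted on purpose.
    """
    return {
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "AdamW", "params": {"lr": lr}},
    }


if __name__ == "__main__":
    # Print the config as JSON, the format DeepSpeed config files use.
    print(json.dumps(build_ds_config(), indent=2))
```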

Dataset Curation and RL Data Sampling

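A meticulously curated training set covers 23 open-source multimodal datasets across six task categories: Mathematical Reasoning, Logical Reasoning, Counting, Science Reasoning, Chart Understanding, and General Image Understanding. To keep the model from exploiting superficial multiple-choice cues, all samples are reformulated as open-ended questions. Trial runs with Qwen2.5-VL-7B-Instruct are then used to filter out samples on which its accuracy exceeds 50%, so that only genuinely challenging tasks remain.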
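The two curation steps described in the article summary (open-ended reformulation and difficulty filtering against the backbone model) can be sketched as below. The function names, the record layout, and the number of trial responses per sample are hypothetical; the 50% threshold is the one the summary reports, interpreted here as "drop a sample when accuracy is strictly above 0.5".

```python
def to_open_ended(question, options, answer_letter):
    """Drop the option list and keep the answer text, so the model cannot
    lean on multiple-choice cues. `options` maps letters to answer strings."""
    return {"question": question, "answer": options[answer_letter]}


def keep_sample(correct_flags, max_acc=0.5):
    """Keep a sample only if the backbone's trial accuracy is at most max_acc."""
    acc = sum(correct_flags) / len(correct_flags)
    return acc <= max_acc


def curate(samples):
    """Filter a list of records, each carrying a 'trials' list of booleans."""
    return [s for s in samples if keep_sample(s["trials"])]


if __name__ == "__main__":
    item = to_open_ended(
        "Which organism is a decomposer?",
        {"A": "hawk", "B": "mushroom", "C": "grass"},
        "B",
    )
    print(item)
    pool = [
        {"id": 1, "trials": [True, True, True, False]},    # 75% correct: dropped
        {"id": 2, "trials": [True, False, False, False]},  # 25% correct: kept
    ]
    print([s["id"] for s in curate(pool)])
```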

Evaluation and Benchmark Results

Performance Across Benchmarks

VL-Cogito is benchmarked against both general-purpose and reasoning-oriented MLLMs on a ten-task panel, including datasets like Geometry@3K, MathVerse, MathVista, ChartQA, ScienceQA, MMMU, EMMA, and MMStar.

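| Model | Geo3K | MathVerse | MathVista | MathVision | LogicVista | ChartQA | SciQA | MMMU | EMMA | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|
| VL-Cogito (7B) | 68.7 | 53.3 | 74.8 | 30.7 | 48.9 | 83.4 | 87.6 | 52.6 | 29.1 | 66.3 |
| VL-Rethinker (7B) | 67.7 | 54.6 | 73.7 | 30.1 | 45.7 | 83.5 | 86.7 | 52.9 | 28.6 | 64.2 |
| MM-Eureka (8B) | 67.2 | 52.3 | 73.4 | 29.4 | 47.1 | 82.7 | 86.4 | 52.3 | 27.4 | 64.7 |
| Qwen2.5-VL (7B) | 61.6 | 50.4 | 69.3 | 28.7 | 44.0 | 82.4 | 85.4 | 50.9 | 24.6 | 62.5 |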
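The size of the improvement over the backbone is easier to read as per-benchmark differences. The short script below recomputes them from the four rows reported above; the scores are copied from the table, while the helper names `gains_over` and `leads` are just illustrative.

```python
BENCHMARKS = ["Geo3K", "MathVerse", "MathVista", "MathVision", "LogicVista",
              "ChartQA", "SciQA", "MMMU", "EMMA", "MMStar"]

# Scores copied from the table in the article.
SCORES = {
    "VL-Cogito (7B)":    [68.7, 53.3, 74.8, 30.7, 48.9, 83.4, 87.6, 52.6, 29.1, 66.3],
    "VL-Rethinker (7B)": [67.7, 54.6, 73.7, 30.1, 45.7, 83.5, 86.7, 52.9, 28.6, 64.2],
    "MM-Eureka (8B)":    [67.2, 52.3, 73.4, 29.4, 47.1, 82.7, 86.4, 52.3, 27.4, 64.7],
    "Qwen2.5-VL (7B)":   [61.6, 50.4, 69.3, 28.7, 44.0, 82.4, 85.4, 50.9, 24.6, 62.5],
}


def gains_over(model, baseline):
    """Per-benchmark score difference (model minus baseline), in points."""
    pairs = zip(BENCHMARKS, SCORES[model], SCORES[baseline])
    return {b: round(m - q, 1) for b, m, q in pairs}


def leads(model):
    """Benchmarks where `model` has the top score among the rows listed here."""
    return [b for i, b in enumerate(BENCHMARKS)
            if SCORES[model][i] == max(row[i] for row in SCORES.values())]


gains = gains_over("VL-Cogito (7B)", "Qwen2.5-VL (7B)")
mean_gain = round(sum(gains.values()) / len(gains), 2)

if __name__ == "__main__":
    print(gains)
    print("mean gain:", mean_gain)
    print("leads among listed rows:", leads("VL-Cogito (7B)"))
```

On these numbers the mean gain over Qwen2.5-VL (7B) is 3.56 points, with the largest gain on Geo3K (7.1) and the smallest on ChartQA (1.0). The `leads` count only compares the four rows reproduced here, so it is not a substitute for the article's state-of-the-art count over the full comparison set.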

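Component-wise Ablation

Ablation experiments show that curriculum RL alone raises the average score, that the dynamic length reward improves performance further, and that ODSW outperforms binary hard-sample filtering when the data is imbalanced or skewed.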

Reasoning Efficiency and Training Dynamics

Case Studies

VL-Cogito exhibits detailed, self-reflective, stepwise reasoning. For math, the model decomposes solutions into granular chains and actively corrects missteps, a behavior instilled by RL verification and advantage estimation [1, Figure 5]. On classification-style problems (e.g., identifying decomposers or skyscrapers in images), it methodically considers each option before boxing the answer, demonstrating strong multimodal comprehension and process reliability.

Insights and Impact

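VL-Cogito’s systematic PCuRL pipeline validates several key insights. Prompts of intermediate difficulty contribute most to learning progress. Exposure to challenging tasks catalyzes deep reasoning. Fine-grained rewards that combine correctness, format, and length are essential for nuanced, context-sensitive reasoning. Finally, cold-start RL without SFT is both feasible and efficient.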
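A reward that combines correctness, format, and length, as the article's summary of the insights describes, might be assembled along the following lines. This is a sketch under assumptions: the `\boxed{...}` answer format is inferred from the case studies' mention of boxing the answer, exact string matching stands in for whatever answer verifier the authors use, and the weights `w_format` and `w_length` are hypothetical.

```python
import re

# Matches a simple \boxed{...} answer with no nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed(response):
    """Return the content of the last \\boxed{...} in the response, or None."""
    found = BOXED.findall(response)
    return found[-1].strip() if found else None


def total_reward(response, reference, length_score, w_format=0.1, w_length=0.1):
    """Correctness plus small format and length bonuses (weights are assumed).

    length_score is expected in [0, 1], e.g. from a length-reward function.
    """
    answer = extract_boxed(response)
    format_r = 1.0 if answer is not None else 0.0
    correct_r = 1.0 if answer == reference.strip() else 0.0
    return correct_r + w_format * format_r + w_length * length_score


if __name__ == "__main__":
    print(total_reward("The area is 6, so \\boxed{6}", "6", 1.0))
    print(total_reward("The area is 6", "6", 1.0))
```

Keeping the format and length terms small relative to correctness is one way to stop the model from trading accuracy for well-formatted or length-matched but wrong answers.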

Conclusion

VL-Cogito’s architecture and training innovations set a new standard for multimodal reasoning across diverse benchmarks. The design and empirical validation of progressive curriculum RL with dynamic length rewards point toward a general roadmap for robust reasoning in multimodal models.



The post VL-Cogito: Advancing Multimodal Reasoning with Progressive Curriculum Reinforcement Learning appeared first on MarkTechPost.
