MarkTechPost@AI · March 29, 12:50
UCLA Researchers Released OpenVLThinker-7B: A Reinforcement Learning Driven Model for Enhancing Complex Visual Reasoning and Step-by-Step Problem Solving in Multimodal Systems

Researchers at the University of California, Los Angeles (UCLA) have developed OpenVLThinker-7B, a new model that strengthens the complex reasoning abilities of large vision-language models (LVLMs) by combining supervised fine-tuning (SFT) with reinforcement learning (RL). The model performs well on multi-step reasoning tasks, especially understanding charts, solving visual math problems, and interpreting diagrams. Through iterative training, OpenVLThinker-7B significantly outperforms its base model on several benchmarks, opening new possibilities for applications in education, visual analytics, and assistive technology.

💡 **Background and challenge:** Existing large vision-language models (LVLMs) are limited in multi-step reasoning and struggle with tasks that require several steps of logical deduction, such as understanding charts or solving visual math problems. These models typically cannot internally verify or correct their own reasoning, so their outputs are often inaccurate or unreliable.

⚙️ **OpenVLThinker-7B's approach:** OpenVLThinker-7B combines SFT and RL in an iterative training loop. The method first uses Qwen2.5-VL-3B to generate image captions, then feeds them into a distilled version of DeepSeek-R1 to produce structured reasoning chains; these outputs are used for SFT, teaching the model basic reasoning structure. Reinforcement learning with GRPO is then applied to refine the model's reasoning based on reward feedback.

📈 **Training process and results:** Training proceeded in multiple stages of SFT and GRPO. The SFT stage used 25,000 examples filtered from datasets such as FigureQA and Geometry3K, while the GRPO stage used 5,000 more challenging samples. On the MathVista benchmark, accuracy rose from 50.2% to 70.2%, and on MathVerse from 46.8% to 68.5%, confirming a substantial gain on multi-step reasoning tasks.

🌟 **Key strengths and applications:** OpenVLThinker-7B's core strength is its iterative structure: cycling between SFT and RL improves the model's ability to understand the relationships among images, questions, and answers. The model also reduces redundant self-reflection and produces more accurate reasoning chains, offering new solutions for education, visual analytics, and assistive technology.

Large vision-language models (LVLMs) integrate large language models with image processing capabilities, enabling them to interpret images and generate coherent textual responses. While they excel at recognizing visual objects and responding to prompts, they often falter when presented with problems requiring multi-step reasoning. Vision-language tasks like understanding charts, solving visual math questions, or interpreting diagrams demand more than recognition; they need the ability to follow logical steps based on visual cues. Despite advancements in model architecture, current systems consistently struggle to produce accurate and interpretable answers in such complex scenarios.

A major limitation in current vision-language models is their inability to perform complex reasoning that involves multiple steps of logical deduction, especially when interpreting images in conjunction with textual queries. These models often cannot internally verify or correct their reasoning, leading to incorrect or shallow outputs. Also, the reasoning chains these models follow are typically not transparent or verifiable, making it difficult to ensure the robustness of their conclusions. The challenge lies in bridging this reasoning gap, which text-only models have begun to address effectively through reinforcement learning techniques that vision-language models have yet to adopt fully.

Before this study, efforts to enhance reasoning in such systems mostly relied on standard fine-tuning or prompting techniques. Though helpful for basic tasks, these approaches often produced verbose or repetitive outputs with limited depth. Vision-language models like Qwen2.5-VL-7B showed promise due to their visual instruction-following abilities but lacked multi-step reasoning comparable to their text-only counterparts, such as DeepSeek-R1. Even when prompted with structured queries, these models struggled to reflect on their outputs or validate intermediate reasoning steps. This was a significant bottleneck, particularly for use cases requiring structured decision-making, such as visual problem-solving or educational support tools.

Researchers from the University of California, Los Angeles, introduced a model named OpenVLThinker-7B. This model was developed through a novel training method that combines supervised fine-tuning (SFT) and reinforcement learning (RL) in an iterative loop. The process started by generating image captions using Qwen2.5-VL-3B and feeding these into a distilled version of DeepSeek-R1 to produce structured reasoning chains. These outputs formed the training data for the first round of SFT, guiding the model in learning basic reasoning structures. Following this, a reinforcement learning stage using Group Relative Policy Optimization (GRPO) was applied to refine the model’s reasoning based on reward feedback. This combination enabled the model to progressively self-improve, using each iteration’s refined outputs as new training data for the next cycle.
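To make that loop concrete, below is a minimal, self-contained Python sketch of the caption-then-distill data-generation step described above. The toy captioner and reasoner stand in for Qwen2.5-VL-3B and the distilled DeepSeek-R1 mentioned in the article; all function names and the prompt format are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the caption -> distilled-reasoning -> SFT-data step.
# The captioner and reasoner below are toy stand-ins for Qwen2.5-VL-3B and
# the distilled DeepSeek-R1 used in the article; all names are illustrative.

from typing import Callable, List, Tuple

def toy_captioner(image_path: str) -> str:
    # Placeholder for a small VLM that describes the image in text.
    return f"A bar chart at {image_path} showing sales for Q1 (40) and Q2 (55)."

def toy_reasoner(prompt: str) -> Tuple[str, str]:
    # Placeholder for a text-only reasoner returning (reasoning chain, answer).
    return ("Step 1: read both bars. Step 2: 55 > 40, so Q2 is higher.", "Q2")

def build_sft_examples(captioner: Callable[[str], str],
                       reasoner: Callable[[str], Tuple[str, str]],
                       items: List[Tuple[str, str]]) -> List[dict]:
    """Caption each image, feed caption + question to the text reasoner,
    and keep (question, reasoning, answer) triples as SFT targets."""
    examples = []
    for image_path, question in items:
        caption = captioner(image_path)
        reasoning, answer = reasoner(f"Image: {caption}\nQuestion: {question}")
        examples.append({"question": question,
                         "reasoning": reasoning,
                         "answer": answer})
    return examples

if __name__ == "__main__":
    items = [("chart_001.png", "Which quarter had higher sales, Q1 or Q2?")]
    print(build_sft_examples(toy_captioner, toy_reasoner, items))
```

In each subsequent iteration, the fine-tuned model itself generates the reasoning traces, so the quality of the SFT data improves along with the model.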

The method involved careful data curation and multiple training phases. In the first iteration, 25,000 examples were used for SFT, sourced from datasets like FigureQA, Geometry3K, TabMWP, and VizWiz. These examples were filtered to remove overly verbose or redundant reflections, improving training quality. GRPO was then applied to a smaller, more difficult dataset of 5,000 samples. This led to a performance increase from 62.5% to 65.6% accuracy on the MathVista benchmark. In the second iteration, another 5,000 high-quality examples were used for SFT, raising accuracy to 66.1%. A second round of GRPO pushed performance to 69.4%. Across these phases, the model was evaluated on multiple benchmarks (MathVista, MathVerse, and MathVision), showing consistent performance gains with each iteration.
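For context on the GRPO stage, the sketch below shows the group-relative advantage that gives the method its name: several responses are sampled per question, each receives a reward, and each reward is scored against the group mean. This is a generic illustration of GRPO, not the authors' exact implementation or hyperparameters.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Score each sampled response relative to its group: subtract the group
    mean reward and divide by the group standard deviation. GRPO uses this
    in place of a learned value function."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four reasoning chains sampled for one visual math question,
# rewarded 1.0 when the final answer is correct and 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Because only the relative ranking within each sampled group matters, chains that reach the correct answer get a positive advantage and are reinforced, while incorrect ones are pushed down.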

Quantitatively, OpenVLThinker-7B outperformed its base model, Qwen2.5-VL-7B, significantly. On MathVista, it reached 70.2% accuracy compared to the base model’s 50.2%. On MathVerse, the improvement was from 46.8% to 68.5%. MathVision full test accuracy rose from 24.0% to 29.6%, and MathVision testmini improved from 25.3% to 30.4%. These improvements indicate that the model learned to follow reasoning patterns and generalized better to unseen multimodal tasks. Each iteration of training contributed measurable gains, showcasing the strength of combining fine-tuning with reward-based learning in a looped structure.

The core of this model’s strength lies in its iterative structure. Rather than relying solely on vast datasets, it focuses on quality and structure. Each cycle of SFT and RL improves the model’s capacity to understand the relationship between images, questions, and answers. Self-verification and correction behaviors, initially lacking in standard LVLMs, emerged as a byproduct of reinforcement learning with verifiable reward signals. This allowed OpenVLThinker-7B to produce reasoning traces that were logically consistent and interpretable. Even subtle improvements, such as reduced redundant self-reflections or increased accuracy with shorter reasoning chains, contributed to its overall performance gains.
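The "verifiable reward signals" mentioned above can be as simple as checking whether the model's final answer matches the reference. A hedged sketch of such a reward function follows; the "Final answer:" extraction pattern is an assumption for illustration, not the paper's exact recipe.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Reward 1.0 only if the model's final answer matches the reference.
    The 'Final answer:' extraction pattern is an illustrative assumption."""
    match = re.search(r"final answer[:\s]*([^\n]+)", model_output, re.IGNORECASE)
    predicted = (match.group(1) if match else model_output).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

print(verifiable_reward("Step 1: ...\nFinal answer: Q2", "Q2"))  # 1.0
print(verifiable_reward("I think the answer is Q1", "Q2"))       # 0.0
```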

Check out the Paper, the Model on Hugging Face, and the GitHub Page. All credit for this research goes to the researchers of this project.
