MarkTechPost@AI · Feb 05
Deep Agent Released R1-V: Reinforcing Super Generalization in Vision-Language Models with Cost-Effective Reinforcement Learning to Outperform Larger Models

Deep Agent has released R1-V, a novel reinforcement learning approach designed to improve the generalization ability of vision-language models (VLMs) while remaining cost-effective. R1-V addresses the challenge of robust generalization beyond training data by using reinforcement learning to guide the model toward learning generalizable skills rather than memorizing training examples. It focuses in particular on developing strong visual counting abilities, which are essential in AI applications such as image recognition, autonomous driving systems, and visual reasoning. Experiments show that R1-V outperforms conventional chain-of-thought supervised fine-tuning (CoT-SFT) in both effectiveness and robustness on out-of-distribution (OOD) data, at an extremely low training cost.

🎯 R1-V's core objective is to strengthen VLMs' generalization beyond their training datasets: reinforcement learning guides the model to learn generalizable skills rather than merely memorizing training samples, addressing the difficulty VLMs face in novel scenarios.

💰 R1-V is remarkably training-efficient: a relatively small 2-billion-parameter model outperforms a 72-billion-parameter model on out-of-distribution (OOD) tests, showing that model size is not the sole determinant of performance and that training methodology and reinforcement learning strategy are crucial. The model was trained on eight A100 GPUs for 30 minutes at a total compute cost of just $2.62, an attractive option for researchers and developers seeking high performance without large compute budgets.

📚 R1-V's success also rests on carefully curated training data: the model was trained on the CLEVR-70k and R1-Distilled Visual Reasoning datasets, which are designed to encourage visual reasoning and robust decision-making, ensuring the model develops a deep understanding of visual relationships and logical reasoning rather than simply learning to recognize patterns in a given dataset.

🌐 R1-V supports open-source AI research by publicly releasing its code, model weights, datasets, and training scripts, enabling the AI research community to further refine and improve vision-language modeling.

Vision-language models (VLMs) face a critical challenge in achieving robust generalization beyond their training data while remaining computationally and cost efficient. Existing approaches, such as chain-of-thought supervised fine-tuning (CoT-SFT), often lead to overfitting, where models perform well on seen data but struggle with new, unseen scenarios. This limitation reduces their effectiveness in applications that demand adaptability, such as autonomous systems, medical imaging, and visual reasoning tasks. Moreover, the prevailing assumption is that increasing model size is the key to improved performance. A more efficient training paradigm that enhances generalization, minimizes overfitting, and reduces computational cost has become crucial for advancing VLMs.

Deep Agent released R1-V to resolve some of the above concerns. This novel reinforcement learning approach enhances the generalization ability of VLMs while being cost-effective. This approach demonstrates how reinforcement learning with verifiable rewards (RLVR) can outperform traditional CoT-SFT in effectiveness and robustness when dealing with out-of-distribution (OOD) data.
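The article includes no code, but the defining feature of RLVR is that the reward is a deterministic, programmatically checkable function rather than a learned reward model. A minimal sketch for a counting task (the function name and the last-integer answer-extraction heuristic are illustrative assumptions, not R1-V's actual implementation):

```python
import re

def count_reward(model_output: str, ground_truth: int) -> float:
    """Verifiable reward for a visual-counting answer.

    Returns 1.0 only if the last integer in the model's response
    matches the ground-truth object count, else 0.0. Because the
    check is a deterministic program, it cannot reward fluent but
    wrong reasoning the way a learned reward model might.
    """
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0  # no numeric answer given
    return 1.0 if int(numbers[-1]) == ground_truth else 0.0
```

In RLVR training, this scalar replaces a learned reward model: correct answers are reinforced regardless of how the reasoning is phrased, which discourages memorizing surface patterns from the training set.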

The main objective of the R1-V approach is to enhance VLMs’ ability to generalize beyond their training datasets. R1-V tackles this issue by employing reinforcement learning techniques that guide the model to learn generalizable skills rather than memorizing training examples. In particular, it focuses on teaching VLMs to develop robust visual counting abilities, an essential skill in many AI applications, including image recognition, autonomous systems, and visual reasoning.

A major highlight of R1-V is its training efficiency. Despite utilizing a relatively small model with only 2 billion parameters, R1-V performs better than a significantly larger 72 billion parameter model in OOD tests. This demonstrates that model size is not the sole determinant of performance; the training methodology and reinforcement learning strategies are crucial in enhancing a model’s capabilities.

R1-V was trained on eight A100 GPUs for 30 minutes, with a total computational cost of only $2.62. This cost-effectiveness makes it an attractive alternative for researchers and developers who wish to achieve high performance without extensive computational resources. R1-V also stands out due to its reliance on a curated training dataset. The model was trained using CLEVR-70k and R1-Distilled Visual Reasoning datasets, specifically designed to encourage visual reasoning and robust decision-making. Using these datasets ensures that the model develops a deep understanding of visual relationships and logical reasoning rather than simply learning to recognize patterns from a given dataset.
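As a quick sanity check on the reported budget, the figures imply a per-GPU-hour price (back-of-the-envelope arithmetic, not stated in the article):

```python
# Reported training budget for R1-V (figures from the article).
gpus = 8
hours = 0.5            # 30 minutes
total_cost_usd = 2.62

gpu_hours = gpus * hours
price_per_gpu_hour = total_cost_usd / gpu_hours
print(f"{gpu_hours:.1f} GPU-hours at ${price_per_gpu_hour:.3f} per GPU-hour")
```

Four GPU-hours at roughly $0.66 each is in line with low-cost cloud A100 rates, underscoring just how small the training run is.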

In conclusion, the development of R1-V supports open-source AI research by making its code, model weights, datasets, and training scripts publicly available, allowing the AI research community to refine and improve vision-language modeling. R1-V's reinforcement learning approach enables rapid learning of patterns and structures in data, yielding high performance at minimal computational cost. This challenges the assumption that extensive training and massive datasets are necessary for state-of-the-art AI performance; instead, efficient training methodologies can reduce computational demands while matching or surpassing traditional results.


Check out the GitHub page. All credit for this research goes to the researchers of this project.




Tags: Vision-Language Models, Reinforcement Learning, Generalization, R1-V, Open-Source AI