MarkTechPost@AI · 19 hours ago
Incorrect Answers Improve Math Reasoning? Reinforcement Learning with Verifiable Rewards (RLVR) Surprises with Qwen2.5-Math

A research article on reinforcement learning with verifiable rewards (RLVR) explores whether inaccurate or even random feedback signals can improve language model performance on mathematical reasoning tasks, particularly for the Qwen model family. The study finds that Qwen models improve substantially at math reasoning after training on incorrect answers or random rewards, while the same phenomenon is not observed in other models such as Llama3.

🧠 At the core of the study is the RLVR method, which trains language models using automatic signals, such as mathematical correctness or syntactic features, as feedback. Unlike traditional approaches that rely on human feedback, RLVR aims to use automated feedback to strengthen a model's reasoning abilities, especially in domains like mathematics.

📈 The researchers tested a range of reward signals, including ground-truth answers, majority voting, format rewards, random rewards, and incorrect-answer rewards. Surprisingly, even entirely spurious signals such as random rewards and incorrect-answer rewards significantly improved Qwen model performance. For example, Qwen2.5-Math-7B gained 24.6% on the MATH-500 dataset even when trained with incorrect labels.

💡 Qwen models exhibit a distinctive behavior called "code reasoning": they tend to generate math solutions structured like code. This behavior became more frequent during training and correlated with higher accuracy, suggesting that spurious rewards may activate latent capabilities acquired during pretraining rather than introducing new reasoning skills.

⚠️ However, this positive effect is not universal. On non-Qwen models such as Llama3 and OLMo2, the same reward strategies produced no comparable gains and in some cases degraded performance. This underscores the need to validate RLVR findings across different models rather than relying solely on Qwen-based results.

📚 The conclusion highlights how differently RLVR behaves across model families and cautions researchers against generalizing RLVR evaluations from Qwen results alone. The findings emphasize that a model's architectural sensitivity to reward signals must be considered when developing and applying RLVR.

In natural language processing (NLP), RL methods, such as reinforcement learning with human feedback (RLHF), have been utilized to enhance model outputs by optimizing responses based on feedback signals. A specific variant, reinforcement learning with verifiable rewards (RLVR), extends this approach by utilizing automatic signals, such as mathematical correctness or syntactic features, as feedback, enabling the large-scale tuning of language models. RLVR is especially interesting because it promises to enhance models’ reasoning abilities without needing extensive human supervision. This intersection of automated feedback and reasoning tasks forms an exciting area of research, where developers aim to uncover how models can learn to reason mathematically, logically, or structurally using limited supervision.
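To make the idea of a "verifiable reward" concrete, here is a minimal sketch of an automatic reward function for mathematical correctness: it extracts the model's final boxed answer and compares it to a reference answer. The boxed-answer convention and function name are assumptions for illustration, not the reward code used in the paper.

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final boxed answer matches the reference, else 0.0.

    Illustrative stand-in for the kind of automatic verifier RLVR relies on;
    the paper's actual reward implementation may differ.
    """
    # Extract the content of the last \boxed{...} expression, a common
    # convention for final answers on math benchmarks such as MATH-500.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# Example: verifiable_reward(r"... so the answer is \boxed{42}", "42") -> 1.0
```

Because such a check can be computed automatically at scale, it can stand in for human judgments during RL fine-tuning.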

A persistent challenge in machine learning is building models that can reason effectively under minimal or imperfect supervision. In tasks like mathematical problem-solving, where the correct answer might not be immediately available, researchers grapple with how to guide a model’s learning. Models often learn from ground-truth data, but it’s impractical to label vast datasets with perfect accuracy, particularly in reasoning tasks that require understanding complex structures like proofs or programmatic steps. Consequently, there’s an open question about whether models can learn to reason if they are exposed to noisy, misleading, or even incorrect signals during training. This issue is significant because models that overly rely on perfect feedback may not generalize well when such supervision is unavailable, thereby limiting their utility in real-world scenarios.

Several existing techniques aim to enhance models’ reasoning abilities through reinforcement learning (RL), with RLVR being a key focus. Traditionally, RLVR has used “ground truth” labels, correct answers verified by humans or automated tools, to provide rewards during training. Some approaches have relaxed this requirement by using majority vote labels or simple format-based heuristics, such as rewarding answers that follow a specific output style. Other methods have experimented with random rewards, offering positive signals without considering the correctness of the answer. These methods aim to explore whether models can learn even with minimal guidance, but they mostly concentrate on specific models, such as Qwen, raising concerns about generalizability across different architectures.
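The reward variants described above can be pictured as simple functions over a model completion. The sketch below is a hedged illustration under assumed conventions (boxed final answers, a supplied wrong label for the "incorrect" condition); the paper's exact reward definitions may differ.

```python
import random
import re
from collections import Counter

BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a completion, if any."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None

def ground_truth_reward(completion: str, label: str) -> float:
    """Reward agreement with a human-verified correct answer."""
    return 1.0 if extract_answer(completion) == label else 0.0

def majority_vote_reward(completion: str, sampled_completions: list[str]) -> float:
    """Reward agreement with the majority answer among sampled completions,
    removing the need for a verified label."""
    answers = [a for a in map(extract_answer, sampled_completions) if a is not None]
    if not answers:
        return 0.0
    majority, _ = Counter(answers).most_common(1)[0]
    return 1.0 if extract_answer(completion) == majority else 0.0

def format_reward(completion: str) -> float:
    """Reward only the presence of a boxed expression, ignoring correctness."""
    return 1.0 if BOXED.search(completion) else 0.0

def random_reward(_completion: str, p: float = 0.5) -> float:
    """Positive signal with probability p, independent of the answer."""
    return 1.0 if random.random() < p else 0.0

def incorrect_reward(completion: str, wrong_label: str) -> float:
    """Reward matches to a deliberately wrong label (one reading of the
    'incorrect reward' condition; the paper's construction may differ)."""
    return 1.0 if extract_answer(completion) == wrong_label else 0.0
```

The spectrum runs from fully informative (ground truth) to fully spurious (random, incorrect), which is exactly the axis the study varies.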

Researchers from the University of Washington, the Allen Institute for AI, and UC Berkeley investigate this question by testing various reward signals on Qwen2.5-Math, a family of large language models fine-tuned for mathematical reasoning. They tested ground-truth rewards, majority-vote rewards, format rewards based on boxed expressions, random rewards, and incorrect rewards. Remarkably, they observed that even completely spurious signals, like random rewards and rewards for wrong answers, could lead to substantial performance gains in Qwen models. For example, training Qwen2.5-Math-7B on MATH-500 with ground-truth rewards yielded a 28.8% improvement, while using incorrect labels resulted in a 24.6% gain. Random rewards still produced a 21.4% boost, and format rewards led to a 16.4% improvement. Majority-vote rewards provided a 26.5% accuracy gain. These improvements were not limited to a single model; Qwen2.5-Math-1.5B also showed strong gains: format rewards boosted accuracy by 17.6%, and incorrect labels by 24.4%. However, the same reward strategies failed to deliver similar benefits on other model families, such as Llama3 and OLMo2, which showed minimal or negative changes when trained with spurious rewards. For instance, Llama3.1-8B saw performance drops of up to 8.5% under certain spurious signals, highlighting the model-specific nature of the observed improvements.

The research team’s approach involved using RLVR training to fine-tune models with these varied reward signals, replacing the need for ground-truth supervision with heuristic or randomized feedback. They found that Qwen models, even without access to correct answers, could still learn to produce high-quality reasoning outputs. A key insight was that Qwen models tended to exhibit a distinct behavior called “code reasoning”, generating math solutions structured like code, particularly in Python-like formats, regardless of whether the reward signal was meaningful. This code reasoning tendency became more frequent over training, rising from 66.7% to over 90% in Qwen2.5-Math-7B when trained with spurious rewards. Answers that included code reasoning showed higher accuracy rates, often around 64%, compared to just 29% for answers without such reasoning patterns. These patterns emerged consistently, suggesting that spurious rewards may unlock latent capabilities learned during pretraining rather than introducing new reasoning skills.
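The article does not spell out how "code reasoning" was detected, so the following is a hypothetical heuristic for flagging Python-like solution structure in completions and computing the fraction of responses that contain it, the statistic reported as rising from 66.7% to over 90%. The pattern list and function names are illustrative assumptions, not the authors' classifier.

```python
import re

# Hypothetical patterns indicating Python-like solution structure.
CODE_PATTERNS = [
    r"```python",           # fenced Python code block
    r"\bdef\s+\w+\s*\(",    # function definition
    r"\bfor\s+\w+\s+in\b",  # for-loop
    r"\bimport\s+\w+",      # import statement
    r"\bprint\s*\(",        # print call
]

def looks_like_code_reasoning(completion: str) -> bool:
    """Return True if the completion contains Python-like solution structure."""
    return any(re.search(p, completion) for p in CODE_PATTERNS)

def code_reasoning_rate(completions: list[str]) -> float:
    """Fraction of completions flagged as code reasoning."""
    if not completions:
        return 0.0
    flagged = sum(looks_like_code_reasoning(c) for c in completions)
    return flagged / len(completions)
```

Tracking such a rate over training, alongside per-group accuracy, is one simple way to reproduce the kind of analysis the paper reports (64% accuracy with code reasoning versus 29% without).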

Performance data underscored the surprising robustness of Qwen models. Gains from random rewards (21.4% on MATH-500) and incorrect labels (24.6%) nearly matched the ground-truth reward gain of 28.8%. Similar trends appeared across tasks, such as AMC, where format, wrong, and random rewards produced around an 18% improvement, only slightly lower than the 25% improvement from ground-truth or majority-vote rewards. Even on AIME2024, spurious rewards like format (+13.0%), incorrect (+8.7%), and random (+6.3%) led to meaningful gains, though the advantage of ground-truth labels (+12.8%) remained evident, particularly for AIME2025 questions created after model pretraining cutoffs.

Several key takeaways from the research include:

- Spurious rewards produced large gains on Qwen2.5-Math models: random rewards and incorrect labels yielded 21.4% and 24.6% improvements on MATH-500, close to the 28.8% from ground-truth rewards.
- Qwen models increasingly relied on "code reasoning" during RLVR training, rising from 66.7% to over 90% of responses, and this pattern correlated with higher accuracy.
- The gains did not transfer to other model families: Llama3 and OLMo2 showed flat or negative changes under the same spurious rewards.
- RLVR results should therefore be validated across diverse model families before being generalized.

In conclusion, these findings suggest that while Qwen models can leverage spurious signals to improve performance, the same is not true for other model families. Non-Qwen models, such as Llama3 and OLMo2, showed flat or negative performance changes when trained with spurious signals. The research emphasizes the importance of validating RLVR methods on diverse models rather than relying solely on Qwen-centric results, as many recent papers have done.


Check out the Paper, Official Release and GitHub Page. All credit for this research goes to the researchers of this project.



Related tags

Reinforcement Learning, RLVR, Qwen, Mathematical Reasoning, Language Models