MarkTechPost@AI · April 6, 01:45
Scalable Reinforcement Learning with Verifiable Rewards: Generative Reward Modeling for Unstructured, Multi-Domain Tasks

This article examines Reinforcement Learning with Verifiable Rewards (RLVR) as a way to improve the reasoning and coding abilities of large language models (LLMs). The work focuses on extending RLVR from structured tasks such as math and coding to more complex, unstructured domains such as medicine and education. By introducing generative reward modeling, LLMs can produce judgments and justifications, enabling reinforcement learning on tasks with noisy or ambiguous labels. The study shows that a compact 7B model trained as a cross-domain reward verifier delivers significant performance gains, particularly on reasoning tasks, without requiring extensive domain-specific annotation. The researchers also release a large-scale dataset to support further research on multi-domain RLVR.

🧠 RLVR strengthens LLM reasoning and coding through verifiable rewards, performing especially well on structured tasks such as math and coding.

💡 The work introduces generative reward modeling, in which LLMs produce judgments and justifications for tasks with noisy or ambiguous labels, extending the scope of RLVR.

🔬 The researchers train a cross-domain reward verifier with a compact 7B model, achieving significant performance gains without extensive domain-specific annotation.

📊 The results show that on free-form tasks, model-based soft rewards outperform rule-based methods as well as supervised fine-tuning (SFT).

📚 The study uses two large-scale Chinese QA datasets, with 773k free-form math questions and 638k multi-subject college-level questions, to test the RLVR approach.

Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing LLMs’ reasoning and coding abilities, particularly in domains where structured reference answers allow clear-cut verification. This approach relies on reference-based signals to determine if a model’s response aligns with a known correct answer, typically through binary correctness labels or graded scores. RLVR has mainly been applied to areas like math and coding, where rule-based or tool-assisted verification is straightforward. However, expanding RLVR to more complex and less structured tasks has been difficult due to challenges in verifying open-ended or ambiguous reference responses. Although generative models and closed-source LLMs like GPT-4o have been explored as verifiers, these solutions often remain domain-specific and require extensive annotated datasets for training.
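In structured domains, that reference-based signal can be as simple as string matching against a known answer. The minimal Python sketch below illustrates such a rule-based binary reward; the `\boxed{}` answer format and function names are illustrative assumptions, not the paper's exact verifier. The second call shows why free-form paraphrases break this kind of check.

```python
# Minimal sketch of a rule-based binary reward for standard RLVR on math-style tasks.
# The \boxed{...} answer convention and exact-match rule are illustrative assumptions.
import re

def extract_boxed_answer(response: str) -> str | None:
    """Pull the final \\boxed{...} answer out of a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, reference: str) -> float:
    """Binary correctness label: 1.0 on exact match with the reference, else 0.0."""
    predicted = extract_boxed_answer(response)
    if predicted is None:
        return 0.0
    return 1.0 if predicted == reference.strip() else 0.0

# A structured answer verifies cleanly...
print(rule_based_reward(r"The answer is \boxed{42}", "42"))  # 1.0
# ...but a correct free-form paraphrase gets no credit, which is the gap
# that generative verifiers are meant to close.
print(rule_based_reward("Forty-two is the result.", "42"))   # 0.0
```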

Recent developments aim to broaden RLVR applications by introducing generative reward modeling, where LLMs use their generative abilities to produce judgments and justifications. These models can be trained without detailed rationales, instead relying on the confidence of the verifier’s outputs to generate stable reward signals. This technique supports reinforcement learning in tasks with noisy or ambiguous labels. Furthermore, researchers are exploring RLVR in a wider variety of domains using more free-form reference answers—sourced from expert annotations and pretraining data or generated by LLMs—moving beyond narrowly defined tasks like math and logic puzzles. These efforts mark a significant step toward scalable and domain-general RLVR training.
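One common way to turn a generative verifier's confidence into a reward signal is to read off the probability it assigns to a "Yes" judgment token. The sketch below illustrates that idea under assumed details: the prompt template, the Qwen/Qwen2.5-7B-Instruct checkpoint, and the single-token "Yes"/"No" readout are placeholders, not the paper's exact setup.

```python
# Hedged sketch: a soft reward from a generative verifier's confidence.
# The verifier is asked whether the response matches the reference, and the
# probability mass on "Yes" (vs. "No") is used as the reward in [0, 1].
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # placeholder verifier checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
verifier = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

def soft_reward(question: str, reference: str, response: str) -> float:
    """Return P("Yes") for the judgment token as a soft reward."""
    prompt = (
        f"Question: {question}\nReference answer: {reference}\n"
        f"Candidate response: {response}\n"
        "Is the candidate response correct? Answer Yes or No.\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = verifier(**inputs).logits[0, -1]  # next-token logits
    # Take the first sub-token of " Yes"/" No" as the judgment token.
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # verifier's confidence that the answer is correct
```

A binary variant of the same verifier simply thresholds or decodes the judgment instead of reading its probability.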

Tencent AI Lab and Soochow University researchers are exploring extending RLVR to complex, unstructured domains like medicine, chemistry, and education. They show that binary correctness judgments remain consistent across LLMs when expert-written references are available. To address the limitations of binary rewards in free-form tasks, they introduce soft, generative model-based reward signals. Using compact 7B models, they train cross-domain reward verifiers without requiring extensive domain-specific annotation. Their RLVR framework significantly outperforms top open-source models in reasoning tasks and scales effectively. They also release a 570k-example dataset to support further research in multi-domain RLVR.

The method uses expert-written reference answers to guide reward estimation for reinforcement learning. Responses are evaluated using a generative LLM verifier, which outputs binary (0/1) or soft rewards based on the likelihood of correctness. Rewards are normalized using z-score normalization for stable training and better learning dynamics. The authors train a compact (7B) generative reward model using judgments collected during RL exploration to avoid relying solely on large models. These binary labels are obtained from a larger LLM and used to fine-tune the smaller verifier. This approach balances performance and efficiency while increasing robustness to noise and formatting variations.
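As a rough illustration of the normalization step, the snippet below standardizes a batch of verifier rewards with a z-score before they feed into the policy update. Treating one batch as the sampled rollouts of a single prompt is an assumption made for the example, not a detail taken from the paper.

```python
# Minimal sketch of z-score reward normalization for stable RL training.
import numpy as np

def normalize_rewards(rewards: list[float], eps: float = 1e-6) -> np.ndarray:
    """Standardize raw verifier rewards to zero mean and unit variance."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: soft rewards from the verifier for four rollouts of the same prompt.
# After normalization the values sum to ~0 with unit variance, so rollouts that
# beat the batch average receive a positive learning signal.
print(normalize_rewards([0.92, 0.15, 0.60, 0.88]))
```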

The study uses two large-scale Chinese QA datasets—one with 773k free-form math questions across school levels and another with 638k multi-subject college-level questions from ExamQA. These datasets feature complex, unstructured answers that challenge rule-based reward methods. The researchers trained a 7B reward model (RM-7B) using 160k distilled samples and tested various RL approaches. Results show that RL with model-based rewards outperforms rule-based methods and supervised fine-tuning (SFT), especially in reasoning tasks. Notably, RM-7B achieves performance close to the larger 72B model, highlighting its efficiency. Binary rewards outperform soft rewards in rule-based settings due to semantic mismatch issues.

In conclusion, the study simplifies reward modeling by training a generative model to output binary scores (1 or 0) without relying on chain-of-thought reasoning. While CoT aids in reasoning, its necessity for verifying semantic similarity remains unclear. Unlike past work that relied on format-based scoring, this approach avoids strict answer formatting, reducing manual effort. The research extends RLVR beyond structured domains to areas like medicine and economics, where reference answers are less defined. Using a 7B model, it shows that soft, model-based rewards enhance performance in free-form tasks, outperforming larger models and improving RLVR’s adaptability and scalability.


Check out the Paper. All credit for this research goes to the researchers of this project.



