MarkTechPost@AI · April 7, 11:50
Scalable and Principled Reward Modeling for LLMs: Enhancing Generalist Reward Models (RMs) with SPCT and Inference-Time Optimization

This article examines how Self-Principled Critique Tuning (SPCT) and inference-time optimization can strengthen reward models (RMs) for large language models (LLMs). The researchers' DeepSeek-GRM model performs strongly across multiple benchmarks, particularly in inference-time scaling. SPCT enables the reward model to generate adaptive principles and accurate critiques, improving reward quality. Through parallel sampling and a meta reward model, DeepSeek-GRM achieves high performance without relying on larger model sizes, providing a more reliable, general-purpose reward system for LLMs.

💡 Challenges for reward models: Current high-quality reward models are largely built on rule-based systems or verifiable tasks; in general applications, reward criteria are diverse and subjective and lack clear ground truths.

⚙️ The core of SPCT: SPCT initializes principle and critique generation via rejective fine-tuning, then refines it with rule-based reinforcement learning, enabling the reward model to dynamically generate adaptive principles and accurate critiques and improving reward granularity.

🚀 Inference-time optimization: The researchers use parallel sampling and a meta reward model to boost inference performance. The meta reward model filters out low-quality outputs, enabling efficient inference-time scaling and lifting DeepSeek-GRM's performance.

🏆 Advantages of DeepSeek-GRM: DeepSeek-GRM outperforms existing methods across multiple benchmarks, especially in inference-time scaling, showing that SPCT combined with inference optimization can significantly improve reward-model performance without increasing model size.

🔬 Future directions: Future work includes integrating GRMs into RL pipelines, co-scaling them with policy models, and using them as reliable offline evaluators.

Reinforcement Learning (RL) has become a widely used post-training method for LLMs, enhancing capabilities like human alignment, long-term reasoning, and adaptability. A major challenge, however, is generating accurate reward signals in broad, less structured domains, as current high-quality reward models are largely built on rule-based systems or verifiable tasks such as math and coding. In general applications, reward criteria are more diverse and subjective, lacking clear ground truths. To address this, generalist reward models (RMs) are being explored for broader applicability. However, these models must balance input flexibility and scalability during inference, particularly in producing reliable, high-quality rewards across varied tasks and domains.

Existing reward modeling approaches include scalar, semi-scalar, and generative techniques, each with trade-offs in flexibility and inference-time performance. For instance, pairwise models are limited to relative comparisons, while scalar models struggle with producing diverse feedback. Generative reward models (GRMs) offer richer, more flexible outputs, making them better suited for evaluating varied responses. Recent work has explored training GRMs through offline RL, integrating tools and external knowledge to improve reward quality. However, few methods directly address how RMs can scale efficiently during inference. This has led to research on methods like sampling-based scaling, chain-of-thought prompting, and reward-guided aggregation, aiming to co-scale policy models and reward models during inference. These developments hold promise for more robust, general-purpose reward systems in LLMs.
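To make those trade-offs concrete, the three output shapes can be sketched as follows (a toy illustration in Python; the field names and values are invented for this sketch, not taken from the paper):

```python
# Scalar RM: one number per (query, response) pair -- fast, but the feedback is opaque.
scalar_reward = 0.73

# Pairwise RM: only a relative preference between two responses; no absolute score.
pairwise_preference = {"chosen": "response_a", "rejected": "response_b"}

# Generative RM (GRM): free-text principles and critique from which pointwise
# scores are parsed, so one model can judge any number of responses with a
# readable rationale.
grm_output = {
    "principles": ["factual accuracy", "clarity"],
    "critique": "Response A is accurate but omits a caveat; Response B is vague.",
    "scores": {"response_a": 8, "response_b": 5},
}
```

The pointwise generative form is what makes flexible input handling possible: the same output schema covers single responses, pairs, or larger candidate sets.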

DeepSeek-AI and Tsinghua University researchers explore enhancing reward models (RMs) for general queries by improving inference-time scalability using increased compute and better learning techniques. They employ a pointwise GRM for flexible input handling and propose a learning method, Self-Principled Critique Tuning (SPCT), which helps GRMs generate adaptive principles and accurate critiques during online reinforcement learning. They apply parallel sampling and introduce a meta RM to scale effectively and refine the voting process. Their DeepSeek-GRM models outperform existing benchmark methods, offering higher reward quality and scalability, with plans for open-sourcing despite challenges in some complex tasks.

The researchers introduce SPCT, a method designed to enhance pointwise GRMs by enabling them to generate adaptive principles and accurate critiques. SPCT consists of two stages: rejective fine-tuning, which initializes principle and critique generation, and rule-based RL for refinement. Instead of treating principles as preprocessing, they are generated dynamically during inference. This promotes scalability by improving reward granularity. Additionally, inference-time performance is boosted through parallel sampling and voting, supported by a meta reward model (meta RM) that filters out low-quality outputs. Overall, SPCT improves reward accuracy, robustness, and scalability in GRMs.
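The inference-time recipe described here (parallel sampling, meta-RM filtering, then vote-style aggregation of pointwise scores) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grm_sample` and `meta_rm_score` are hypothetical stand-ins for the trained GRM and meta RM.

```python
def scale_rewards_at_inference(grm_sample, meta_rm_score, query, responses, k=8, keep=4):
    """Sketch of parallel sampling + meta-RM filtering + voting.

    grm_sample(query, responses) -> {"principles": ..., "critique": ..., "scores": [...]}
    meta_rm_score(sample) -> float quality estimate for one sampled generation.
    Both callables are hypothetical placeholders for the trained models.
    """
    # 1) Draw k independent principle/critique/score generations (in practice, in parallel).
    samples = [grm_sample(query, responses) for _ in range(k)]

    # 2) The meta RM filters out low-quality generations, keeping the top `keep`.
    kept = sorted(samples, key=meta_rm_score, reverse=True)[:keep]

    # 3) Vote: sum the surviving pointwise scores per response.
    totals = [0] * len(responses)
    for sample in kept:
        for i, score in enumerate(sample["scores"]):
            totals[i] += score
    return totals
```

The design point is that quality scales with the number of samples `k` rather than with model size, and the meta RM keeps bad generations from polluting the vote.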

Using standard metrics, the study evaluates various RM methods across benchmarks including RewardBench, PPE, RMB, and ReaLMistake. DeepSeek-GRM-27B consistently outperforms baselines and rivals strong public models like GPT-4o. Inference-time scaling, especially with voting and the meta reward model, significantly boosts performance, achieving results comparable to much larger models. Ablation studies highlight the importance of components like principle generation and non-hinted sampling. Training-time scaling shows diminishing returns compared to inference-time strategies. Overall, DeepSeek-GRM, enhanced with SPCT and the meta RM, offers robust, scalable reward modeling with reduced domain bias and strong generalization.

In conclusion, the study presents SPCT, a method that improves inference-time scalability for GRMs through rule-based online reinforcement learning. SPCT enables adaptive principle and critique generation, enhancing reward quality across diverse tasks. DeepSeek-GRM models outperform several baselines and strong public models, especially when paired with a meta reward model for inference-time scaling. Using parallel sampling and flexible input handling, these GRMs achieve strong performance without relying on larger model sizes. Future work includes integrating GRMs into RL pipelines, co-scaling with policy models, and serving as reliable offline evaluators.


Check out the Paper. All credit for this research goes to the researchers of this project.



