MarkTechPost@AI · July 4, 09:20
Crome: Google DeepMind’s Causal Framework for Robust Reward Modeling in LLM Alignment

Crome is a causally robust reward modeling framework proposed by Google DeepMind to address reward hacking in LLM alignment. By building an explicit causal model and generating synthetic training data with causal and neutral augmentations, the framework enables reward models to distinguish genuine quality drivers from superficial cues. Experimental results show that Crome performs strongly on both RewardBench and reWordBench, improving LLM safety, reasoning, and robustness, and pointing to a new direction for future research on LLM alignment.

💡Reward hacking: existing reward models are easily swayed by superficial attributes such as response length or formatting rather than genuine quality indicators like factuality and relevance, which makes them brittle and leads to misaligned policies.

✨The Crome framework: built on an explicit causal model of the LLM generation process, it trains reward models on synthetic training data that includes both causal and neutral augmentations.

🌟Synthetic data augmentation: Crome uses two types of synthetic training pairs: (a) causal augmentations, which vary specific causal attributes such as factuality to strengthen sensitivity to genuine quality shifts, and (b) neutral augmentations, which enforce invariance to spurious attributes such as style.

🚀Performance gains: Crome improves ranking accuracy on RewardBench, with notable gains in Safety (up to 13.18%) and Reasoning (up to 7.19%). It also shows stronger robustness on reWordBench and improved safety on WildGuardTest.

Reward models are fundamental components for aligning LLMs with human feedback, yet they are prone to reward hacking. These models often latch onto superficial attributes such as response length or formatting rather than identifying true quality indicators like factuality and relevance. This happens because standard training objectives fail to differentiate between spurious correlations present in the training data and the genuine causal drivers of response quality. The failure to separate these factors produces brittle reward models (RMs) that in turn yield misaligned policies. What is needed is a method that uses a causal understanding of preference formation to train RMs that are sensitive to causal quality attributes and invariant to a variety of spurious cues.

Limitations of Existing RM Approaches and the Need for Causal Robustness

Existing methods try to mitigate reward hacking in standard RLHF systems that rely on Bradley-Terry or pairwise ranking objectives. These include architectural modifications such as Odin, policy-level adjustments, and data-centric methods involving ensembles or consistency checks. Recent causally inspired methods use MMD regularization against pre-specified spurious factors or estimate causal effects through corrected rewrites. However, these approaches target only predetermined spurious factors and miss unknown correlates; their augmentation strategies remain coarse, and evaluation-focused methods fail to equip reward models with robust training mechanisms against diverse spurious variations.
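For context, the standard Bradley-Terry objective mentioned above reduces to a logistic loss on the reward gap between the chosen and rejected response. A minimal PyTorch sketch (the function name and toy tensors are illustrative, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry objective: maximize the log-probability that the
    chosen response outscores the rejected one. Nothing in this objective says
    *why* the chosen response should score higher, so spurious cues (length,
    formatting) that correlate with the labels are rewarded just as readily as
    factuality or relevance."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage with hypothetical scalar rewards from an RM head
loss = bradley_terry_loss(torch.tensor([1.2, 0.3, 0.9]),
                          torch.tensor([0.4, 0.5, -0.1]))
```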

Introducing Crome: Causally Robust Reward Modeling for LLMs

Researchers from Google DeepMind, McGill University, and MILA – Quebec AI Institute have proposed Crome (Causally Robust Reward Modeling), a framework built on an explicit causal model of answer generation. Crome trains RMs to differentiate genuine quality drivers from superficial cues by augmenting preference datasets with targeted, LLM-generated counterfactual examples. It creates two types of synthetic training pairs: (a) Causal Augmentations, which introduce changes along specific causal attributes such as factuality to enforce sensitivity to true quality shifts, and (b) Neutral Augmentations, which enforce invariance along spurious attributes such as style using tie labels. Crome improves robustness, increasing RewardBench accuracy by up to 4.5% while strengthening safety and reasoning.
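One way to picture the two pair types is as labeled records in the augmented preference set. A minimal sketch under an assumed schema (the field names and toy rewrites are illustrative, not taken from the paper):

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class PreferencePair:
    prompt: str
    response_a: str
    response_b: str
    label: Literal["a", "b", "tie"]   # "tie" marks neither response as preferred
    source: Literal["original", "causal_aug", "neutral_aug"]

# Causal augmentation: an LLM rewrite degrades a causal attribute (here, factuality),
# so the corrupted version is strictly dis-preferred.
causal_pair = PreferencePair(
    prompt="When was the transistor invented?",
    response_a="The transistor was invented at Bell Labs in 1947.",
    response_b="The transistor was invented at Bell Labs in 1962.",  # injected factual error
    label="a",
    source="causal_aug",
)

# Neutral augmentation: the rewrite changes only a spurious attribute (here, phrasing),
# and the tie label pushes the reward model toward invariance to that cue.
neutral_pair = PreferencePair(
    prompt="When was the transistor invented?",
    response_a="The transistor was invented at Bell Labs in 1947.",
    response_b="In 1947, researchers at Bell Labs invented the transistor.",
    label="tie",
    source="neutral_aug",
)
```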

Technical Approach: Counterfactual Augmentation and Composite Loss Optimization

Crome operates in two main phases: generating attribute-aware counterfactual data based on a causal model, and training the reward model with a specialized loss on the combined data. The paper also provides a theoretical analysis of how causal augmentation isolates true reward drivers from spurious correlates under an idealized model. Crome uses the UltraFeedback dataset with counterfactuals generated by Gemini 2.0 Flash, and evaluates performance on RewardBench and reWordBench. The researchers experiment with diverse base LLMs, including Gemma-2-9B-IT, Qwen2.5-7B, and Gemma-2-2B, for both Pairwise Preference (PairPM) and Bradley-Terry reward models, and measure downstream alignment impact through Best-of-N selection on multiple tasks.
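The article does not spell out the specialized loss, but a plausible reading is a composite of the pairwise preference loss on original plus causally augmented pairs and a tie term on neutrally augmented pairs. A hedged sketch (the squared-gap tie term and its weighting are assumptions, not necessarily the paper's formulation):

```python
import torch
import torch.nn.functional as F

def composite_rm_loss(r_chosen, r_rejected, r_tie_a, r_tie_b, tie_weight=1.0):
    """Hypothetical composite objective over the combined dataset:
    - a Bradley-Terry loss on pairs with a clear winner (original + causal augmentations)
    - a tie term on neutral augmentations that penalizes any reward gap, discouraging
      the model from rewarding spurious rewrites in either direction."""
    bt_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    tie_loss = (r_tie_a - r_tie_b).pow(2).mean()
    return bt_loss + tie_weight * tie_loss
```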

Performance Gains: From RewardBench to WildGuardTest

On RewardBench, Crome improves ranking accuracy over the RRM baseline across diverse base models, with significant gains in the Safety (up to 13.18%) and Reasoning (up to 7.19%) categories. On reWordBench, it achieves aggregate accuracy gains of up to 9.1% with Gemma-2-9B-IT in the PairPM setting and superior performance on 21 of 23 transformations, along with a smaller drop in ranking accuracy from RewardBench to reWordBench than RRM (19.78% versus 21.54%). Crome also delivers clear safety improvements on WildGuardTest with Best-of-N selection, achieving lower attack success ratios on harmful prompts while maintaining similar refusal rates on benign prompts.
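Best-of-N selection, used for the WildGuardTest evaluation above, simply reranks sampled candidates by reward score. A minimal sketch (the `reward_model` interface is an assumption, not the paper's API):

```python
import torch

def best_of_n(prompt: str, candidates: list[str], reward_model) -> str:
    """Score each sampled candidate with the trained reward model and return
    the highest-scoring response. `reward_model` is assumed to map a
    (prompt, response) pair to a scalar score."""
    scores = torch.tensor([reward_model(prompt, c) for c in candidates])
    return candidates[int(scores.argmax())]
```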

Conclusion and Future Directions in Causal Data Augmentation

In conclusion, the researchers introduced Crome, a causal framework that addresses reward hacking during RM training. It employs two targeted synthetic data augmentation strategies: Causal Augmentations and Neutral Augmentations. Crome outperforms strong baselines across multiple base models and reward modeling techniques on RewardBench, and shows superior robustness to spurious correlations on reWordBench. This dataset-curation-centered training method opens new research directions in synthetic data generation for base model training, where causal attribute verification could prove highly beneficial for future developments in robust language model alignment.



