MarkTechPost@AI July 25, 2024
Researchers at Google Deepmind Introduce BOND: A Novel RLHF Method that Fine-Tunes the Policy via Online Distillation of the Best-of-N Sampling Distribution

Researchers at Google DeepMind have proposed an RLHF method called BOND, which fine-tunes the policy by online distillation of the Best-of-N sampling distribution. BOND is a distribution matching algorithm that aligns the policy's output with the Best-of-N distribution, replicating the performance of Best-of-N sampling without its extra computational cost. The method uses the Jeffreys divergence to balance mode-covering and mode-seeking behavior, and iteratively refines the policy via a moving-anchor approach. Experiments show that BOND, and in particular its variant J-BOND, outperforms other RLHF algorithms on abstractive summarization and with Gemma models, improving the KL-reward trade-off and benchmark performance.

🤔 BOND is a new RLHF method that fine-tunes the policy by online distillation of the Best-of-N sampling distribution, replicating the performance of Best-of-N sampling without its extra computational cost.

📈 BOND uses the Jeffreys divergence to balance mode-covering and mode-seeking behavior and iteratively refines the policy via a moving-anchor approach, improving the KL-reward trade-off and benchmark performance.

🚀 J-BOND is a practical implementation of the BOND algorithm; it improves practicality and efficiency by integrating Monte-Carlo quantile estimation, combining forward and backward KL divergence objectives, and using an iterative procedure with an exponential-moving-average anchor.

💡 The BOND method involves two main steps: first, it derives an analytical expression for the Best-of-N (BoN) distribution; second, it frames the task as a distribution matching problem aimed at aligning the policy with the BoN distribution.

📊 Experiments show that J-BOND outperforms traditional RLHF methods, achieving better performance even without a fixed regularization level.

🎯 BOND fine-tunes the policy by online distillation of the Best-of-N sampling distribution, moving the policy distribution closer to the Best-of-N distribution; its effectiveness is demonstrated in experiments on abstractive summarization and with Gemma models.

Reinforcement learning from human feedback (RLHF) is essential for ensuring quality and safety in LLMs. State-of-the-art LLMs like Gemini and GPT-4 undergo three training stages: pre-training on large corpora, supervised fine-tuning (SFT), and RLHF to refine generation quality. RLHF involves training a reward model (RM) based on human preferences and optimizing the LLM to maximize predicted rewards. This process is challenging because the model can forget pre-trained knowledge and is prone to reward hacking. A practical approach to enhancing generation quality is Best-of-N sampling, which selects the best output from N generated candidates, trading additional inference-time compute for higher reward.
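For intuition, here is a minimal sketch of Best-of-N sampling in Python; the `generate` and `reward_model` callables are placeholders for a policy sampler and a trained reward model, not code from the paper:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # draws one sample from the reference policy (placeholder)
    reward_model: Callable[[str, str], float],  # scores a (prompt, response) pair (placeholder)
    n: int = 16,
) -> str:
    """Draw n candidate responses and return the one with the highest reward.

    Inference cost grows linearly with n, which is exactly the overhead
    BOND aims to remove by distilling this selection step into the policy.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```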

Researchers at Google DeepMind have introduced Best-of-N Distillation (BOND), an innovative RLHF algorithm designed to replicate the performance of Best-of-N sampling without its high computational cost. BOND is a distribution matching algorithm that aligns the policy’s output with the Best-of-N distribution. Using Jeffreys divergence, which balances mode-covering and mode-seeking behaviors, BOND iteratively refines the policy through a moving anchor approach. Experiments on abstractive summarization and Gemma models show that BOND, particularly its variant J-BOND, outperforms other RLHF algorithms by enhancing KL-reward trade-offs and benchmark performance.

Best-of-N sampling optimizes language generation against a reward function but is computationally expensive. Recent studies have refined its theoretical foundations, provided reward estimators, and explored its connections to KL-constrained reinforcement learning. Various methods have been proposed to match the Best-of-N strategy, such as supervised fine-tuning on Best-of-N data and preference optimization. BOND introduces a novel approach using Jeffreys divergence and iterative distillation with a dynamic anchor to efficiently achieve the benefits of Best-of-N sampling. This method focuses on investing resources during training to reduce inference-time computational demands, aligning with principles of iterated amplification.

The BOND approach involves two main steps. First, it derives an analytical expression for the Best-of-N (BoN) distribution. Second, it frames the task as a distribution matching problem, aiming to align the policy with the BoN distribution. The analytical expression shows that BoN reweights the reference distribution, discouraging poor generations as N increases. The BOND objective seeks to minimize divergence between the policy and BoN distribution. The Jeffreys divergence, balancing forward and backward KL divergences, is proposed for robust distribution matching. Iterative BOND refines the policy by repeatedly applying the BoN distillation with a small N, enhancing performance and stability.
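Concretely, in the case without reward ties, this reweighting admits a simple closed form, and the BOND objective combines the two KL directions. The formulas below sketch that structure; the notation (F for the reward quantile, beta for the mixing weight) is chosen here for exposition and may differ from the paper's exact conventions:

```latex
% Best-of-N distribution: the reference policy reweighted by the reward
% quantile F(y|x), i.e. the probability that a fresh draw from pi_ref
% scores no higher than y (no-ties case).
\log \pi_{\mathrm{BoN}}(y \mid x)
  = \log \pi_{\mathrm{ref}}(y \mid x) + (N-1)\,\log F(y \mid x) + \log N

% BOND objective: the Jeffreys divergence, a weighted combination of the
% mode-covering forward KL and the mode-seeking backward KL, with beta
% trading off the two terms.
J_{\beta}\bigl(\pi \,\|\, \pi_{\mathrm{BoN}}\bigr)
  = (1-\beta)\,\mathrm{KL}\bigl(\pi_{\mathrm{BoN}} \,\|\, \pi\bigr)
  + \beta\,\mathrm{KL}\bigl(\pi \,\|\, \pi_{\mathrm{BoN}}\bigr)
```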

J-BOND is a practical implementation of the BOND algorithm designed for fine-tuning policies with minimal sample complexity. It iteratively refines the policy to align with the distribution of Best-of-2 samples using the Jeffreys divergence. The process involves generating samples, calculating gradients for the forward and backward KL components, and updating the policy weights. The anchor policy is updated using an Exponential Moving Average (EMA), which enhances training stability and improves the reward/KL trade-off. Experiments show that J-BOND outperforms traditional RLHF methods, demonstrating effectiveness and better performance without needing a fixed regularization level.
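A schematic of one such refinement step, written as a hedged Python sketch; the policy/anchor interfaces (`sample`, `log_prob`, `apply_gradient_step`, `backward_kl_estimate`, `update_ema`) and the weights `beta` and `ema_rate` are illustrative stand-ins rather than the paper's actual implementation:

```python
def jbond_step(policy, anchor, reward_model, prompts, beta=0.5, ema_rate=0.99):
    """One schematic J-BOND-style iteration: distill toward Best-of-2 of the anchor."""
    for x in prompts:
        # Draw two candidates from the anchor (reference) policy and keep the better one.
        y1, y2 = anchor.sample(x), anchor.sample(x)
        best = y1 if reward_model(x, y1) >= reward_model(x, y2) else y2

        # Forward-KL-style term (mode-covering): push the policy toward the better
        # of the two samples, which behaves like supervised fine-tuning on Best-of-2 data.
        forward_loss = -policy.log_prob(x, best)

        # Backward-KL-style term (mode-seeking): regularize the policy toward the
        # anchor's Best-of-2 distribution; left abstract in this sketch.
        backward_loss = policy.backward_kl_estimate(x, anchor, reward_model)

        # Combine the two KL directions with an illustrative weight beta.
        policy.apply_gradient_step((1 - beta) * forward_loss + beta * backward_loss)

    # Moving anchor: exponential moving average of the policy weights.
    anchor.update_ema(policy, rate=ema_rate)
```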

BOND is a new RLHF method that fine-tunes policies through the online distillation of the Best-of-N sampling distribution. The J-BOND algorithm enhances practicality and efficiency by integrating Monte-Carlo quantile estimation, combining forward and backward KL divergence objectives, and using an iterative procedure with an exponential moving average anchor. This approach improves the KL-reward Pareto front and outperforms state-of-the-art baselines. By emulating the Best-of-N strategy without its computational overhead, BOND aligns policy distributions closer to the Best-of-N distribution, demonstrating its effectiveness in experiments on abstractive summarization and Gemma models.
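As a final illustration, the Monte-Carlo quantile estimation mentioned above can be pictured with a small sketch: estimate F(y), the probability that a fresh anchor sample scores no higher than y, from a handful of anchor draws. The `anchor_sample` and `reward_model` callables are hypothetical, and the clipping choice is ours, not the paper's:

```python
from typing import Callable, List

def estimate_quantile(
    prompt: str,
    response: str,
    anchor_sample: Callable[[str], str],        # draws one sample from the anchor policy (placeholder)
    reward_model: Callable[[str, str], float],  # scores a (prompt, response) pair (placeholder)
    num_samples: int = 8,
) -> float:
    """Monte-Carlo estimate of F(y): the fraction of anchor samples whose reward
    does not exceed the reward of `response`. This estimate feeds the
    (N - 1) * log F(y) reweighting term when distilling the BoN distribution."""
    r_y = reward_model(prompt, response)
    draws: List[str] = [anchor_sample(prompt) for _ in range(num_samples)]
    below = sum(1 for d in draws if reward_model(prompt, d) <= r_y)
    # Clip away from zero so the log term stays finite.
    return max(below, 1) / num_samples
```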


Check out the Paper. All credit for this research goes to the researchers of this project.



Related tags

RLHF, BOND, Best-of-N, Jeffreys divergence, Online distillation