MarkTechPost@AI, August 8, 2024
Meta-Rewarding LLMs: A Self-Improving Alignment Technique Where the LLM Judges Its Own Judgements and Uses the Feedback to Improve Its Judgment Skills

Researchers from Meta FAIR, the University of California, Berkeley, and New York University have proposed a new method called Meta-Rewarding to strengthen the instruction-following ability of large language models (LLMs). By introducing a meta-judge that evaluates and selects the judgments used for preference optimization, the method addresses the limitations of earlier Self-Rewarding frameworks and trains the judge directly. It also includes a novel length-control technique to counter length explosion during AI-feedback training. The resulting model's judging ability aligns more closely with that of human judges and advanced AI judges such as GPT-4.

😄 **Overview of Meta-Rewarding** Meta-Rewarding improves on the existing actor and judge roles by adding a new role, the meta-judge, to strengthen the instruction-following ability of large language models (LLMs). The meta-judge evaluates the model's judgments using a mechanism similar to LLM-as-a-Judge, called LLM-as-a-Meta-Judge. This process produces training data containing preference pairs over judgments, in addition to the standard preferences between actor responses. By improving both acting and judging skills, Meta-Rewarding enhances the model's overall instruction-following ability.

🤔 **How Meta-Rewarding is implemented** Meta-Rewarding is built on the instruction-finetuned Llama-3-8B-Instruct model, which serves as the seed model. The researchers performed supervised finetuning (SFT) on the Evaluation Fine-Tuning (EFT) dataset, which contains ranked human responses from Open Assistant; this step strengthens the model's ability to act as a judge. Meta-Rewarding then iterates over 20K prompts generated by Llama-2-70B-Chat, sampling 5K prompts from this set in each of four iterations. This iterative approach improves the model's performance in both the actor and judge roles.

🚀 **What Meta-Rewarding achieves** Evaluations of Meta-Rewarding show the length-controlled win rate on AlpacaEval rising from 22.9% to 39.4%, surpassing even GPT-4-0314. The method also outperforms an enhanced standard Self-Rewarding baseline (35.5% win rate), underscoring the importance of the meta-judge. A similar gain appears on the Arena-Hard benchmark, which tests a model's ability to handle complex questions: over four iterations, Meta-Rewarding consistently improved the score, an 8.5-point gain over the seed model's 20.6%. These results show that Meta-Rewarding strengthens LLMs' ability to follow instructions and answer complex questions.

🧐 **Limitations of Meta-Rewarding** Despite these strong results, the researchers note a limitation: their 5-point judging scale sometimes produces ties because differences in response quality can be small. Future research needs to explore finer-grained judging methods to reduce such ties.

Large Language Models (LLMs) have made significant progress in following instructions and responding to user queries. However, the current instruction-tuning process faces major challenges: acquiring human-generated training data is expensive and time-consuming, and the quality of such data is limited by human capabilities. This limitation is especially evident in the ‘Super Alignment’ challenge, which aims to control potentially super-intelligent AIs whose actions may exceed human comprehension. As LLMs continue to advance, the field needs effective methods to guide their development beyond human-level performance.

Researchers have explored various methods to align LLMs with human values. One popular approach is Reinforcement Learning from Human Feedback (RLHF), which trains a reward model on human preference data and then optimizes the policy with Proximal Policy Optimization (PPO). A second approach, LLM-as-a-Judge, has gained popularity for evaluation and for reward-model training. However, these methods often rely on human data or on input from stronger models, which limits their usefulness for super-alignment challenges. A third line of work aimed at super alignment, including Constitutional AI and CriticGPT, uses AI-generated feedback but still struggles to train both the actor and the judge components during self-improvement.

Researchers from Meta FAIR, the University of California, Berkeley, and New York University have introduced a new method called Meta-Rewarding to improve the instruction-following abilities of LLMs. This method adds a third role, the meta-judge, to the existing actor and judge roles. The meta-judge evaluates the model’s judgments using a mechanism similar to LLM-as-a-Judge, called LLM-as-a-Meta-Judge. This process generates training data containing preference pairs over judgments, in addition to the standard preferences between actor responses. Meta-Rewarding enhances the model’s overall instruction-following capability by improving both its acting and judging skills.
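
To make the three roles concrete, the sketch below shows one plausible shape of a Meta-Rewarding data-generation step in Python. The prompt templates, the `parse_score` helper, and the way preference pairs are assembled are illustrative assumptions for exposition, not the exact procedure from the paper; `llm` stands in for the single model playing the actor, judge, and meta-judge roles.

```python
import re
from typing import Callable, List, Tuple

def parse_score(judgment: str) -> float:
    """Pull the first number out of a written judgment (illustrative parser)."""
    match = re.search(r"\d+(\.\d+)?", judgment)
    return float(match.group()) if match else 0.0

def meta_rewarding_pairs(llm: Callable[[str], str], prompt: str, n: int = 4):
    """Generate one actor preference pair and one judgment preference pair."""
    # Actor role: sample several candidate responses to the prompt.
    responses = [llm(f"Instruction: {prompt}\nWrite a helpful response.") for _ in range(n)]

    # Judge role (LLM-as-a-Judge): score each response with a written judgment.
    judge_template = ("Review the response and assign a score out of 5, with reasoning.\n"
                      "Instruction: {p}\nResponse: {r}")
    judgments = [llm(judge_template.format(p=prompt, r=r)) for r in responses]
    scores = [parse_score(j) for j in judgments]

    # Standard actor preference pair: best-scored response over worst-scored one.
    best = max(range(n), key=lambda i: scores[i])
    worst = min(range(n), key=lambda i: scores[i])
    actor_pair = (responses[best], responses[worst])

    # Meta-judge role (LLM-as-a-Meta-Judge): compare two judgments of the same
    # response and keep a preference pair over the judgments themselves.
    rival_judgment = llm(judge_template.format(p=prompt, r=responses[best]))
    verdict = llm("Which judgment evaluates the response more accurately? Answer A or B.\n"
                  f"Instruction: {prompt}\nResponse: {responses[best]}\n"
                  f"Judgment A: {judgments[best]}\nJudgment B: {rival_judgment}")
    if verdict.strip().upper().startswith("A"):
        judgment_pair = (judgments[best], rival_judgment)
    else:
        judgment_pair = (rival_judgment, judgments[best])

    return actor_pair, judgment_pair
```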

The Meta-Rewarding method is built on the instruction-finetuned Llama-3-8B-Instruct model, which serves as the seed model. The researchers first performed supervised finetuning (SFT) on the Evaluation Fine-Tuning (EFT) dataset, which contains ranked human responses from Open Assistant; this step strengthens the model’s ability to act as a judge. Meta-Rewarding training then runs for four iterations over 20K prompts generated by Llama-2-70B-Chat, sampling 5K prompts from this set in each iteration. This iterative approach improves the model’s performance in both the actor and judge roles. The experimental setup closely follows previous work, adapted to include the meta-judge component for self-improvement.
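
The iteration schedule lends itself to a short sketch. The loop below mirrors the numbers given above (a 20K-prompt pool, 5K prompts sampled per iteration, four iterations); `build_pairs` and `train_step` are hypothetical callables, the latter standing in for the preference-optimization update over both response and judgment pairs, whose exact form is not specified here.

```python
import random

POOL_SIZE, PROMPTS_PER_ITER, NUM_ITERATIONS = 20_000, 5_000, 4

def run_meta_rewarding(seed_model, prompt_pool, build_pairs, train_step):
    """Iterative Meta-Rewarding schedule (interfaces of the callables are assumed).

    seed_model:  the SFT'd, judge-capable seed model
    prompt_pool: the 20K prompts generated by Llama-2-70B-Chat
    build_pairs: (model, prompt) -> (actor_pair, judgment_pair), e.g. the sketch above
    train_step:  (model, actor_pairs, judgment_pairs) -> updated model
    """
    assert len(prompt_pool) == POOL_SIZE
    model = seed_model
    for _ in range(NUM_ITERATIONS):
        prompts = random.sample(prompt_pool, PROMPTS_PER_ITER)  # 5K prompts per iteration
        actor_pairs, judgment_pairs = [], []
        for p in prompts:
            a_pair, j_pair = build_pairs(model, p)
            actor_pairs.append(a_pair)
            judgment_pairs.append(j_pair)
        # One update trains on both kinds of preferences, so acting and
        # judging skills improve together from iteration to iteration.
        model = train_step(model, actor_pairs, judgment_pairs)
    return model
```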

On evaluation, Meta-Rewarding raises the length-controlled win rate on AlpacaEval from 22.9% to 39.4%, outperforming even GPT-4-0314. It also beats the enhanced standard Self-Rewarding baseline, which reaches a 35.5% win rate, highlighting the importance of the meta-judge. A similar improvement appears on the Arena-Hard benchmark, which tests a model’s ability to handle complex questions: over four iterations, Meta-Rewarding consistently improved the score, an 8.5-point gain over the seed model’s 20.6%. These results show that Meta-Rewarding strengthens LLMs’ ability to follow instructions and answer complex questions.

In conclusion, the researchers proposed Meta-Rewarding, a new method to enhance the instruction-following abilities of LLMs. It uses a meta-judge to evaluate and select judgments for preference optimization, addressing a limitation of earlier Self-Rewarding frameworks by directly training the judge. It also includes a novel length-control technique to counter length explosion during AI-feedback training. The resulting model’s judgments align more closely with those of human judges and advanced AI judges such as GPT-4. The researchers acknowledge a limitation of their 5-point judging scale, which occasionally produces ties when differences in response quality are minimal.
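
The length-control idea can be illustrated with a small heuristic. The selection rule below, which prefers the shortest response among those scoring within a small margin of the best, is an assumed stand-in for the paper's mechanism rather than its exact formulation; it conveys why such a rule discourages length explosion during AI-feedback training.

```python
def select_chosen(responses, scores, margin=0.1):
    """Length-controlled selection (illustrative): among responses whose score
    is within `margin` of the best score, prefer the shortest one, so that
    verbosity alone cannot win the preference pair."""
    best = max(scores)
    near_best = [r for r, s in zip(responses, scores) if s >= best - margin]
    return min(near_best, key=len)

# Example: the slightly lower-scoring but much shorter answer is chosen.
print(select_chosen(
    ["A concise, correct answer.",
     "A much longer answer that pads the same content with extra detail " * 3],
    [4.8, 4.9]))
```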


Check out the Paper. All credit for this research goes to the researchers of this project.

