MarkTechPost@AI October 5, 2024
LASER: An Adaptive Method for Selecting Reward Models (RMs) and Iteratively Training LLMs Using Multiple RMs

LASER is a method for tackling the reward-model selection problem in large language model training. It frames RM selection as a multi-armed bandit problem and dynamically picks the most suitable RM, improving training efficiency and task performance and overcoming the limitations of existing approaches.

LASER treats reward model (RM) selection as a multi-armed bandit problem, dynamically choosing the most suitable RM for each task or instance during training instead of running multiple RMs simultaneously, thereby reducing computational overhead.

The method uses the LinUCB bandit algorithm to adapt RM selection based on task context and past performance. By optimizing RM selection at the instance level, LASER improves the efficiency and accuracy of LLM training.

LASER was evaluated on multiple benchmarks, including StrategyQA, GSM8K, and the WildChat dataset, covering reasoning, math, and instruction-following tasks, and achieved notable gains across the board.

One of the major challenges in aligning large language models (LLMs) with human preferences is the difficulty in selecting the right reward model (RM) to guide their training. A single RM may excel at tasks like creative writing but fail in more logic-oriented areas like mathematical reasoning. This lack of generalization leads to suboptimal performance and issues like reward hacking. At the same time, using multiple RMs simultaneously is computationally expensive and introduces conflicting signals. Overcoming these challenges is crucial for developing more adaptable and accurate AI systems capable of handling diverse real-world applications.

Current approaches either rely on a single RM or combine multiple RMs in an ensemble. Single RMs struggle to generalize across tasks, leading to poor performance, especially when encountering complex, multi-domain problems. Ensemble methods mitigate this but come with high computational costs and face difficulties in handling noisy or conflicting signals from the RMs. These limitations slow down model training and degrade overall performance, creating inefficiencies that hinder widespread, real-time applications.

The researchers from UNC Chapel Hill propose LASER (Learning to Adaptively Select Rewards), which frames RM selection as a multi-armed bandit (MAB) problem. Instead of loading and running multiple RMs simultaneously, LASER dynamically selects the most suitable RM for each task or instance during training. The method uses the LinUCB bandit algorithm, which adapts RM selection based on task context and past performance. By optimizing RM selection at the instance level, LASER reduces computational overhead while improving the efficiency and accuracy of LLM training across a diverse set of tasks, avoiding the reward-hacking problems seen in single-RM methods.
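To make the selection mechanism concrete, the sketch below shows a minimal LinUCB bandit in Python with one arm per candidate RM. It is an illustrative reconstruction, not the authors' implementation: the class name, the feature dimension, and the assumption that each instance is represented by a context vector and each arm returns a scalar reward are ours.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB bandit with one arm per candidate reward model (RM)."""

    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha
        # Per-arm ridge-regression statistics: A accumulates x x^T, b accumulates reward * x.
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def select(self, x: np.ndarray) -> int:
        """Return the index of the RM with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b  # estimated reward weights for this arm
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Fold the observed reward for the chosen RM back into that arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

The confidence term `alpha * sqrt(x A_inv x)` is what drives exploration: an RM that has rarely been tried in a given context region keeps a wide confidence bound and so still gets selected occasionally, while frequently chosen, well-performing RMs dominate exploitation.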

LASER operates by iterating through tasks, generating multiple responses from the LLM, and scoring them with the most appropriate RM selected by the MAB. Using the LinUCB algorithm, the MAB balances exploration (testing new RMs) and exploitation (using high-performing RMs). The method was tested on various benchmarks such as StrategyQA, GSM8K, and the WildChat dataset, covering reasoning, mathematical, and instruction-following tasks. LASER continuously adapts its RM selection process, leading to improved training efficiency and accuracy across these domains. The dynamic selection also enables better handling of noisy or conflicting RMs, resulting in more robust performance.
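Continuing the sketch above, the loop below illustrates how such a bandit could sit inside an iterative training pass: pick one RM per instance, score the LLM's candidate responses with it, keep the best response, and feed an observed reward back to the bandit. The prompts, toy reward models, and featurizer are stand-ins for demonstration only; in LASER the bandit's feedback would come from a downstream training signal rather than the RM's own score.

```python
import numpy as np

# Toy stand-ins (assumptions, not the authors' code): three "reward models" that
# score a response string, plus a trivial featurizer for the instance context.
reward_models = [lambda resp, w=w: w * float(len(resp)) for w in (0.5, 1.0, 1.5)]

def featurize(instance: str) -> np.ndarray:
    return np.array([len(instance), instance.count(" "), 1.0])

bandit = LinUCB(n_arms=len(reward_models), dim=3, alpha=1.0)  # class from the sketch above

for instance in ["2 + 2 = ?", "Name three planets."]:  # stand-in training prompts
    x = featurize(instance)
    arm = bandit.select(x)  # choose one RM for this instance
    candidates = [instance + " -> draft A", instance + " -> longer draft B"]  # stand-in LLM samples
    scores = [reward_models[arm](c) for c in candidates]
    best = candidates[int(np.argmax(scores))]  # keep the highest-scoring response
    # In LASER, the selected response would drive an LLM update and the bandit would
    # receive a downstream quality signal; here we simply feed back the RM score.
    bandit.update(arm, x, max(scores))
```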

The researchers demonstrated that LASER consistently enhanced LLM performance across several benchmarks. For reasoning tasks like StrategyQA and GSM8K, LASER improved average accuracy by 2.67% compared to ensemble methods. On instruction-following tasks, LASER achieved a 71.45% win rate, outperforming sequential RM selection. In long-context understanding tasks, LASER delivered substantial improvements, increasing F1 scores by 2.64 and 2.42 points in single- and multi-document QA tasks, respectively. Overall, LASER’s adaptive RM selection led to more efficient training, reduced computational complexity, and improved generalization across a wide range of tasks.

In conclusion, LASER represents a significant advancement in reward model selection for LLM training. By dynamically selecting the most appropriate RM for each instance, LASER improves both training efficiency and task performance across diverse benchmarks. This method addresses the limitations of single and ensemble RM approaches, offering a robust solution to optimize LLM alignment with human preferences. With its capacity to generalize across tasks and handle noisy rewards, LASER is poised to have a lasting impact on future AI development.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
