MarkTechPost@AI July 28, 2024
This Paper from Google DeepMind Presents Conditioned Language Policies (CLP): A Machine Learning Framework for Finetuning Language Models on Multiple Objectives

The Google DeepMind team has proposed Conditional Language Policy (CLP), a general multi-objective finetuning (MOFT) framework that finetunes language models using parameter-space conditioning and multi-task training. The approach is more steerable than prompt-based techniques because it draws on the parameter conditioning used in RS (Rewarded Soups). By finetuning over different reward weightings, CLP also produces higher-quality responses than zero-shot methods such as RS, while matching or improving steerability. Experiments show that CLP Pareto-dominates RS and is more controllable than prompt-based MOFT, and it maintains these advantages across a range of conditions, including different reward choices and model sizes.

🤖 The CLP framework uses parameter-space conditioning and multi-task training to finetune language models that balance different individual rewards effectively. Using a parameter-averaging approach, it learns a single set of parameters that can be turned into a conditioned language model for any given combination of reward and KL weightings.

💪 CLP Pareto-dominates RS and is easier to steer than prompt-based MOFT.

🧠 CLP is trained with multi-task learning over different weight combinations, maximizing the MOFT objective.

📈 CLP maintains these advantages across a range of conditions, including different reward choices and model sizes.

🚀 CLP offers a flexible MOFT framework that combines multi-task training with efficient parameter finetuning to create adaptable language models that balance different individual rewards effectively.

Reinforcement Learning (RL) finetuning is an important step in training language models (LMs) to behave in specific ways and adhere to human norms. In today’s applications, RL finetuning involves multiple goals, reflecting varied human preferences and use cases. Multi-objective finetuning (MOFT) is needed to train a multi-objective LM and overcome the limitations of single-objective finetuning (SOFT). For LMs, MOFT has been explored through prompt-based and parameter-based methods. Prompt-based methods finetune an LM by including the reward weightings in the prompt, but this approach can be less effective at steering the model and is sensitive to how the weightings are presented. Further, zero-shot MOFT can perform poorly on intermediate weightings that were not encountered during training.
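
One way to write the scalarized MOFT objective, for a given reward weighting and KL weight, is the standard RLHF-style formulation below; the notation is assumed here for illustration and may differ from the paper's exact statement.

```latex
% Scalarized MOFT objective for reward weighting w = (w_1, ..., w_k) and KL weight \lambda.
% Notation is an illustrative assumption, not taken verbatim from the paper.
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D}}\!\left[
  \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\Big[ \textstyle\sum_{i=1}^{k} w_i\, r_i(x, y) \Big]
  \;-\; \lambda\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)
\right]
```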

The two main techniques for multi-reward alignment (MOFT) are prompt-based and parameter-based conditioning. Prompt-based conditioning includes approaches like Personalized Soups (PS), which uses custom prompts to personalize language models (LMs) based on binary weights for different rewards. Rewarded Soups (RS) offers a zero-shot method that averages, at inference time, the parameters of LMs trained independently on individual rewards. A recent paper embeds reward weightings as singular values within the AdaLoRA framework. For KL realignment, decoding-time realignment (DeRa) linearly mixes logits between π_ref and another LM learned through SOFT with the minimum KL weight.
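
To make the zero-shot parameter-averaging idea behind Rewarded Soups concrete, here is a minimal PyTorch-style sketch. The function name, the example policies, and the choice of a plain convex combination over all parameters are illustrative assumptions, not the authors' released code.

```python
import torch

def rewarded_soups_merge(state_dicts, weights):
    """Zero-shot merge of LMs finetuned on individual rewards.

    state_dicts: list of model state_dicts, one per single-reward policy
                 (all sharing the same architecture).
    weights: reward weightings w_i >= 0 summing to 1, chosen at inference time.
    Returns a state_dict whose parameters are the convex combination
    sum_i w_i * theta_i, which serves as the multi-objective policy.
    """
    assert len(state_dicts) == len(weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: interpolate between a "helpfulness" policy and a "brevity" policy.
# merged_sd = rewarded_soups_merge([policy_a.state_dict(), policy_b.state_dict()], [0.7, 0.3])
# policy_a.load_state_dict(merged_sd)
```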

A team from Google DeepMind has proposed a general MOFT framework called Conditional Language Policy (CLP) that uses parameter-space conditioning and multi-task training. This method is more steerable than purely prompt-based techniques because it uses the parameter conditioning from RS. Moreover, CLP produces higher-quality responses than zero-shot methods like RS by finetuning on different reward weightings, while matching or improving steerability. The team conducted a series of experiments and found that CLP Pareto-dominates RS and is more controllable than prompt-based MOFT. It consistently maintains these advantages across various conditions, including different reward choices and model sizes.

The proposed method, CLP, learns a set of parameters that can be processed, via parameter averaging, into a conditioned language model (LM) for any given weighting across rewards and the KL term. The learning algorithm samples a range of weightings to improve the Pareto front for all weightings at once; this amounts to multi-task learning across different weightings, maximizing the MOFT objective. An automated evaluation with Gemini 1.0 Ultra shows that CLP is more adaptable and generates better responses than existing baselines. The team also proposed new theory showing that zero-shot methods can be nearly Pareto-optimal when the optimal policies for the individual rewards are aligned.
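
Concretely, the training loop can be pictured as: sample a weighting, form the conditioned parameters as a convex combination of per-reward parameter sets, and take a gradient step on the corresponding scalarized objective. The toy sketch below uses a categorical policy with made-up reward tables so it runs end to end; all names, the exact conditioning rule, and the sampling choices are illustrative assumptions rather than the paper's implementation.

```python
import torch

torch.manual_seed(0)
num_actions, num_rewards = 5, 2

# Per-reward "expert" parameters (raw logit vectors standing in for LM weights),
# learned jointly rather than trained separately and merged afterwards.
experts = torch.nn.ParameterList(
    [torch.nn.Parameter(torch.zeros(num_actions)) for _ in range(num_rewards)]
)
ref_logits = torch.zeros(num_actions)                                   # frozen reference policy
reward_tables = [torch.randn(num_actions) for _ in range(num_rewards)]  # toy reward functions
optimizer = torch.optim.Adam(experts.parameters(), lr=0.05)

for step in range(500):
    # Multi-task training: sample a reward weighting w and a KL weight lambda.
    w = torch.distributions.Dirichlet(torch.ones(num_rewards)).sample()
    lam = 0.1 + 0.9 * torch.rand(())
    # Parameter-space conditioning: convex combination of the expert parameters.
    logits = sum(wi * e for wi, e in zip(w, experts))
    policy = torch.distributions.Categorical(logits=logits)
    ref = torch.distributions.Categorical(logits=ref_logits)
    # Scalarized MOFT objective: weighted expected reward minus lambda * KL(policy || ref).
    expected_reward = sum(wi * (policy.probs * r).sum() for wi, r in zip(w, reward_tables))
    objective = expected_reward - lam * torch.distributions.kl_divergence(policy, ref)
    optimizer.zero_grad()
    (-objective).backward()
    optimizer.step()
```

After training, plugging any weighting into the same convex combination yields a policy for that trade-off without further finetuning, which is the steerability property the paper emphasizes.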

The benchmarking covered the following setups: Single Reward with Multiple KL Regularizers, Two Rewards with a Fixed KL Regularizer, and Three Rewards with a Fixed KL Regularizer. In the single-reward setup, CLP is twice as computationally efficient as DeRa during inference, because DeRa makes two LM calls per token. Multi-task training helps the method improve over the zero-shot RS baseline in terms of performance. Also, full-CLP and attn-CLP maintain a more spread-out and steerable Pareto front than logit-CLP and the prompting baseline. In sum, attn-CLP offers a good balance between Pareto-front quality and steerability while using fewer parameters than current baselines.
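
The factor of two follows directly from the per-token decoding cost: DeRa mixes the logits of two models at every step, while CLP folds the weighting into a single set of parameters before decoding. The sketch below assumes HuggingFace-style causal LMs exposing a `.logits` output field; the function names, the greedy decoding, and the linear mixing rule are illustrative assumptions based on the description above.

```python
import torch

@torch.no_grad()
def dera_next_token(ref_model, aligned_model, input_ids, mix):
    """DeRa-style decoding: two forward passes per token (reference LM and
    SOFT-finetuned LM), with the logits linearly mixed before selecting a token."""
    ref_logits = ref_model(input_ids).logits[:, -1, :]
    aligned_logits = aligned_model(input_ids).logits[:, -1, :]
    mixed_logits = (1.0 - mix) * ref_logits + mix * aligned_logits
    return torch.argmax(mixed_logits, dim=-1)  # greedy choice, for simplicity

@torch.no_grad()
def clp_next_token(conditioned_model, input_ids):
    """CLP-style decoding: the reward/KL weighting is already baked into the
    parameters, so each token costs a single forward pass."""
    logits = conditioned_model(input_ids).logits[:, -1, :]
    return torch.argmax(logits, dim=-1)
```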

In this paper, a team from Google DeepMind introduced Conditional Language Policy (CLP), a flexible framework for MOFT that uses multi-task training and efficient parameter finetuning to create adaptable language models (LMs) that balance different individual rewards effectively. The paper includes extensive benchmarking and ablation studies to understand the factors that help develop steerable LMs within the CLP framework. The team also provided theoretical results that explain when zero-shot approaches suffice and why multi-task training is needed for near-optimal behavior. Future work includes other conditioning mechanisms such as soft tokens, automated tuning of the weight-sampling distribution, and non-linear reward scalarization.


Check out the Paper. All credit for this research goes to the researchers of this project.



Tags: Conditional Language Policy, Multi-Objective Finetuning, Language Models, Parameter-Space Conditioning, Multi-Task Training