MarkTechPost@AI · January 28
Test-Time Preference Optimization: A Novel AI Framework that Optimizes LLM Outputs During Inference with an Iterative Textual Reward Policy

This article introduces Test-Time Preference Optimization (TPO), a framework from Shanghai AI Laboratory that aligns large language model (LLM) outputs with human preferences at inference time. TPO treats preference optimization as an online, on-policy learning process in which the model iteratively refines its outputs through interaction with a reward model. Instead of numerical scores, it uses interpretable textual feedback, converting reward signals into textual critiques and deriving improvement suggestions from them. Experiments show that a few TPO optimization steps markedly improve both aligned and unaligned models, and can even lift an unaligned model above models optimized with conventional training. The framework requires no retraining and offers high scalability and flexibility.

🚀 Core of the TPO framework: TPO is an online, on-policy learning paradigm that continuously refines the LLM's outputs at inference time through interaction with a reward model, bringing them closer to human preferences.

📝 Textual feedback mechanism: instead of conventional numerical scoring, TPO uses interpretable textual feedback, converting reward signals into textual critiques from which the model generates improvement suggestions.

📈 Significant performance gains: experiments show that a few TPO optimization steps substantially improve both aligned and unaligned LLMs, and an unaligned model optimized with TPO can even surpass a model optimized through conventional training.

⚙️ No retraining required: TPO performs its optimization at inference time, avoiding costly retraining and weight updates and offering greater flexibility and scalability.

Large Language Models (LLMs) have become an indispensable part of contemporary life, shaping the future of nearly every conceivable domain. They are widely acknowledged for their impressive performance across tasks of varying complexity. However, instances have arisen where LLMs have been criticized for generating unexpected and unsafe responses. Consequently, ongoing research aims to align LLMs more closely with human preferences while fully leveraging their extensive training data.

Methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have proven effective. However, they still require iterative training, which is often impractical. Researchers are therefore focusing on modifying inference approaches to match the performance of training-based optimization methods. This article explores the latest research that enhances human preference alignment during inference time.

Researchers from Shanghai AI Laboratory have introduced Test-Time Preference Optimization (TPO), a novel framework designed to align LLM outputs with human preferences during inference. This framework can be conceptualized as an online, on-policy learning paradigm, where the policy model continuously interacts with a novel reward model to refine its outputs.

TPO incorporates a mechanism that leverages interpretable textual feedback for preference optimization instead of conventional numerical scoring. To achieve this, the authors translate reward signals into textual rewards in the form of critiques. The model then generates improvement suggestions based on these transformed rewards and updates its outputs to align with the signals at test time.
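
As a rough illustration of this idea, the sketch below shows how a preference between two candidate responses might be re-expressed as a textual critique that the policy model can act on. The prompt wording and the `generate` interface are illustrative assumptions, not the authors' released implementation.

```python
from typing import Callable

# Illustrative sketch only: recasting a preference signal as a textual
# critique ("textual reward"). `generate` is any function that samples
# text from the policy LLM; it is an assumed interface.

def textual_critique(
    generate: Callable[[str], str],
    query: str,
    chosen: str,
    rejected: str,
) -> str:
    """Ask the policy model to verbalize why the higher-scoring response
    beats the lower-scoring one; this plays the role of a textual loss."""
    prompt = (
        f"Query:\n{query}\n\n"
        f"Better response:\n{chosen}\n\n"
        f"Worse response:\n{rejected}\n\n"
        "Describe the strengths of the better response and the "
        "shortcomings of the worse one."
    )
    return generate(prompt)
```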

During testing, the newly generated responses are scored at each inference-time optimization step, and the responses at the extreme ends of the quality range are labeled as "chosen" or "rejected" outputs. The model then draws on the strengths of the best, or "chosen," outputs and the shortfalls of the rejected responses to compile a "textual loss," and from it generates suggestions, or "textual gradients," for the next iteration. TPO thus improves the output iteratively through interaction with textual rewards.
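
Putting these pieces together, one TPO-style iteration might look like the hypothetical loop below, which reuses the `textual_critique` helper sketched above. The candidate count, prompt wording, and `generate`/`score` interfaces are assumptions for illustration, not the paper's actual code.

```python
from typing import Callable, List

def tpo_step(
    generate: Callable[[str], str],      # samples text from the policy LLM (assumed interface)
    score: Callable[[str, str], float],  # reward model: (query, response) -> scalar (assumed interface)
    query: str,
    num_samples: int = 5,                # illustrative choice
) -> str:
    """One illustrative inference-time optimization step."""
    # 1. Sample candidate responses from the current policy.
    candidates: List[str] = [generate(query) for _ in range(num_samples)]

    # 2. Score candidates and take the extremes as "rejected" / "chosen".
    ranked = sorted(candidates, key=lambda r: score(query, r))
    rejected, chosen = ranked[0], ranked[-1]

    # 3. "Textual loss": a critique contrasting chosen vs. rejected.
    critique = textual_critique(generate, query, chosen, rejected)

    # 4. "Textual gradient": concrete suggestions for the next attempt.
    suggestions = generate(
        f"Query:\n{query}\n\nCritique of previous answers:\n{critique}\n\n"
        "Give concrete suggestions for a better answer."
    )

    # 5. Regenerate, conditioning on the suggestions.
    return generate(
        f"Query:\n{query}\n\nWrite an improved answer, following these "
        f"suggestions:\n{suggestions}"
    )
```

Running `tpo_step` a few times in succession plays the role of the "few TPO optimization steps" described in the experiments below.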

The authors used both aligned and unaligned policy models to validate the concept, distinguished by whether the model had undergone preference optimization during training. The two key models in the study were Llama-3.1-70B-SFT, an unaligned model that did not undergo preference optimization during training, and Llama-3.1-70B-Instruct, an aligned model trained with preference optimization. Additionally, the experiments spanned several datasets to evaluate instruction following, preference alignment, safety, and mathematical reasoning.

Results from these experiments confirmed that a few TPO optimization steps significantly improved performance in both aligned and unaligned models. When comparing TPO-based inference optimization with traditional training-based optimization, the researchers found that the unaligned Llama-3.1-70B-SFT model outperformed its aligned counterpart, Llama-3.1-70B-Instruct, after undergoing a few TPO epochs. Furthermore, applying TPO to an aligned model with as few as 22 billion parameters achieved an LC score of 53.4% and a WR score of 72.2%.

Conclusion: The research team introduced TPO, an online, on-policy learning framework that aligns LLM outputs with human preferences. The framework optimizes responses at inference time and eliminates the need for retraining and weight updates. Additionally, TPO offers high scalability and flexibility, making it a promising approach for future work on LLMs.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


