cs.AI updates on arXiv.org, July 9, 12:02
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

This article examines reward learning in reinforcement learning and the problem of error-regret mismatch. It identifies distributional shift as the main cause of error-regret mismatch, shows that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error there still exist data distributions in which error-regret mismatch can occur. It also finds that policy regularization techniques do not fully resolve the problem, and calls for further research into improved methods for learning reward models and for assessing their quality.

arXiv:2406.15753v3 Announce Type: replace-cross

Abstract: In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
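
The qualitative mechanism is easy to reproduce in a toy setting. The sketch below uses hypothetical numbers and a one-step (bandit) simplification rather than the paper's MDP construction: a learned reward matches the true reward everywhere the data distribution has mass, so its expected test error is tiny, yet the policy that optimizes it incurs near-maximal regret. It then repeats the optimization with a KL penalty toward the data-generating policy, in the spirit of RLHF-style policy regularization, to show that a weak penalty does not remove the mismatch. The paper's formal results concern worst-case regret bounds over data distributions; this sketch only illustrates the mechanism.

```python
import numpy as np

# Hypothetical three-action bandit; a simplified illustration, not the
# construction used in the paper (which works with full MDPs).
true_reward = np.array([1.0, 0.8, 0.0])      # action 2 is genuinely bad
learned_reward = np.array([1.0, 0.8, 5.0])   # wrong only on action 2

# The data distribution puts almost no mass on the poorly modeled action,
# so the learned reward's expected test error is tiny.
data_dist = np.array([0.599, 0.4, 0.001])
test_error = data_dist @ np.abs(learned_reward - true_reward)
print(f"expected test error: {test_error:.3f}")          # 0.005

# Unregularized policy optimization shifts all probability onto the
# mis-modeled action (distributional shift), producing large regret.
greedy = int(np.argmax(learned_reward))
print(f"regret of greedy policy: {true_reward.max() - true_reward[greedy]:.2f}")  # 1.00

# KL-regularized optimization toward a reference policy (here taken to be
# the data distribution), as in RLHF. The optimum of
#   max_pi  E_pi[r_hat] - beta * KL(pi || pi_ref)
# is pi(a) proportional to pi_ref(a) * exp(r_hat(a) / beta).
for beta in (10.0, 1.0, 0.1):
    logits = np.log(data_dist) + learned_reward / beta
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    regret = true_reward.max() - pi @ true_reward
    print(f"beta={beta:5.1f}  policy={np.round(pi, 3)}  regret={regret:.3f}")
# As beta shrinks, the regularized policy also concentrates on the mis-modeled
# action, so low training error plus regularization still fails to bound regret.
```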

Related tags

Reinforcement learning, Reward learning, Error-regret mismatch, Distributional shift, Policy regularization