cs.AI updates on arXiv.org, July 9, 12:02
The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

This article examines reward learning in reinforcement learning and the problem of error-regret mismatch. It identifies distributional shift as the main cause of error-regret mismatch, shows that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error there still exist data distributions in which error-regret mismatch can occur. It also finds that policy regularization techniques do not fully resolve the problem, and calls for further research into improved methods for learning reward models and for assessing their quality.

arXiv:2406.15753v3 Announce Type: replace-cross

Abstract: In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low error on the data distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distributional shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. We hope our results stimulate the theoretical and empirical study of improved methods to learn reward models, and better ways to measure their quality reliably.
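
The qualitative mechanism is easy to reproduce in a toy setting. The sketch below uses hypothetical numbers and a one-step (bandit) simplification rather than the paper's MDP construction: a learned reward matches the true reward everywhere the data distribution has mass, so its expected test error is tiny, yet the policy that optimizes it incurs near-maximal regret. It then repeats the optimization with a KL penalty toward the data-generating policy, in the spirit of RLHF-style policy regularization, to show that a weak penalty does not remove the mismatch. The paper's formal results concern worst-case regret bounds over data distributions; this sketch only illustrates the mechanism.

```python
import numpy as np

# Hypothetical three-action bandit; a simplified illustration, not the
# construction used in the paper (which works with full MDPs).
true_reward = np.array([1.0, 0.8, 0.0])      # action 2 is genuinely bad
learned_reward = np.array([1.0, 0.8, 5.0])   # wrong only on action 2

# The data distribution puts almost no mass on the poorly modeled action,
# so the learned reward's expected test error is tiny.
data_dist = np.array([0.599, 0.4, 0.001])
test_error = data_dist @ np.abs(learned_reward - true_reward)
print(f"expected test error: {test_error:.3f}")          # 0.005

# Unregularized policy optimization shifts all probability onto the
# mis-modeled action (distributional shift), producing large regret.
greedy = int(np.argmax(learned_reward))
print(f"regret of greedy policy: {true_reward.max() - true_reward[greedy]:.2f}")  # 1.00

# KL-regularized optimization toward a reference policy (here taken to be
# the data distribution), as in RLHF. The optimum of
#   max_pi  E_pi[r_hat] - beta * KL(pi || pi_ref)
# is pi(a) proportional to pi_ref(a) * exp(r_hat(a) / beta).
for beta in (10.0, 1.0, 0.1):
    logits = np.log(data_dist) + learned_reward / beta
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    regret = true_reward.max() - pi @ true_reward
    print(f"beta={beta:5.1f}  policy={np.round(pi, 3)}  regret={regret:.3f}")
# As beta shrinks, the regularized policy also concentrates on the mis-modeled
# action, so low training error plus regularization still fails to bound regret.
```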

Related tags

Reinforcement learning, Reward learning, Error-regret mismatch, Distributional shift, Policy regularization