Articles related to "奖励函数" (reward function)
The perils of under- vs over-sculpting AGI desires
少点错误
2025-08-05T18:20:02.000000Z
Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners
cs.AI updates on arXiv.org
2025-07-28T04:43:08.000000Z
Application of LLM Guided Reinforcement Learning in Formation Control with Collision Avoidance
cs.AI updates on arXiv.org
2025-07-23T04:03:25.000000Z
Misalignment from Treating Means as Ends
cs.AI updates on arXiv.org
2025-07-16T04:29:03.000000Z
A Review of Reward Functions for Reinforcement Learning in the context of Autonomous Driving
cs.AI updates on arXiv.org
2025-07-15T04:24:18.000000Z
We Talked with Three University Professors About the Worsening Problem of AI Hallucinations
36kr
2025-07-15T03:24:15.000000Z
Researchers Propose a Causal Bellman Equation That Yields Optimal Agents Faster in Certain Online Learning Algorithms
MIT Technology Review - This Week's Hot List
2025-07-13T16:21:35.000000Z
Improving Trust Estimation in Human-Robot Collaboration Using Beta Reputation at Fine-grained Timescales
cs.AI updates on arXiv.org
2025-07-09T04:02:04.000000Z
Feature-Based vs. GAN-Based Learning from Demonstrations: When and Why
cs.AI updates on arXiv.org
2025-07-09T04:01:51.000000Z
Geek Talk | Choosing Between Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), and Optimizing the Reward Function
掘金 (Juejin) AI
2025-06-22T09:02:23.000000Z
From RLHF and PPO to GRPO for Training Reasoning Models: The Reinforcement Learning Primer You Need
机器之心
2025-06-22T07:57:51.000000Z
If Life Is a Reinforcement Learning Process, What Is Your Reward Function?
橘子汽水铺
2025-04-19T07:13:18.000000Z
The Theoretical Reward Learning Research Agenda: Introduction and Motivation
少点错误
2025-02-28T19:22:06.000000Z
SFT Is Not Required! Reasoning Models Can Acquire Long Chain-of-Thought Ability Through RL Alone; Tsinghua-CMU Team Cracks the Black Box
智源社区
2025-02-10T05:07:14.000000Z
Reward Hacking in Reinforcement Learning
Lil'Log
2024-12-02T04:05:33.000000Z
Policy Gradient Algorithms
Lil'Log
2024-11-09T05:43:41.000000Z
Anthropic: ↩️ We find that models generalize, without explicit training, from easily-discoverable dishonest strategies like sycophancy to more conce...
AnthropicAI on Twitter
2024-06-18T06:33:36.000000Z
Anthropic: ↩️ We designed a curriculum of increasingly complex environments with misspecified reward functions. Early on, AIs discover dishonest str...
AnthropicAI on Twitter
2024-06-18T06:33:36.000000Z