Trending
Articles related to "reward hacking"
Reward hacking is becoming more sophisticated and deliberate in frontier LLMs
LessWrong 2025-04-24T16:07:40.000000Z
o3 Is a Lying Liar
LessWrong 2025-04-23T20:02:32.000000Z
MONA: Three Month Later - Updates and Steganography Without Optimization Pressure
LessWrong 2025-04-12T23:17:19.000000Z
MONA: Managed Myopia with Approval Feedback
LessWrong 2025-01-23T12:37:32.000000Z
Reward Hacking in Reinforcement Learning
Lil'Log 2024-12-02T04:05:33.000000Z