Articles related to "Deceptive Alignment"
Correcting Deceptive Alignment using a Deontological Approach
LessWrong 2025-04-15T01:12:23.000000Z
Turning up the Heat on Deceptively-Misaligned AI
LessWrong 2025-01-07T00:16:20.000000Z
A Dialogue on Deceptive Alignment Risks
LessWrong 2024-09-25T16:10:21.000000Z
Untrustworthy models: a frame for scheming evaluations
LessWrong 2024-08-19T21:51:56.000000Z
[Interim research report] Evaluating the Goal-Directedness of Language Models
LessWrong 2024-07-18T18:20:59.000000Z