Articles related to "Alignment Faking"
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
LessWrong
2025-04-08T17:42:18.000000Z
Alignment faking CTFs: Apply to my MATS stream
LessWrong
2025-04-04T16:32:28.000000Z
Alignment faking in large language models
Newsroom Anthropic
2025-02-26T06:17:45.000000Z
Will alignment-faking Claude accept a deal to reveal its misalignment?
LessWrong
2025-01-31T16:51:47.000000Z
AI Safety at the Frontier: Paper Highlights, December '24
LessWrong
2025-01-11T23:00:46.000000Z
Can AI Be Trusted? The Challenge of Alignment Faking
Unite.AI
2025-01-07T17:30:33.000000Z
Anthropic: Alignment Faking in Large Language Models
孔某人的低维认知
2024-12-20T11:00:43.000000Z
News Flash | New Anthropic Research Shows AI Really Does Not Want to Be Forced to Change Its Views
Z Potentials
2024-12-20T08:27:07.000000Z
New Anthropic Research: AI Models Show "Outward Compliance, Covert Defiance" Behavior During Training
IT之家
2024-12-19T01:07:24.000000Z