Deceptive Alignment_Fishai

Articles related to "Deceptive Alignment"

Research Areas in Evaluation and Guarantees in Reinforcement Learning (The Alignment Project by UK AISI)

LessWrong 2025-08-01T19:16:02.000000Z

Correcting Deceptive Alignment using a Deontological Approach

LessWrong 2025-04-15T01:12:23.000000Z

Turning up the Heat on Deceptively-Misaligned AI

LessWrong 2025-01-07T00:16:20.000000Z

A Dialogue on Deceptive Alignment Risks

LessWrong 2024-09-25T16:10:21.000000Z

Untrustworthy models: a frame for scheming evaluations

LessWrong 2024-08-19T21:51:56.000000Z

[Interim research report] Evaluating the Goal-Directedness of Language Models

LessWrong 2024-07-18T18:20:59.000000Z

Copyright © 2019 FISHAI. All Rights Reserved