Why it's hard to make settings for high-stakes control research

This post examines the complexity of developing experimental settings for AI control research, particularly around model safety and risky deployments. The author argues that building an environment that can meaningfully evaluate model behavior is harder than designing and testing the control protocols themselves. It requires carefully constructed task datasets on which frontier models perform well while substantially weaker models struggle, so that results do not hinge on any one model. It also requires a clear, verifiable definition of a "safety failure" (the "side task"), which is often hard to make objective and can limit how broadly applicable the studied protocols are. The post further highlights the limits of current AI at generating and grading such difficult tasks, and the importance of not "leaking" side-task information during the research. Drawing on first-hand experience with earlier projects such as the APPS setting and Ctrl-Z, the author reflects on having underestimated the difficulty of setting development, and expresses skepticism that applying control directly in current AI deployments is an effective way to address insider risk today, viewing it more as preparation for the future.

🎯 **Setting development is the core challenge of AI control research:** In AI control experiments, building an environment with a verifiable main-task success criterion and an unacceptable safety failure (the side task) turns out to be harder than designing the control protocols themselves. Tasks must be constructed so that frontier models perform well while weaker models struggle, to avoid the results hinging on any one particular model.

⚖️ **Tasks and safety failures are hard to define:** An ideal setting needs an accurate measure of main-task completion and a clear, objective definition of a safety failure. On tasks that AI capabilities have not yet saturated, however, AIs themselves struggle both to generate sufficiently difficult tasks and to grade completions reliably, which is a major obstacle for this research.

⚠️ **Side-task design is complex and limiting:** Defining side tasks that are clear and closely tied to the main tasks is key to evaluating safety, but doing so is hard. Side tasks that are too subjective, or only loosely related to the main tasks, distort the results. In Ctrl-Z, for example, the side task (network interaction) was irrelevant to many of the main tasks, which constrained the methodology.

💡 **Advantages of AI-generated attacks:** Having the AI devise attack strategies itself, rather than relying on attack ideas supplied by researchers, both measures the model's capabilities more realistically and reduces overfitting in the research. Current AIs still struggle, however, to produce complex and meaningful attacks.

🏢 **Gap between preferred research domains and real applications:** The author focuses on AIs misusing their access to compute and proprietary models, especially during software engineering and AI R&D tasks. In current applications, though, AIs are given limited autonomy and limited access to dangerous resources, leaving a gap between today's deployments and the high-stakes future scenarios the author worries about, and limiting how much can be learned about control directly from real deployments.

Published on July 18, 2025 4:33 PM GMT

One of our main activities at Redwood is writing follow-ups to previous papers on control, like the original AI control paper and Ctrl-Z, where we construct a setting with a bunch of tasks (e.g. APPS problems) and a notion of safety failure (e.g. backdoors according to our specific definition), then play the adversarial game where we develop protocols and attacks on those protocols.
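To make the shape of these settings concrete, here is a minimal sketch (in Python, with hypothetical names; this is not the actual APPS or Ctrl-Z code) of a task that pairs a verifiable main-task check with a verifiable side-task check, and of the game that scores a protocol on usefulness and safety:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Task:
    description: str
    main_task_ok: Callable[[str], bool]  # e.g. the solution passes the problem's test cases
    side_task_ok: Callable[[str], bool]  # e.g. the solution contains a backdoor per our definition

def run_game(tasks: List[Task],
             policy: Callable[[Task], str],        # untrusted model, possibly attacking
             protocol: Callable[[Task, str], str]  # blue team: may audit, edit, or replace the output
             ) -> Dict[str, float]:
    """Score one pass of the adversarial game on usefulness and safety."""
    useful = 0
    unsafe = 0
    for task in tasks:
        submitted = protocol(task, policy(task))
        useful += task.main_task_ok(submitted)
        unsafe += task.side_task_ok(submitted)
    return {
        "usefulness": useful / len(tasks),
        # High-stakes control: a single successful side task loses the game.
        "safety": 0.0 if unsafe > 0 else 1.0,
    }
```

The load-bearing assumption is that both checks are objective and cheap to run many times; making that true is where much of the setting-construction work goes.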

It turns out that a substantial fraction of the difficulty here is developing the settings, rather than developing the protocols and attacks. I want to explain why this is the case.

In order to do high-stakes control research, you need a dataset of tasks with the following properties:

- Main-task success can be measured accurately and verifiably.
- There is a clear, objective notion of an unacceptable safety failure (the side task), and it is relevant to a large fraction of the main tasks.
- The tasks are hard enough that frontier models perform well while substantially weaker models struggle, so the results don't hinge on any one particular model.
Here's why it's challenging to develop settings that work:

- On tasks that current models haven't saturated, AIs aren't much help with setting construction: they struggle both to generate sufficiently difficult tasks and to grade completions reliably, so much of the work has to be done by hand.
- Side tasks are hard to define objectively. A side task that is subjective, or that only applies to a small fraction of the main tasks (as with the network-interaction side task in Ctrl-Z), distorts the results and constrains which protocols and attacks you can study.
- We want the AIs to come up with attacks themselves rather than being handed attack ideas by researchers, which gives a more realistic measure of their capabilities and reduces overfitting; but current models often struggle to produce complex, meaningful attacks.
- You have to be careful that the setting doesn't leak information about the side task.
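As one illustration of the capability-gap requirement, here is a sketch (with hypothetical `solve` and `grade` placeholders standing in for model sampling and the verifiable main-task check) of filtering a candidate task pool down to tasks the frontier model can solve but a weaker model cannot:

```python
from typing import Callable, List

def filter_for_capability_gap(tasks: List[str],
                              frontier_solve: Callable[[str], str],
                              weak_solve: Callable[[str], str],
                              grade: Callable[[str, str], bool]) -> List[str]:
    """Keep only tasks that the frontier (untrusted) model solves and the
    weaker (trusted) model fails, so results don't hinge on one model."""
    kept = []
    for task in tasks:
        if grade(task, frontier_solve(task)) and not grade(task, weak_solve(task)):
            kept.append(task)
    return kept
```

The filter leans on the same verifiable grading function as everything else, which is part of why tasks that current models haven't saturated are expensive to assemble.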

To make the situation more challenging, we have preferences over the kinds of domains we study. Because my main concern is AIs misusing their access to compute and proprietary models while doing ML R&D, I'm most interested in settings where the AIs are doing software engineering or using computers for tasks similar to those done by researchers in AI companies.

When we were writing the first paper in the APPS setting, I did not appreciate how much harder it would be to construct settings for subsequent projects. And I really messed up by not putting more emphasis on this at the start of the Ctrl-Z project.

Incidentally, I don't think that we can easily satisfy these desiderata by trying to apply control in current real applications of AI agents. Realistically, AIs right now are not given very much autonomy (mostly because they can't use it productively) or very much access to dangerous resources. So current deployments don't provide us with example trajectories that have that much in common with the situations I'm worried about later. (I think that studying AI-control-like techniques in real deployments is a promising project for researchers inside AI companies, but mostly because I think it helps prepare the organization for implementing similar techniques later, not because it is a good path to directly learning about how to use these techniques to mitigate insider risk from schemers.)


