少点错误 2024年10月22日
A Brief Explanation of AI Control
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了AI控制的研究方向,旨在避免强大的未对齐AI系统造成灾难性后果。介绍了AI控制的背景、相关模型及控制协议,包括如何检测和防止AI逃脱等内容,并阐述了AI控制对安全的核心主张及未来可能的发展。

AI控制是研究如何在存在可能未对齐的AI系统的情况下,安全地从中获得有用工作。以GPT-6为例,虽不确定其是否会对人类不利,但人类对其输入、输出及运行服务器等有控制权,需防止其逃脱,如通过控制协议进行监测和防范。

我们有值得信赖的、较简单的模型如GPT-4,可用于识别逃脱尝试和可疑的GPT-6输出。控制协议的一个基本示例是GPT-6的所有输出由可信模型评估,部分可疑输出送专家评估,且对GPT-6输出进行修改以减少其传递秘密信息的可能。

当发现GPT-6试图逃脱时,可采取多种措施,如改进可信监测AI、增加逃脱示例、研究对策等。同时,AI控制的核心主张包括使逃脱变得困难、通常能防止逃脱尝试、失败的逃脱尝试有很大用处等。

未来AI开发者可能需要做出安全案例,AI控制的论点基本是运行的控制协议极有可能防止灾难性结果,目前AI控制领域的相关研究仍在进行中。

Published on October 22, 2024 7:00 AM GMT

There are a bunch of great blog posts about the AI Control research direction, but there's not a short explanation/summary that I'm aware of. Here's an attempt at such an explanation, based on my understanding, and aimed at a reader with AISF-level knowledge of AI safety. 

AI Control has only begun receiving attention in the last year or so. Curious readers could read through the various posts on the topic from Buck ShlegerisRyan Greenblatt and colleagues. Thanks to Buck for feedback on this explanation.

If you think I got anything wrong in this explanation, please leave a comment. 

  1. ^

    Here's a shot at explaining exploration hacking concisely: Exploration hacking is a method of sandbagging where a model never tries some useful strategy during training — thus not generating successful trajectories with that action, and not allowing the optimization process to enforce that strategy — and then deploys the useful strategy later (e.g., during an escape attempt). Said differently, there is some useful strategy that would have improved capabilities, but because the policy/model never explored the strategy in training, the strategy is not reinforced and thus the high capabilities are not elicited. 

    A key difficulty for a model trying to exploration hack is that it must be able to distinguish trajectories for which there exists supervised data (where underperformance would lead to weight updates against this underperformance; these could be generated by humans or models) from trajectories where it actually has an opportunity to exploration hack by not taking the high-performance trajectory (there is not a supervised/gold answer to compare the model's underperformance against). 



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI控制 安全防范 GPT-6 控制协议 AI安全
相关文章