Putting up Bumpers

 

This article discusses a strategy for addressing the risks of early artificial general intelligence (AGI) systems, called "Putting up Bumpers." The core of the strategy is to build multiple largely independent lines of defense that detect and correct potential misalignment in a model. The author argues that through pretraining, finetuning, multiple alignment audits, and continuous post-deployment monitoring, the risk of AGI causing severe harm can be substantially reduced even before the alignment problem is fully solved. The approach is like the bumpers in bowling: even a poorly aimed ball still reaches the pins.

🎯 Core strategy: "Putting up Bumpers" aims to establish multiple lines of defense that catch and correct misalignment in AGI systems in time, lowering risk even when the alignment problem is not fully solved.

⚙️ Implementation steps: pretraining; finetuning (using human preference data, Constitutional AI-style model-generated preference data, and more); multiple alignment audits (drawing on mechanistic interpretability, behavioral red-teaming, etc.); and continuous post-deployment monitoring.

⚠️ Key audits: running alignment audits repeatedly while the model is being built is critical. Methods such as mechanistic interpretability and prompting outside the human/assistant frame surface early signals of potential misalignment.

🔄 Iterative improvement: if signs of misalignment appear, rewind the finetuning process and make another attempt at alignment using what has been learned. Repeat until the system appears well aligned.

🧐 Continuous monitoring: after deployment, keep strict monitoring measures in place so that new signs of misalignment can be caught and addressed promptly, keeping the AGI system safe.

Published on April 23, 2025 4:05 PM GMT

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.

To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
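To make the value of largely independent layers concrete, here is a minimal Python sketch of the arithmetic behind layered defenses. The per-audit catch probabilities are made-up placeholders (the post gives no numbers); the point is only that if the layers fail roughly independently, the chance that misalignment evades all of them shrinks multiplicatively.

```python
# Illustrative arithmetic for layered, largely-independent defenses.
# The catch probabilities below are hypothetical placeholders, not measurements.
audit_catch_probability = {
    "mechanistic_interpretability_audit": 0.5,
    "behavioral_red_teaming": 0.6,
    "post_deployment_monitoring": 0.7,
}

p_all_layers_miss = 1.0
for layer, p_catch in audit_catch_probability.items():
    p_all_layers_miss *= 1.0 - p_catch  # each roughly independent layer must miss

print(f"P(misalignment evades every layer) = {p_all_layers_miss:.2f}")  # 0.5 * 0.4 * 0.3 = 0.06
```

The multiplicative shrinkage is exactly what breaks down if the layers share a blind spot, which is why the post stresses that the lines of defense should be largely independent.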

On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this:

1. Pretraining: We start with a pretrained base model.

2. Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.

3. Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We'd draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.

4. Hitting the Bumpers: If we see signs of misalignment (perhaps warning signs for generalized reward-tampering or alignment-faking), we attempt to quickly, approximately identify the cause.

5. Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we've learned in the previous step.

6. Repeat: We repeat this process until we've developed a complete system that appears, as best we can tell, to be well aligned. Or, if we are repeatedly failing in consistent ways, we change plans and try to articulate as best we can why alignment doesn't seem tractable.

7. Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep substantial monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to step 4.
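As a reading aid, here is a schematic Python sketch of the loop above. Every helper (finetune, run_alignment_audits, diagnose, deploy_with_monitoring) is a toy stand-in invented so the control flow runs end to end; none of this is Anthropic's actual training or auditing pipeline.

```python
import random

# Toy stand-ins for the processes in steps 1-7 above; they exist only so the
# control flow runs end to end and do not model any real pipeline.
def finetune(base_model, lessons):
    return {"base": base_model, "lessons": list(lessons)}

def run_alignment_audits(model):
    # Pretend each lesson learned from a previous failure reduces the chance
    # that the audits surface warning signs on the next attempt.
    p_warning = 0.8 / (1 + len(model["lessons"]))
    return ["warning sign"] if random.random() < p_warning else []

def diagnose(warnings):
    return f"approximate cause of: {warnings[0]}"

def deploy_with_monitoring(model):
    print("Deployed, with post-deployment monitoring kept in place.")

def bumpers_loop(base_model, max_attempts=10):
    """Schematic version of the bumpers loop: finetune, audit, bounce off, repeat."""
    lessons = []
    for _ in range(max_attempts):
        model = finetune(base_model, lessons)      # steps 1-2: pretrained base + finetuning
        warnings = run_alignment_audits(model)     # step 3: audits as primary bumpers
        if not warnings:
            deploy_with_monitoring(model)          # step 7: secondary bumpers after rollout
            return model
        lessons.append(diagnose(warnings))         # step 4: hitting the bumpers
        # step 5: bouncing off -- rewind and retry, carrying the lessons forward
    # step 6, failure branch: consistent failures => argue why alignment looks intractable
    raise RuntimeError("Repeated consistent failures; alignment does not look tractable here.")

bumpers_loop("pretrained-base-model")
```

The design point the post emphasizes is that step 7 feeds back into step 4: deployment is not a terminal state but another bumper, and warning signs observed in the field send the process back into the loop.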

The hypothesis behind the Bumpers approach is that, on short timelines (i) this process is fairly likely to converge, after a reasonably small number of iterations, at an end state in which we are no longer observing warning signs for misalignment and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.

We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.

(Continued in linked post. Caveat: I'm jumping into a lot of direct work on this and may be very slow to engage with comments.)


