The Sweet Lesson: AI Safety Should Scale With Compute

This article examines a key question in AI safety research: how increasing compute can be used to make AI systems safer. It focuses on several research directions, including deliberative alignment, AI control, debate, Bengio's Scientist AI, and Anthropic- and ARC-style interpretability, all of which aim to improve our understanding and control of AI behavior by spending more inference-time compute or by improving the underlying optimization. The article stresses the value of combining theoretical analysis with empirical validation, reflects on what the idealized limits of these procedures look like, and takes as its ultimate goal the construction of safer, more reliable AI systems.

💡 Deliberative alignment: combine chain-of-thought reasoning with constitutional-style AI, using inference-time compute to improve safety. For example, Guan et al. strengthen safety by having the model consult a constitution during its reasoning process.

🕹️ AI control: design adversarial control protocols in which red teams and blue teams play longer-running games to obtain more reliable estimates of whether scheming could succeed during deployment. The adversarial setup is what improves the safety evaluation.

🗣️ Debate: design debate protocols in which longer, deeper debates between AI assistants increase confidence in their honesty or other desired properties. The debate protocols proposed by Irving et al. are intended to encourage honesty and reliability.

🧪 Bengio's Scientist AI: develop safety guardrails whose estimates of catastrophic risk become more reliable as inference-time compute increases. In the approach proposed by Bengio et al., predictions converge toward the correct answer as more compute is spent.

🔍 Interpretability: develop interpretability tools such as SAEs, which first learn the SAE features that matter most for minimizing reconstruction loss. Work by Templeton et al. and Gross et al. focuses on making models more interpretable so that their behavior can be better understood and controlled (a minimal sketch of the SAE objective appears after this list).
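
To make the interpretability point concrete, here is a minimal sketch of a sparse autoencoder trained against a reconstruction loss plus an L1 sparsity penalty. It is a toy stand-in rather than the actual setups of Templeton et al. or Gross et al.; the module names, dimensions, and the l1_coeff value are illustrative assumptions.

```python
# Minimal sparse-autoencoder (SAE) sketch: reconstruction loss + L1 sparsity.
# Illustrative only; not the Templeton et al. / Gross et al. implementations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term: how well the learned features explain the activations.
    recon_loss = (reconstruction - activations).pow(2).mean()
    # Sparsity term: encourages each activation to be explained by few features.
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Toy usage: random tensors stand in for a real network's activations.
sae = SparseAutoencoder(d_model=64, d_features=512)
acts = torch.randn(32, 64)
feats, recon = sae(acts)
loss = sae_loss(acts, feats, recon)
loss.backward()
```

The point relevant to the post is that the objective is explicit: the features that matter most for driving down reconstruction loss are the ones learned first as more training compute is spent.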

Published on May 5, 2025 7:03 PM GMT

A corollary of Sutton's Bitter Lesson is that solutions to AI safety should scale with compute. Let's consider a few examples of research directions that aim at this property, including deliberative alignment, AI control, debate, Bengio's Scientist AI, and Anthropic- and ARC-style interpretability:

[O]ur proposed method has the advantage that, with more and more compute, it converges to the correct prediction . . . . In other words, more computation means better and more trustworthy answers[.] — Bengio et al. (2025)
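
As a toy rendering of the "more computation means better and more trustworthy answers" property, the snippet below estimates a risk probability by Monte Carlo sampling, so the estimate's error shrinks roughly as 1/sqrt(budget). This is not Bengio et al.'s amortized Bayesian oracle (see footnote 1); the true_risk value and the sampling setup are invented purely for illustration.

```python
# Toy illustration of an estimator that converges as compute grows.
# NOT Bengio et al.'s method; just a Monte Carlo stand-in for the
# "more compute -> better, more trustworthy answer" property.
import random

def risk_estimate(num_samples: int, true_risk: float = 0.03, seed: int = 0) -> float:
    """Estimate P(harm) from sampled rollouts; error shrinks ~ 1/sqrt(num_samples)."""
    rng = random.Random(seed)
    harmful = sum(rng.random() < true_risk for _ in range(num_samples))
    return harmful / num_samples

for budget in [100, 10_000, 1_000_000]:
    est = risk_estimate(budget)
    print(f"compute budget = {budget:>9,} samples -> estimated risk = {est:.4f}")
```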

In ARC's current thinking, the plan is not to have a clear line of separation between things we need to explain and things we don't. Instead, the loss function of explanation quality should capture how important various things are to explain, and the explanation-finding algorithm is given a certain compute budget to build up the explanation of the model behavior by bits and pieces. — Matolcsi (2025)
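
One way to caricature the compute-budgeted, piece-by-piece picture in the quote above is as an anytime greedy loop: keep adding whichever explanation piece buys the most importance per unit of compute until the budget runs out. The sketch below is my own toy rendering with invented pieces, importances, and costs, not ARC's actual loss function or search algorithm.

```python
# Toy "anytime" explanation builder: spend a compute budget on the pieces that
# most reduce an importance-weighted explanation loss. Invented setup, not ARC's.
from dataclasses import dataclass

@dataclass
class Piece:
    name: str
    importance: float  # how important this behavior is to explain
    cost: int          # compute needed to incorporate it

def build_explanation(pieces: list[Piece], budget: int) -> list[str]:
    chosen = []
    remaining = budget
    # Greedy: best importance-per-unit-compute first.
    for piece in sorted(pieces, key=lambda p: p.importance / p.cost, reverse=True):
        if piece.cost <= remaining:
            chosen.append(piece.name)
            remaining -= piece.cost
    return chosen

candidates = [
    Piece("refusal circuit", importance=0.9, cost=40),
    Piece("date formatting", importance=0.1, cost=5),
    Piece("deceptive planning head", importance=0.95, cost=70),
]
print(build_explanation(candidates, budget=80))
```

A larger budget lets the loop pick up the expensive but important pieces it had to skip, which is the sense in which explanation quality scales with compute.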

For these procedures to work, we want the ideal limits of these procedures to be safe. In a recent talk, Irving suggests we should understand the role of theory as being about analyzing these limiting properties (assuming sufficient resources like compute, data, and potentially simplifying assumptions) and the role of empirics as being about checking if the conditions required by the theory seem to hold in practice (e.g., has learning converged? Are the assumptions reasonable?).[2]
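
On the empirical side of this division of labor, a question like "has learning converged?" can be operationalized as a mechanical check on training metrics. The function below is one generic example of such a check; the window size and tolerance are chosen arbitrarily for illustration.

```python
# Example of an empirical check for one of the theory's preconditions:
# "has learning converged?" read here as "has the loss stopped improving?"
import math

def has_converged(loss_history: list[float], window: int = 100, tol: float = 1e-3) -> bool:
    """True if the mean loss over the last window improved by less than tol
    relative to the window before it."""
    if len(loss_history) < 2 * window:
        return False
    prev = sum(loss_history[-2 * window:-window]) / window
    last = sum(loss_history[-window:]) / window
    return (prev - last) < tol

# Toy usage with a synthetic, plateauing loss curve.
losses = [math.exp(-0.01 * t) for t in range(1000)]
print(has_converged(losses))
```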

From this perspective, the role of theory is to answer questions about what these procedures converge to in their idealized limits, given sufficient resources.

Taking a step back, obtaining understanding and guarantees about idealized limits has always been the aim of theoretical approaches to AI safety.

The main update is that now we roughly know the kinds of systems that AGI might consist of, and this enables a different kind of theory strategy: instead of studying the ideals that emerge from a list of desiderata, you can study the limiting/convergence properties of the protocols we are currently using to develop and align AGI.

In practice, we probably need both approaches. As we continue to scale up compute, data, model size, and the RL feedback loop, we might find ourselves suddenly very close to the limits, wishing we had known better what lay in store.

  1. ^

    Bengio's proposal involves a very particular technical vision for using GFlowNets to implement an (approximate, amortized) Bayesian oracle that he believes will scale competitively (compared to the inference costs of the agent being monitored).

  2. ^

    Irving's talk is about scalable oversight, but the general framing applies much more generally.


