LessWrong — July 27, 2024
How the AI safety technical landscape has changed in the last year, according to some practitioners

 

Since September 2023, the focus areas and technical strategies of the AI safety field have shifted noticeably. AI control has risen in prominence, "model organisms" of misalignment have improved, sparse autoencoders have become a focal point in mechanistic interpretability, dangerous capability evals have become a research priority for labs and governments, discussion of adversarial robustness and AI safety cases has deepened, and attention to information security has grown.

🌟 Control on the rise: Zach Stein-Perlman notes that AI control is gaining prominence in the AI safety field and may become an important research direction.

🧬 Model organisms improved: An anonymous contributor mentions that "model organisms" of various kinds of misalignment have improved substantially, e.g. results published by Anthropic and unpublished work at Redwood.

🔬 Sparse autoencoders: Neel Nanda believes sparse autoencoders have become far more important in mechanistic interpretability, though they have yet to conclusively beat baselines on real-world tasks.

🔍 Dangerous capability evals: Neel Nanda also notes that evaluating dangerous AI capabilities has become a major focus of labs, governments, and other researchers, with a clearer direct link between technical work and governance.

🛡️ Adversarial robustness: An anonymous contributor points out that adversarial robustness research, especially defenses against jailbreaks, has been strengthened by commercial motivations and commitments.

📚 Safety case discussion: Overall, there is broader discussion of how to make safety cases for powerful AI, with Anthropic and GDM making sincere and reasonable efforts in this area.

📜 Scalable oversight papers: Since last September, many multi-author papers on scalable oversight have been published; although some experiments were unsuccessful, research in this area continues.

🔒 Information security prioritized: An anonymous contributor observes that infosec has gained attention within Constellation, with cyberevals and practically securing model weights as the main project areas.

Published on July 26, 2024 7:06 PM GMT

I asked the Constellation Slack channel how the technical AIS landscape has changed since I last spent substantial time in the Bay Area (September 2023), and I figured it would be useful to post this (with the permission of the contributors to either post with or without attribution). Curious if commenters agree or would propose additional changes!

This conversation has been lightly edited to preserve anonymity.

Me: One reason I wanted to spend a few weeks in Constellation was to sort of absorb-through-osmosis how the technical AI safety landscape has evolved since I last spent substantial time here in September 2023, but it seems more productive to just ask here "how has the technical AIS landscape evolved since September 2023?" and then have conversations armed with that knowledge. The flavor of this question is like, what are the technical directions and strategies people are most excited about, do we understand any major strategic considerations differently, etc -- interested both in your own updates and your perceptions of how the consensus has changed!

Zach Stein-Perlman: Control is on the rise

Anonymous 1: There are much better “model organisms” of various kinds of misalignment, e.g. the stuff Anthropic has published, some unpublished Redwood work, and many other things

Neel Nanda: Sparse Autoencoders are now a really big deal in mech interp and where a lot of the top teams are focused, and I think are very promising, but have yet to conclusively prove themselves at beating baselines in a fair fight on a real world task
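For readers less familiar with the technique: a sparse autoencoder (SAE) learns an overcomplete dictionary of features for a layer's activations, trained to reconstruct them under an L1 sparsity penalty so that only a few features fire per input. A minimal numpy sketch (the sizes and loss weighting here are illustrative assumptions, not from any particular paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Dictionary is overcomplete: more features than activation dimensions.
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into nonnegative features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # ReLU keeps features sparse-able
    x_hat = f @ W_dec + b_dec                # linear decoder
    return f, x_hat

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty encouraging sparse features."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return recon + sparsity

x = rng.normal(size=(8, d_model))            # a batch of fake activations
f, x_hat = sae_forward(x)
print(f.shape, x_hat.shape)                  # (8, 64) (8, 16)
```

In practice the weights are trained by gradient descent on `sae_loss` over real model activations; the hope is that individual learned features correspond to interpretable concepts.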

Neel Nanda: Dangerous capability evals are now a major focus of labs, governments and other researchers, and there's clearer ways that technical work can directly feed into governance

(I think this was happening somewhat pre September, but feels much more prominent now)

Anonymous 2: Lots of people (particularly at labs/AISIs) are working on adversarial robustness against jailbreaks, in part because of RSP commitments/commercial motivations. I think there's more of this than there was in September.

Anonymous 1: Anthropic and GDM are both making IMO very sincere and reasonable efforts to plan for how they’ll make safety cases for powerful AI.

Anonymous 1: In general, there’s substantially more discussion of safety cases

Anonymous 2: Since September, a bunch of many-author scalable oversight papers have been published, e.g. this, this, this. I haven't been following this work closely enough to have a sense of what update one should make from this, and I've heard rumors of unsuccessful scalable oversight experiments that never saw the light of day, which further muddies things.

Anonymous 3: My impression is that infosec-flavoured things are a top ~3 priority area for a few more people in Constellation than last year (maybe twice as many people as last year??).

Building cyberevals and practically securing model weights at frontier labs seem to be the main project areas people are excited about (followed by various kinds of threat modelling and security standards).



