LessWrong, May 1, 19:17
Don't focus on updating P doom

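The article argues that current AI alignment discussion concentrates too heavily on changes in the probability of doom, P(doom), which does not help solve practical problems. The author holds that details such as concrete mechanisms and policy choices, for example "emergent misalignment ≠ jailbreaking", are more valuable. Shifting attention this way enables more effective updating through social mimicry, which in turn advances AI alignment research. Rather than tracking an abstract probability of doom, dig into concrete technical details and potential risks in order to make better-informed decisions.

💡 Current AI alignment discussion centers on changes in the probability of doom, P(doom), but this focus does not help policymakers, alignment researchers, or engineers improve their decisions or anticipate the future.

🧪 Attention to details such as concrete mechanisms and policy choices, for example 'Emergent misalignment ≠ jailbreaking, implying there are multiple vectors that correspond to "bad stuff"!', is more valuable than simply discussing the probability of doom.

🗣️ Rather than tracking an abstract probability of doom, dig into concrete technical details and potential risks in order to make better-informed decisions. If a piece of news bears on nothing concrete, give it less attention.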

Published on May 1, 2025 11:10 AM GMT

Motivation: Improving group epistemics.

TL;DR: (Changes to) P doom/alignment difficulty are a shibboleth dominating conversations, distorting epistemics. Instead, focus on updates to your gears-level models. Focus on near and concrete details instead of far and vague abstractions.


People frequently opine on whether some alignment news is good or bad for alignment. "Training AI on insecure code makes it swear! That's good for alignment! P doom is down!" Or "Training AI on insecure code doesn't make it jailbroken! That's bad for alignment! P doom is up!"

On the margin, this is not helpful. It focuses group attention on how one single number, "P(doom)", moves. (Or, perhaps worse yet, how this changes the difficulty of "alignment"). 

Why is this bad? For two reasons, leaving aside the illegibility of "Doom". Firstly, it isn't especially useful to know that P(doom) has moved a bit. It doesn't let policymakers, alignment researchers, engineers or others improve their decision-making, or help them in anticipating the future. A change in this single number doesn't automatically propagate through their world models. It doesn't tell them how to implement foom liability to reduce race dynamics, or how to train a model to have a faithful chain of thought, and so on. How could it? There's too much inferential distance between changes to this number and all the messy details of their models which are needed to, you know, actually reduce P(doom).

Which leads into the second point. Focusing on P doom gets you a shibboleth number (or rather, "update"). The reason for this is that it is very hard to spread more than a couple of bits of info via gossip, and group attention largely determines what those bits will be. You may well say "Ah, but I don't just list my changes to P doom, I then say why they happened!". Sorry, that's not gonna work, because the most salient common thread across all the gossip will be "P(doom) decreased/increased", as that's what people pay attention to. This number/change then gets baked into group and individual identities, becoming further divorced from the rest of the world model, which distorts our collective sense-making.

Which is why I'd like you to focus on other things instead!

What things? Details, non-meta stuff, concrete mechanisms, policy choices etc. Perhaps statements like 'Emergent misalignment ≠ jailbreaking, implying there are multiple vectors that correspond to "bad stuff"!' Or 'Emergent misalignment implies that SGD training on some bad stuff leads to strengthening many bad circuits together!' Or "Emergent misalignment found an anti-normativity vector!" Or, "Can we replicate this with activation steering?"

In an ideal world, we could be even more concrete than that, but alas, you only get a few words to choose, which forces us to remove detail. However, I maintain that the above claims are more likely to actually lead to better updating through social mimicry than "P doom went up!" or "P doom went down!".

And if some bit of news doesn't, in fact, bear on anything concrete that you can think of? Well, don't focus on that bit of news! Or at least focus on it less than you'd otherwise be inclined to. Don't direct the group's attention towards it, as that will probably result in very meta, unhelpful discussions.

"But Algon", you might say, "I already am concrete! I have receipts!" In which case, I say after verification "Yes, you do, and thank you for saying that! It was a genuine contribution to the conversation, and was unusually high signal-to-noise. You're doing good : )  But you can do better yet! Mention changes to p doom, or its equivalents, less frequently on the margin. It's way too meta, too abstract!" If I'm pulling a number out of my arse, I'd say less than 5% of your conversations should focus on what P(doom) is. Focus more on the details.


