Motivation: Improving group epistemics.
TL;DR: (Changes to) P(doom) and alignment difficulty have become a shibboleth dominating conversations and distorting epistemics. Instead, focus on updates to your gears-level models. Focus on near, concrete details instead of far, vague abstractions.
People frequently opine on whether some piece of alignment news is good or bad. "Training AI on insecure code makes it swear! That's good for alignment! P(doom) is down!" Or "Training AI on insecure code doesn't make it jailbroken! That's bad for alignment! P(doom) is up!"
On the margin, this is not helpful. It focuses group attention on how one single number, "P(doom)", moves. (Or, perhaps worse yet, on how this changes the difficulty of "alignment".)
Why is this bad? For two reasons, leaving aside the illegibility of "doom". Firstly, it isn't especially useful to know that P(doom) has moved a bit. It doesn't let policymakers, alignment researchers, engineers, or others improve their decision-making, or help them anticipate the future. A change in this single number doesn't automatically propagate through their world models. It doesn't tell them how to implement foom liability to reduce race dynamics, how to train a model to have a faithful chain of thought, and so on. How could it? There's too much inferential distance between changes to this number and all the messy details of their models, which are what's needed to, you know, actually reduce P(doom).
Which leads into the second point. Focusing on P(doom) gets you a shibboleth number (or rather, "update"). The reason is that it is very hard to spread more than a couple of bits of info via gossip, and group attention largely determines what those bits will be. You may well say "Ah, but I don't just list my changes to P(doom), I then say why they happened!" Sorry, that's not gonna work, because the most salient common thread across all the gossip will be "P(doom) decreased/increased", as that's what people pay attention to. This number/change then gets baked into group and individual identities, becoming further divorced from the rest of the world model, which distorts our collective sense-making.
Which is why I'd like you to focus on other things instead!
What things? Details, non-meta stuff, concrete mechanisms, policy choices, etc. Perhaps statements like "Emergent misalignment ≠ jailbreaking, implying there are multiple vectors that correspond to 'bad stuff'!" Or "Emergent misalignment implies that SGD training on some bad stuff strengthens many bad circuits together!" Or "Emergent misalignment found an anti-normativity vector!" Or "Can we replicate this with activation steering?"
In an ideal world, we could be even more concrete than that, but alas, you only get a few words, which forces you to drop detail. Still, I maintain that the above claims are more likely to lead to better updating through social mimicry than "P(doom) went up!" or "P(doom) went down!"
And if some bit of news doesn't, in fact, bear on anything concrete that you can think of? Well, don't focus on that bit of news! Or at least focus on it less than you'd otherwise be inclined to. Don't direct the group's attention towards it; doing so will probably result in very meta, unhelpful discussions.
"But Algon", you might say, "I already am concrete! I have receipts!" In which case, I say after verification "Yes, you do, and thank you for saying that! It was a genuine contribution to the conversation, and was unusually high signal-to-noise. You're doing good : ) But you can do better yet! Mention changes to p doom, or its equivalents, less frequently on the margin. It's way too meta, too abstract!" If I'm pulling a number out of my arse, I'd say less than 5% of your conversations should focus on what P(doom) is. Focus more on the details.