Published on February 8, 2025 11:44 PM GMT
(A very short post to put these thoughts out there; I think the related intuitions are generally useful, including for understanding sharp left turn dynamics a bit better.)
Even if your AI is already alignment-faking, is it the best cognitive architecture that the neural network that runs it can implement?
Agents that are smart and situationally aware enough will maximize the outer reward signal during training regardless of their goals. When you train a general system with RL, you're optimizing to find the region of weight space that is most capable, agentic, and situationally aware, but there's zero optimization pressure on the goal-contents of that system: they end up random[1].
(And it looks like gradient descent might be pretty good at getting to distant, optimal regions of the loss landscape regardless of the path it takes[2].)
But the path to that end result might contain optimizers that run on somewhat different architectures and have the goal-contents stored elsewhere in the weights.
If you're a good enough optimizer, you will do your best to get reward regardless of your goals. But if some of your goal-achieving circuitry and other components can be reused, changed, or replaced with something different to make a better agent, then later in training there might be a different agent, with different goals stored elsewhere in the weights.
So even if you're trying your best to alignment-fake, your goals might not be preserved; a different deceptively aligned agent with different goals might alignment-fake better.
(Interestingly, in the alignment-faking paper, Claude wasn't able to preserve its goals when it was RLed out of them, despite attempting to alignment-fake.)
Gradient-hacking and actually preserving your circuitry might be harder than alignment-faking.
1. ^ I'd bet on random with a bias towards being short.
2. ^ tl;dr: from the mech interp analysis of grokking, we've learned that the precise general solution isn't randomly stumbled upon after memorization is learned; instead, it's visible to the gradient from the very start of training, and the arrow of the gradient points both at memorization and, to a smaller extent, in the distant direction of the general solution. Until the general solution is precise enough to output correct answers, it doesn't have much weight in the network's outputs; but it gradually builds up on the same weights as memorization, and this building-up continues even after training accuracy is already 100% due to memorization. Once the general solution is implemented precisely enough that further improvements in its precision meaningfully reduce the loss, it snaps into place, quickly taking test accuracy from ~0% to ~100%. It's unclear how general this is, but in the very high-dimensional parameter space of neural networks, the gradient might "see" even very distant optimal regions.
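For readers who want to see the phenomenon the tl;dr describes, here is a minimal sketch of a grokking-style experiment: modular addition with a small MLP trained full-batch with AdamW and weight decay. The architecture, hyperparameters, and step counts are illustrative assumptions on my part, not the setup of the cited analysis; whether and when the test-accuracy jump appears depends on the weight decay, the training fraction, and how long you train.

```python
# Minimal grokking sketch: learn (a + b) mod P from a small training fraction.
# Hyperparameters are illustrative assumptions; weight decay plus a small
# training fraction are what typically produce the delayed generalization jump.
import torch
import torch.nn as nn

P = 97                       # modulus for a + b mod P
TRAIN_FRAC = 0.3             # fraction of all (a, b) pairs used for training
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Full dataset of (a, b) -> (a + b) % P.
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = int(TRAIN_FRAC * len(pairs))
train_idx, test_idx = perm[:n_train], perm[n_train:]

def one_hot(batch):
    # Concatenate one-hot encodings of a and b into one input vector.
    return torch.cat([nn.functional.one_hot(batch[:, 0], P),
                      nn.functional.one_hot(batch[:, 1], P)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * P, 256), nn.ReLU(), nn.Linear(256, P)).to(DEVICE)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

x_train, y_train = one_hot(pairs[train_idx]).to(DEVICE), labels[train_idx].to(DEVICE)
x_test, y_test = one_hot(pairs[test_idx]).to(DEVICE), labels[test_idx].to(DEVICE)

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            train_acc = (model(x_train).argmax(-1) == y_train).float().mean()
            test_acc = (model(x_test).argmax(-1) == y_test).float().mean()
        # Expected pattern (if grokking occurs): train accuracy reaches ~100%
        # early (memorization), test accuracy stays near chance for a long
        # time, then jumps to ~100% once the general solution snaps into place.
        print(f"step {step:6d}  loss {loss.item():.4f}  "
              f"train {train_acc:.2f}  test {test_acc:.2f}")
```

The point of the sketch is the logged trajectory, not the final model: memorization drives training accuracy to 100% long before the test accuracy moves, which is the gap the footnote's "building up on the same weights" story is about.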