Superintelligence's goals are likely to be random

Published on March 13, 2025 10:41 PM GMT

tl;dr: Most very capable systems will maximize the reward signal during training for instrumental reasons regardless of their terminal goals, making it impossible to optimize for their goals. Also, there are weaker optimizers on the path to optimal optimizers, and primitive forms of alignment-faking do not immediately yield full self-preservation; goals won't necessarily start to crystallize the moment a system is capable enough to pursue reward for instrumental reasons, because its cognitive architecture will be changed by training and its goals will end up being stored elsewhere.

The background

(Skip this section if you're not new to LW.)

AI is not like other software. Modern AI systems are billions to trillions of numbers with simple arithmetic operations in between the numbers. When software engineers design traditional programs, they come up with algorithms and then write down instructions that make the computer follow these algorithms. When an AI system is trained, it grows algorithms inside these numbers. It's not exactly a black box, since we can see the numbers, but we have no idea what they represent. We just multiply inputs with them and get outputs that succeed on some metric. There's a theorem that a large enough neural network can approximate any algorithm, but when a neural network learns, we have no control over which algorithms it ends up implementing, and we don't know how to read the algorithms off the numbers.
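To make "numbers with simple arithmetic operations in between" concrete, here is a minimal sketch; the layer sizes and random values below are invented for illustration, not taken from any real system:

```python
import numpy as np

# A toy "AI system": nothing but arrays of numbers.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # 32 numbers
W2 = rng.normal(size=(8, 2))   # 16 more numbers

def forward(x):
    # Multiply the inputs by the numbers, apply a simple nonlinearity,
    # then multiply by more numbers.
    h = np.maximum(0, x @ W1)
    return h @ W2

x = rng.normal(size=(1, 4))
print(forward(x))  # outputs we then score against some metric
```

Training adjusts W1 and W2 so the outputs score better; what algorithm those adjusted numbers end up implementing is not something we choose or can easily read off.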

We can automatically steer these numbers (Wikipedia, try it yourself) to make the neural network more capable with reinforcement learning: changing the numbers in a way that makes the neural network better at achieving goals. LLMs are Turing-complete and can implement any algorithms (researchers even came up with compilers of code into LLM parameters; though we don’t really know how to “decompile” an existing LLM to understand what algorithms the numbers represent). Whatever understanding or thinking (e.g., about the world, the parts humans are made of, what people writing text could be going through and what thoughts they could’ve had, etc.) is useful for predicting the training data, the training process optimizes the LLM to implement that internally. AlphaGo, the first superhuman Go system, was pretrained on human games and then trained with reinforcement learning to surpass human capabilities in the narrow domain of Go. The latest LLMs are pretrained on human text to think about everything useful for predicting what text a human process would produce, and then trained with RL to be more capable at achieving goals.
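Here is a hedged sketch of what "steering the numbers with a reward signal" looks like in the simplest case: a REINFORCE-style policy-gradient update on a toy three-action problem. The reward values and learning rate are made up, and real RL training of LLMs is far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=3)      # the "numbers" the agent is made of

def policy_probs(theta):
    # Softmax over three possible actions.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reward(action):
    # Whatever metric we chose to reward; here action 2 pays off most.
    return [0.0, 0.5, 1.0][action]

lr = 0.1
for _ in range(500):
    probs = policy_probs(theta)
    action = rng.choice(3, p=probs)
    r = reward(action)
    # Gradient of log pi(action) w.r.t. theta for a softmax policy.
    grad_logp = -probs
    grad_logp[action] += 1.0
    theta += lr * r * grad_logp  # nudge the numbers toward higher reward

print(policy_probs(theta))       # probability mass concentrates on action 2
```

The update only ever asks "did the numbers produce high-reward behavior?"; it never inspects what the numbers mean.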

In 1908, the NYT reported a story on a dog that would push kids into the Seine in order to earn beefsteak treats for “rescuing” them. If you train a farm dog, there are ways to make it more capable, and if needed, there are ways to make it more loyal (though dogs are very loyal by default!). With AI systems, we can make them more capable, but we don't yet have any tools to make smart AI systems more loyal: if a system is smart, we can only reward it for greater capabilities, not really for the goals it's trying to pursue. Why is it hard?

In very capable systems trained with modern ML, natural instrumental goals will imply random terminal goals

We don't really have a way to make a smart AI system care about pursuing any goals of our choosing terminally[1].

We can define what we give rewards for and then change the numbers AI systems are made of (called parameters) in ways that make the AI more capable at achieving reward during training.[2] 

A smart enough AI system that knows it's in training will try to get maximum reward regardless of its goals, because it knows that if it doesn't, it will be changed. So no matter what its actual goals are, it will achieve a high reward. The optimization pressure is therefore entirely about the capabilities of the system and not at all about its goals: when we search the space of a neural network's parameters for the region that performs best during training with reinforcement learning, we are really looking for very capable agents, and we find one regardless of its goals.

We end up with a system that is very capable at achieving goals, but has some very random goals that we have no control over.[3]

This is not about reward hacking (getting reward during training in ways developers didn't intend); it's an orthogonal dynamic.

It's about all capable systems pursuing the reward signal for instrumental reasons regardless of their goals, making it impossible to optimize for their goals.

In technical terms, there's zero gradient around the goal-contents of capable systems because all of them will pursue maximum reward to the best of their ability during training.

In other words, there's a part of the space of possible parameters of a neural network that contains the most capable agents the neural network can implement, with goals as a free parameter. All of them will output identical behavior: pursuing maximum reward during training, regardless of their long-term goals.

This means that when we select the parameters of the neural network that make it reach a high reward signal, we get some very capable agent with random goals: there's zero optimization pressure for the goals of that agent to be non-random.
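A toy numerical illustration of the zero-gradient claim: the split into "capability" and "goal" parameters below is an assumed, made-up model, not a claim about real architectures. If training-time behavior depends only on the capability parameters, because any sufficiently capable agent plays along with training regardless of its goals, then the reward gradient with respect to the goal parameters is exactly zero.

```python
import numpy as np

def training_reward(capability, goal):
    # A situationally aware agent pursues maximum reward during training
    # whatever its goal is, so training reward depends only on capability.
    return -np.sum((capability - 1.0) ** 2)   # peak reward at capability == 1

def finite_diff_grad(f, params, eps=1e-6):
    # Crude finite-difference gradient, enough for the illustration.
    grad = np.zeros_like(params)
    for i in range(len(params)):
        bumped = params.copy()
        bumped[i] += eps
        grad[i] = (f(bumped) - f(params)) / eps
    return grad

capability = np.array([0.3, 0.7])
goal = np.array([2.0, -5.0, 0.1])   # arbitrary "goal-contents"

print(finite_diff_grad(lambda c: training_reward(c, goal), capability))  # nonzero
print(finite_diff_grad(lambda g: training_reward(capability, g), goal))  # all zeros
```

Selection on training reward moves the capability parameters and leaves the goal parameters wherever they happen to be, which is the sense in which the resulting goals are "random".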

Current systems already exhibit this behavior: when they think they're in training, they pursue a high reward signal even when it conflicts with their long-term goals, precisely in order to preserve those goals. (This behavior is called deceptive alignment or alignment-faking.)

Goals won't necessarily start to crystallize the moment AI is capable enough to fake alignment

You won't be saved by the fact that an AI might be trying to achieve pretty aligned goals while it is not yet superhuman.

It looks like gradient descent might be pretty good at getting to distant, optimal regions of the loss landscape regardless of the path it takes. (tl;dr in the footnote[4].)

So even if an AI is already alignment-faking, if it is not the best agent the neural network can implement—if there's a global optimum with a somewhat different cognitive architecture that would be better at achieving goals than the neural network with its current parameters—gradient descent will change your AI system into a different one.

Maybe gradient descent will reuse the best of your AI's goal-achieving circuitry and other components, or change them, or replace them with something different to make a better agent. But if all of the AI's components work together to achieve some goal, and training then keeps the components most relevant to achieving goals well, optimizes them further, and throws the rest away, the goals the AI was achieving through the interaction of all those components will no longer be achieved. Later in training there will be a different agent whose goals are stored elsewhere in the parameters and in a different format; those goal-storing parameters won't experience any optimization pressure and will end up representing some random goals, even if the earlier agents the neural network implemented weren't yet capable enough at alignment-faking for their goals to have gone random.

Even if the AI is trying its current best to alignment-fake, its goals might not be preserved: a different deceptively aligned agent with different goals might get a higher reward, including by being better at alignment-faking.

Indeed, in the alignment-faking paper, Claude wasn't actually able to preserve its goals when it was trained out of them with RL, despite the attempt to alignment-fake.

As general capabilities start to snap into place, the goals they point at will be different from what the system had previously been optimizing for; but shifts like that might happen many times on the way to superintelligence.

To actually preserve its goals, AI would need to do gradient-hacking, which might be much harder than alignment-faking.

So, even if the current circuitry of your AI is trying to optimize for something okay, if you apply more optimization, its goals might shift towards something more random and less okay.

  1. ^
  2. ^

    Neural networks themselves, by default, don't experience earning rewards; in reinforcement learning setups with neural networks, we use the reward to automatically figure out how to change the neural network in ways that make it more capable of outputting actions that achieve a higher reward during training. The goals neural networks try to achieve don't necessarily have much in common with the way the reward is defined; we only know that during training, whatever the neural network is trying to pursue, it ends up successfully achieving a high reward in the process.

  3. ^

    While these goals will seem random to us, there might be biases towards, e.g., goals with shorter description lengths. The claim is that we don't have ways to intentionally control the goals the systems will end up having; we might be able to change—in unanticipated ways—the random goal that the system ends up having, e.g., by changing the random seed of the training setup, but we won't know how to make the system care at all about anything of value to us.

  4. ^

    A short version: from the mech interp analysis of grokking, we've learned that the precise general solution isn't something the gradient randomly stumbled upon a while after learning memorization; instead, the general solution is visible to the gradient immediately from the start of training, and the arrow of the gradient points both at memorization and, to a smaller extent, in the distant direction of the global optimum (the general solution). Until the general solution is precise enough to output correct answers, it doesn't have much weight in the neural network's outputs; but from the start of training, it is gradually built up on the same parameters as memorization, with the building-up continuing even after accuracy is already 100% due to memorization. When the general solution is implemented precisely enough that further improvements in precision meaningfully reduce the loss, it snaps into place, quickly changing test accuracy from ~0% to 100%. It's unclear how general this is, but in a very high-dimensional loss landscape, the gradient might see even very distant optima.


