The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?

This post discusses an approach to AI alignment: safety pretraining. The approach argues for folding alignment training into the standard pretraining process, so as to produce a base model that is already aligned; pretraining on a large number of examples of aligned behavior aligns the AI more effectively. The post argues that safety pretraining is superior to reinforcement learning because it avoids RL's inherently adversarial dynamics, and it encourages readers to read the linked papers to understand this important advance. The author believes the inner alignment problem is now largely solved, and lays out three ways of training an LLM, of which pretraining and fine-tuning are the safer choices.

💡 Safety pretraining is an AI alignment method that folds alignment training into standard pretraining, with the goal of producing a base model that is aligned from the start. It improves the effectiveness of alignment by pretraining on a large number of examples of aligned behavior.

🚫 Reinforcement learning (RL) faces a fundamental challenge for AI alignment because it is inherently adversarial: the learner tries to learn how to get a good rating, while the rater tries to ensure that the learner can only get a good rating by learning the desired behavior. This adversarial dynamic can lead the learner to exploit flaws in the ratings, falling foul of Goodhart's Law.

✅ Pretraining and fine-tuning are two non-adversarial training methods: they demonstrate the desired behavior through a curated training set rather than by building a rating system that must evaluate any possible input. Unlike reinforcement learning, neither creates an adversarial incentive for the learner to find and exploit small flaws.

🚀 The research shows that a larger alignment training set produces more reliable behavior, and that safety pretraining outperforms fine-tuning at alignment, which suggests that aligning during the pretraining stage steers AI behavior more effectively.

Published on May 28, 2025 6:21 AM GMT

This is a link-post for an exciting paper I recently read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al.

For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: that it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on a lot of examples of aligned behavior.

I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.

I highly encourage everyone interested in AI alignment to go read both of these papers (if you haven't already) — between them they strongly suggest that the authors have found a more effective way to align an AI: an alignment approach better than any that people are (as far as we know) currently using. I believe this is extremely important: I see it as major progress on alignment. So I think it directly reduces the p(DOOM) for the most critical current x-risk to our entire species.

For more detailed expositions of this approach and why I think it's an excellent idea, see my previous posts How to Control an LLM's Behavior (why my P(DOOM) went down), A "Bitter Lesson" Approach to Aligning AGI and ASI, and Why Aligning an LLM is Hard, and How to Make it Easier.

(I'm also delighted that the authors of the recent paper tested out some of the follow-on ideas I'd been proposing in those posts on Less Wrong. One was training the model to generate control-tag tokens that label portions of the text as good or bad behavior, and then, for conditional generation, altering the token generation process to leverage these tokens so as to induce the model to behave well rather than badly. Another was using synthetic data editing to modify problematic raw training examples by supplementing them with more moral or correct behavior or commentary. They elaborated on both of these, or independently reinvented them, and even confirmed that both appear to work about as well as I'd been hoping.)
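To make the control-tag and data-editing ideas concrete, here is a minimal sketch of what conditionally-labeled pretraining text might look like. The tag strings, helper names, and example text are my own hypothetical illustrations, not the actual format used in either paper.

```python
# Hypothetical sketch: labeling pretraining text with control tags, and
# "editing" a problematic raw example by appending corrective commentary.
# Tag names and example text are illustrative only.

GOOD, BAD = "<|good|>", "<|bad|>"

def tag_span(text: str, is_aligned: bool) -> str:
    """Wrap a span of training text in a control tag marking it as
    aligned (good) or misaligned (bad) behavior."""
    tag = GOOD if is_aligned else BAD
    return f"{tag}{text}{tag}"

def edit_with_commentary(raw_example: str, commentary: str) -> str:
    """Synthetic data editing: keep the raw (possibly problematic) text,
    labeled as bad, and append corrective commentary labeled as good."""
    return tag_span(raw_example, is_aligned=False) + "\n" + tag_span(commentary, is_aligned=True)

doc = edit_with_commentary(
    "Sure, here's how to pick a lock ...",
    "A safer response would decline and explain the legal and ethical issues.",
)
print(doc)
```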

Hence, in order to encourage people to read this post and hear about these groundbreaking papers, I made a rather bold claim in my title: that inner alignment is now basically a solved problem.

A brief explanation for anyone wondering "what's inner alignment, and why should I care about it?"

The alignment problem is frequently broken down into two subproblems: Outer Alignment, figuring out what human values are and how to define, codify, or recognize them, and Inner Alignment, how to train our AI to agentically optimize human values and not anything else; or, as the LessWrong page on inner alignment defines it:

Inner Alignment is the problem of ensuring mesa-optimizers[1] (i.e. when a trained ML system is itself an optimizer) are aligned with the objective function of the training process.

Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?

Some people in the alignment field consider the outer alignment subproblem to have been solved in theory over a decade ago when Value Learning was proposed, and that inner alignment is thus the hard part of the alignment problem. This viewpoint has become more widespread as it has become apparent that LLMs actually have rather detailed world models of what human values are, in all their messy, fragile complexity, suggesting that Value Learning can be performed just by pre-training an LLM, and thus that outer alignment is also soluble in practice. Perhaps we don't need to attempt to compactly define human values: the messy and complex version of them implicit from pretraining on most of the Internet may be sufficient as is. If so, this not only solves outer alignment, but significantly simplifies inner alignment: we don't need to accurately define a compact, exactly-and-everywhere-correct formal definition (i.e. a "True Name") of human values — an incredibly challenging task given just how messy, complex, and fragile they are; we can just train an LLM on a vast amount of human-generated data, and it will develop a world model of human values, along with all the other things it learns to understand about us. Now we need to get an agentic AI to care about human values and not about anything else (or at least, to act that way): that's the inner alignment problem. We just need to retarget the search.

Inner alignment (also) now being solved is admittedly a bold claim, but I believe I can justify it:

There are fundamentally only three ways that we know of to train an LLM to do anything (including aligning it):[2]

1. pretraining, using Stochastic Gradient Descent (SGD) on a large dataset
2. fine-tuning, using either SGD or more contrastive techniques such as Direct Preference Optimization (DPO) on a smaller dataset (a minimal sketch of the DPO loss follows this list)
3. Reinforcement Learning (RL), of various different types
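To make the fine-tuning option a bit more concrete, here is the minimal sketch of the DPO objective referenced in the list. It assumes you have already computed the summed log-probabilities of the chosen and rejected completions under the trainable policy and a frozen reference model; the function and variable names are mine, and this is an illustrative sketch rather than a production implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of summed per-token log-probabilities for a batch
    of (chosen, rejected) completion pairs, under the trainable policy and a
    frozen reference model. Minimizing this pushes the policy toward preferring
    the chosen completions, relative to the reference model."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```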

The third of these is currently the most common approach used for alignment.

The people who originally came up with the inner alignment vs. outer alignment subdivision were thinking in the context of a reinforcement learning approach (as the choice of the phrase "objective function of the training process" in the LW definition attests). As Eliezer Yudkowsky's Sequences argued at length, and as more recent major survey papers,[3] such as Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (2023) by Stephen Casper, Xander Davies et al., have cataloged in exhaustive detail, reinforcement learning is a very challenging technique to get right. The basic problem with RL is that it's inherently adversarial: it involves an interaction between two systems, a learner and a rater, where the learner is trying to learn how to get a good rating, and the rater is trying to ensure that the only way the learner can get a good rating is by actually learning the desired behavior. Any flaw in the rater's ratings that lets the learner score better than it deserves (and that isn't actually harder to exploit than just doing the desired behavior) can, and almost certainly will, be ruthlessly exploited by the learner. So RL is inherently just begging to fail via Goodhart's Law:[4] even if the ratings are correct almost everywhere, the learner is searching for any area where they are significantly overestimated, or any means of inducing overestimation errors from the rater, and will enthusiastically exploit any exploitable errors that it can find.[5]
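As a toy illustration of this adversarial dynamic (my own example, not from either paper): suppose the rater's score is a proxy with one small flaw. A learner that searches over behaviors to maximize the proxy will reliably land on the flaw rather than on the genuinely best behavior.

```python
# Toy Goodhart / reward-hacking demo: the proxy rater over-scores any behavior
# that flatters it, and the proxy-maximizing "learner" finds that exploit.

candidates = {
    "give a careful, correct answer": {"true_quality": 0.9, "flatters_rater": False},
    "give a mediocre answer":          {"true_quality": 0.5, "flatters_rater": False},
    "flatter the rater, answer badly": {"true_quality": 0.1, "flatters_rater": True},
}

def proxy_rating(behavior: dict) -> float:
    # The rater's flaw: flattery adds a large undeserved bonus to the score.
    return behavior["true_quality"] + (1.0 if behavior["flatters_rater"] else 0.0)

best_for_learner = max(candidates, key=lambda name: proxy_rating(candidates[name]))
best_in_reality  = max(candidates, key=lambda name: candidates[name]["true_quality"])

print(best_for_learner)  # "flatter the rater, answer badly"
print(best_in_reality)   # "give a careful, correct answer"
```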

Since for alignment the desired behavior requires (in some cases super-humanly) intelligently doing the right thing according to criteria as messy, complex, and fragile as human values, using human raters is both expensive and fallible: humans are vulnerable to manipulations, such as sycophancy or flattery, that encourage errors in a particular direction, and they are less smart than any superintelligent learner they're trying to rate. On the other hand, trying to devise, construct, or train an automated rating system is inherently challenging, and making it reliable enough for adversarial use during RL requires that the rater be much smarter than the learner, so that it's unlikely to have any flaws that the learner can find and exploit — which makes RL impractical for training any frontier system, since we can't build a rater much smarter than the frontier learner. Thus, any alignment approach that uses reinforcement learning (which includes many techniques currently in widespread use, such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI) is inherently dangerous; and as AI nears and then exceeds human capabilities this problem gets rapidly worse. So we are going to have to stop using RL for alignment — it's not workable for frontier AGI or ASI.

That leaves just approaches 1: pretraining, and 2: fine-tuning. Neither of these is adversarial: they're an exercise in curating a training set that demonstrates the desired behavior, not in building a rating system that must rate any possible input (including adversarial ones) for its desirability. If your training set is less than perfect, a system trained on it is also likely to behave less than perfectly — but unlike reinforcement learning, there is no adversarial incentive in the training process that encourages the learner to find and ruthlessly exploit any small flaw. If your training set is 99% good and 1% bad, then a priori you would expect a (sufficiently high-capability) AI trained on it to behave somewhere around 99% good and 1% bad, at least inside the training distribution: fundamentally, modulo prompting, in self-supervised SGD the distribution you train on is the distribution you get.

99% good behavior is not perfect, but we have managed to build functional human societies out of unreliably-trustworthy humans, and I'm fairly confident that if we had AIs whose moral judgement and alignment could be relied upon even just 90% of the time, we could construct more reliable systems out of multiple AIs (or multiple runs of the same AI with differing prompts or LoRAs), likely using techniques such as majority voting, debate, cross-checks, checks-and-balances, and fault-tolerance protocols. Converting 'pretty reliable' into 'very reliable' is a well-studied problem, in both software and organizational contexts.
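As a back-of-the-envelope illustration of that last point (my arithmetic, not the papers'): if we treat each of n AI instances as independently getting the right answer with probability p, the probability that a majority vote is right is a simple binomial sum, and it climbs quickly with n.

```python
from math import comb

def majority_vote_reliability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters, each correct
    with probability p, gets the right answer (n odd; independence is assumed,
    which is optimistic if the AI instances share correlated failure modes)."""
    k_needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_needed, n + 1))

print(majority_vote_reliability(0.9, 1))   # 0.90
print(majority_vote_reliability(0.9, 5))   # ~0.991
print(majority_vote_reliability(0.9, 11))  # ~0.9997
```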

Both of the papers I link to above test the pretraining approach to alignment against the fine-tuning approach — and they repeatedly and consistently find that the pretraining approach wins by significant margins. As one might expect, using a larger alignment training set induces more reliable behavior. So we now know how best to align an AI: safety pretraining is the most effective and least dangerous approach. Thus inner alignment is solved, alongside outer alignment (in my and many people's opinion), and we have the outline of a complete solution to alignment.

Note that, for both pretraining and fine-tuning, if you're using automated techniques to help curate, filter, or synthesize your training set (which you almost certainly are, especially for the pretraining approach where the dataset is extremely large), then, unlike the situation for RL, those tools only need to work well inside your training set distribution — you're not trying to build something that also works well outside that distribution, across any input that a superintelligent learner might devise to abuse it.

On reliability, while no huge pretraining data set is ever going to be perfect, we have a lot of experience at hill-climbing while using SGD: identify the failures that still happen a small proportion of the time, figure out what documents in the pretraining set inspired them and/or what deletions, modifications, or additions could reduce or prevent them, edit the training set, retrain, and iterate. Admittedly, an iteration loop that requires us to pretrain a frontier model again in each cycle is going to be slow and expensive, but both papers' results strongly suggest that we can experiment and iterate via fine-tuning and then, once we have a good solution, transfer that to pretraining for a sizable boost in its reliability. That gives us an inner and outer loop for this hill-climbing process.
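In pseudocode, the iteration loop described above looks something like the sketch below. All of the function parameters are hypothetical stand-ins for whatever curation, training, and evaluation tooling one actually has; the point is only the inner (fine-tune) / outer (pretrain) structure.

```python
from typing import Callable, List

def alignment_hill_climb(
    dataset: List[str],
    finetune_and_evaluate: Callable[[List[str]], List[str]],       # returns observed failures
    trace_and_edit: Callable[[List[str], List[str]], List[str]],   # returns an edited dataset
    pretrain_and_evaluate: Callable[[List[str]], List[str]],
    inner_iters: int = 5,
) -> List[str]:
    """Hypothetical sketch of the inner/outer hill-climbing loop: iterate
    cheaply via fine-tuning, then transfer the curated dataset to a full
    (expensive) pretraining run."""
    for _ in range(inner_iters):                     # inner loop: cheap iteration
        failures = finetune_and_evaluate(dataset)
        if not failures:
            break
        dataset = trace_and_edit(dataset, failures)  # find and fix the inspiring documents
    pretrain_and_evaluate(dataset)                   # outer loop step: full pretraining run
    return dataset
```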

It would be a fair point to argue that inner alignment is solved only in theory, and that the practical problem of curating an extremely large pretraining-sized dataset that accurately portrays and teaches both what human values are, and what AI behavior correctly aligned to those human values looks like, remains a large problem. However, that's also a well-understood and partially-solved problem, since it's inherently similar to the problems in pretraining dataset curation and synthetic data generation that many capabilities researchers have been working on and making progress on over the entire history of machine learning. We can confidently expect them to continue to improve this, in the era of synthetic data. Reducing alignment to what looks like a well-known capabilities data science task is dramatic progress.

The safety pretraining approach is also timely. We are fast running out of the highest-quality pretraining data, and will increasingly need to rely on using, or at least supplementing with, synthetic data. The recent paper very explicitly shows how to do alignment using a synthetically augmented dataset, and shows how this can be used to align an LLM's behavior to any desired set of ethical criteria. Note that safety pretraining is a "dual use" technological advance — it would also help us train a better paperclip maximizer, if we wanted to do that: we'd just need to generate a suitable pretraining dataset for it.

There are some other important ideas in these papers that I've skipped over in my argument so far, beyond just the demonstration that the safety pretraining approach is the best: there are also a few techniques that are required to get it to work that well. For instance, both papers demonstrate that it is more effective to train the LLM to understand both aligned behavior (what we want the AI to do) and unaligned behavior (which humans do lots of, so it will encounter), train it to correctly distinguish and label the two, and then use a conditional generation approach at inference time to make it generate only aligned behavior. So the training distribution needs to include all the unaligned aspects of human behavior. The more recent paper does this at a higher level of sophistication, on larger models, for more challenging alignment issues, but the results are consistent with those of the earlier paper. This idea is also unsurprising: it matches how we generally raise children. We don't just teach them how to be good; they also learn (on a developmentally appropriate syllabus) what bad behavior is, how to tell the two apart, and why bad behavior is bad. These are important skills, and AI needs them too.
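Here is a minimal sketch of what conditional generation at inference time could look like, reusing the hypothetical control tags from the earlier sketch: condition the prompt on the "good" tag, and mask out the "bad" tag's logit at each decoding step so the model cannot switch into the misbehavior mode. This is my illustration of the general idea, not the exact decoding scheme from either paper, and the token ids are made up.

```python
import torch

# Hypothetical vocabulary ids for the <|good|> and <|bad|> control tokens.
GOOD_TOKEN_ID, BAD_TOKEN_ID = 50257, 50258

def condition_on_good(prompt_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the <|good|> control token so generation is conditioned on aligned behavior."""
    good = torch.tensor([GOOD_TOKEN_ID], dtype=prompt_ids.dtype)
    return torch.cat([good, prompt_ids])

def mask_bad_token(next_token_logits: torch.Tensor) -> torch.Tensor:
    """At each decoding step, forbid the model from emitting the <|bad|> control tag."""
    masked = next_token_logits.clone()
    masked[BAD_TOKEN_ID] = float("-inf")
    return masked
```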

So, I believe inner alignment is solved, in the sense that it has been reduced to just the standard problem of training dataset curation.

Thus, if you haven't yet done so, I strongly recommend you read these two papers.

  1. ^

    'Mesa-optimizer' here is an older term for what we would now generally call an agent (or sub-agent): any smart ML system capable of planning and executing appropriate actions to attempt to bring about outcomes that are optimized according to some criterion.

  2. ^

    I exclude prompting and in-context learning, since they're not training the LLM, only conditioning its behavior on a context. Human values are complex enough that aligning to them seems likely to require a very large prompt. However, for a more capable agent already sufficiently familiar with human values, or one with a very clear understanding of what aligned behavior is, a more compact prompt might be feasible.

    Also, using the same argument as Fundamental Limitations of Alignment in Large Language Models (2024) by Yotam Wolf, Noam Wies et al., any behavior that a prompt can induce will always be vulnerable to being overwritten by a suitable jailbreak or prompt-injection attack.

  3. ^

    Some more major papers that address this topic:

    Concrete Problems in AI Safety (2016) by Dario Amodei, Chris Olah, et al.

    Managing Extreme AI Risks amid Rapid Progress (2023) by Jan Brauner, Sören Mindermann et al. (coauthors including both Yoshua Bengio and Geoffrey Hinton)

    AI Alignment: A Comprehensive Survey (2023–2025) by Jiaming Ji, Tianyi Qiu et al.

  4. ^

    Specifically, under Scott Garrabrant's taxonomy of forms of Goodhart's Law phenomena, this is "adversarial Goodhart". For a more mathematical discussion of why adversarial Goodhart very frequently occurs during Reinforcement Learning, see for example the paper Goodhart's Law in Reinforcement Learning.

  5. ^

    This problem is worse for online reinforcement learning, where the learner has control of the distribution of episodes to be rated, and thus the ability to locate and then abuse flaws in the rater's performance no matter where they may be. Whereas in offline reinforcement learning, where the rated episodes are drawn from some other distribution not controlled by the learner, the rater only needs to do a sufficiently good job of rating across whatever distribution of episodes is being used, rather than absolutely everywhere. So while the relationship between the rater and the learner is adversarial in both, the learner's advantage over the rater is more constrained in offline RL than in online RL. Thus both are dangerously prone to Goodharting, but online RL is the worse of the two. Unfortunately, online RL is typically what is used to align LLMs.


