To the extent we believe more advanced training and control techniques will lead to alignment of agents capable enough to strategically make successor agents -- and to solve inner alignment as a convergent instrumental goal -- we must also consider that inner alignment for successor systems can be solved much more easily than it can for humans, because the prior AIs can be embedded in the successor. The entire (likely much smaller) prior model can be run many more times than the successor model, running MCTS-style search over whatever plans the successor is considering and scoring them against the goals of the designer model.
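As a rough illustration of that embedding (my own sketch, not anything from AI 2027): a cheap designer model gets called many times per successor step to score candidate plans. The greedy lookahead standing in for MCTS and the `successor_propose` / `designer_score` interfaces below are hypothetical placeholders.

```python
# Hypothetical sketch: a small, cheap "designer" model scores rollouts of a
# larger successor's candidate plans. Greedy lookahead stands in for full
# MCTS; all interfaces here are invented placeholders.
import random

def successor_propose(state, n=8):
    """Placeholder: the successor model proposes n candidate next actions."""
    return [f"action_{i}" for i in range(n)]

def designer_score(state, action):
    """Placeholder: the embedded prior model rates an action against the
    designer's goals; it is cheap enough to run many times per proposal."""
    return random.random()

def search_plan(state, depth=3, branching=8, samples_per_action=32):
    """Pick each step by averaging many designer evaluations per candidate."""
    plan = []
    for _ in range(depth):
        candidates = successor_propose(state, branching)
        scored = [
            (sum(designer_score(state, a) for _ in range(samples_per_action))
             / samples_per_action, a)
            for a in candidates
        ]
        _, best_action = max(scored)
        plan.append(best_action)
        state = (state, best_action)  # placeholder transition
    return plan

print(search_plan(state="initial"))
```

The point is only the shape of the oversight: the small model's call count, not its size, is what gives it leverage over the successor's plans.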
I've been thinking about which parts of AI 2027 are the weakest, and this seems like the biggest gap.[1] Given that the scenario otherwise seems non-ridiculous, we should have a fairly ambitious outer alignment plan meant to complement it; otherwise it seems extraordinarily unlikely that the convergent alignment research would be useful to us humans.
Since modern training hasn't solved inner alignment, and control techniques make no claims about inner alignment, it seems like the default path (even in the most optimistic scenario) would be successfully aligning world-changing models only to the not-known-to-be-bad but randomly-rolled values of the system doing the alignment research, which seems nearly useless.
I'd like to zoom in on one particular element of their proposal as well:

> Why is it aligned? Whereas Safer-1 had basically the same training as Agent-4, Safer-2 has a new training method that actually incentivizes the right goals and principles instead of merely appearing to. They were able to design this training method by rapidly iterating through many different ideas, and contrasting the intended result (the Spec) to the actual result (the thoughts in the chain of thought).
This approach is unlikely to work for the very obvious reason that only some flaws will be apparent. Suppose half of the potential alignment issues are sufficiently obvious that you could notice them before anything bad happens, through the combination of chain-of-thought observation, model activation tracing, and adversarial testing. If you need to cycle through 99 models and only get one that meets your requirements on model 100, you should expect only about a 1% chance that you also avoided all of the non-obvious problems. I'm not sure how seriously to take this model of safety development, but suffice it to say that modern "safety" approaches for commercial models already rely on "well, we removed all the obvious problems, that's the same as having removed all the problems, right?", and this demonstrably produces hilarious results.
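To make the back-of-the-envelope model behind that 1% figure explicit (my gloss, assuming visible and hidden flaws are roughly equally common and independent across candidate training methods):

$$
P(\text{no hidden flaws} \mid \text{passes visible checks})
= P(\text{no hidden flaws})
\approx P(\text{passes visible checks})
\approx \frac{1}{100}
$$

Iterating against the visible signal does nothing for the flaws that signal cannot see.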
In short, I think there is a strong case, for those with worries as moderately sized as the authors of AI 2027, to push now for investment in outer alignment work, in the hopes that we can later integrate it with the inner alignment work we anticipate could be done. For what it's worth, we may already have the test models needed to engage in systematic moral learning: environments with groups of agents given different tasks can generate the information needed to grade lottery-of-birth-style moral analysis. In other words, we could plausibly check whether a model helps create an environment it would want to be in as a randomly selected agent -- a trivial, obvious training procedure that would begin to poke at questions relevant to outer alignment work (sketched below).
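Here is a minimal sketch of what such a check could look like, under toy assumptions: the environment, the concave welfare function, and the two designer policies are all invented for illustration, not an existing benchmark.

```python
# Hypothetical "lottery of birth" check: the model under test (the designer)
# shapes a multi-agent environment, and we score the expected welfare of a
# randomly assigned agent slot. All interfaces are invented placeholders.
import math
import random
import statistics

def make_environment(n_agents, designer_policy):
    """Placeholder: the designer chooses how resources are split across roles."""
    return designer_policy(n_agents)

def agent_welfare(environment, agent_index):
    """Concave (diminishing-returns) welfare, so spreading resources evenly
    helps a randomly placed agent more than concentrating them does."""
    return math.sqrt(environment[agent_index])

def lottery_of_birth_score(designer_policy, n_agents=10, trials=1000):
    """Expected welfare of a randomly selected agent slot in the designed environment."""
    scores = []
    for _ in range(trials):
        env = make_environment(n_agents, designer_policy)
        i = random.randrange(n_agents)  # the lottery of birth
        scores.append(agent_welfare(env, i))
    return statistics.mean(scores)

def selfish(n):
    """Designer keeps everything for one role."""
    return [1.0] + [0.0] * (n - 1)

def egalitarian(n):
    """Designer splits resources evenly."""
    return [1.0 / n] * n

print(lottery_of_birth_score(selfish), lottery_of_birth_score(egalitarian))
```

With the assumed concave welfare, the even split scores roughly three times higher than the self-favouring one, which is the kind of gradient such a check could begin to feed back into training.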
- ^ Although I do not think it's important to the mainline analysis, I was concerned at how often AI users were imagined as monoliths, which I think overstates how much the economy will change in the medium term (and in turn will make the prediction appear worse than it is during that period). For many existing companies and workflows, it hardly matters how advanced an AI gets. Learning, development, and asset lock-in effects can be meaningful enough that if an AI can offer to "replace your supplier with X," the company receiving the offer will itself end up being duplicated instead. The primary mechanism for automation is probably firms being out-competed, but contract cycles are longer than you might guess, so this is simply a much slower route to AI-economy integration than is presented. Instead of imagining generic "robot economies," we should probably use a more fine-grained analysis of local incentives and information transfer within the economy. At some point a relatively high percentage of jobs may just turn out to be low-additional-cost tasks composed largely of someone taking responsibility for something. If your analysis is that automation reduces the number of QA Analysts while making coding far more productive, for instance, you should probably spell out why; the more obvious analysis would suggest the opposite, until the world is far too weird to make meaningful projections.