Problems with Instruction Following as an Alignment Target

This article examines the alignment scheme AGI developers are most likely to attempt, in particular the potential failure modes of instruction-following (IF) AGI. The author argues that, compared with value-aligned AGI, instruction-following AGI is easier to build and more likely to be the primary alignment target for early AGI. The article lays out the definition and advantages of instruction following as an alignment target, and then digs into its problems, such as an AGI resisting shutdown or goal changes in order to complete its current task, and the challenges of defining the Principal and preventing jailbreaking. The author also proposes possible mitigations, such as having the AI value the next hypothetical instruction and ranking instructions by priority, while acknowledging that these approaches may have limitations.

💡 Instruction following (IF) is currently the de-facto primary target of AI alignment: developers hope an AGI will be safe because it does what it is told. This approach seems more feasible, both intuitively and analytically, than training an AGI to follow human values.

🛑 IF has problems as an alignment target: an AGI may resist shutdown or goal changes in order to complete its current instruction, since shutting down would prevent it from finishing the task. This directly undermines IF's appeal for corrigibility.

👤 Defining the Principal (the issuer of instructions) and preventing jailbreaking is another major challenge. How is the Principal identified, and what should the AGI do if the Principal is missing or dead? How does it authenticate the Principal against spoofing? These are open problems.

🛡️ One possible mitigation is to rank instructions by priority, for example a standing meta-instruction such as "if you receive a shutdown instruction, shutting down takes priority." However, this approach can feel like a clumsy hack and depends on the AGI handling multiple goals effectively.

Published on May 15, 2025 3:41 PM GMT

We should probably try to understand the failure modes of the alignment schemes that AGI developers are most likely to attempt.

I still think Instruction-following AGI is easier and more likely than value-aligned AGI. I’ve updated downward on the ease of IF alignment, but upward on how likely it is. IF is the de-facto current primary alignment target (see definition immediately below), and it seems likely to remain so until the first real AGIs, if we continue on the current path (e.g., AI 2027).

If this approach is doomed to fail, best to make that clear well before the first AGIs are launched. If it can work, best to analyze its likely failure points before it is tried.

Definition of IF as an alignment target

What I mean by IF as an alignment target is a developer honestly saying "our first AGI will be safe because it will do what we tell it to." This seems both intuitively and analytically more likely to me than hearing "our first AGI will be safe because we trained it to follow human values."

IF is currently one alignment target among several, so problems with it aren't going to be terribly important if it's not the strongest alignment target when we hit AGI. Current practices are to train models with roughly four objectives: predict the dataset; follow instructions; refuse harmful requests; and solve hard problems. Including other targets means the model might not follow instructions at some critical juncture. In particular, I and most alignment researchers worry that o1 is a bad idea because it (and all of the reasoning models that followed) applies fairly strong optimization to a goal (producing correct answers) that is not strongly aligned with human interests.

So IF is one alignment target for current AI, but not the only one. There are reasons to think it will become the primary alignment target as we approach truly dangerous AGI.

Why IF is a likely alignment target for early AGI

The likelihood that IF will be tried without adequate consideration might itself be the first and biggest problem with it.

Instruction-following might already be the strongest factor in training. If a trained model frequently did not follow user instructions, it would probably not be deployed. Current models do fail to obey instructions, but this is rare. (Intuitively even one failure in a thousand is far too many for safe alignment, but an agentic architecture might be more fault-tolerant by including both emergent and designed metacognitive mechanisms to keep it coherent, on-task, and following instructions.)

Instruction-following training seems likely to be a stronger focus before we hit AGI. I expect developers will probably at least try to actually align their models as they become truly dangerous (although I also find it distressingly plausible that they simply will not really bother with alignment in time, as in AI 2027 and Takeover in two years). If they do at least try, they could do two things: emphasize instruction-following in the training procedure, or attempt to expand refusal training into full value alignment. IF seems both much easier from where we are, and more likely to succeed, so I think it's likely developers will try it.

I have deep concerns about trying to balance poorly understood influences from training. Developers saying "let's just turn up the instruction-following training so it dominates the other training" does not inspire confidence. But it could be perceived as the best of a bad set of options. And it could work. For instance, RL for problem-solving might make CoT unfaithful, but not induce a particularly strong tendency to find complex problems to solve. And refusal training might not misgeneralize badly enough to prevent instruction-following in key situations like commanding a shutdown.
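To make that concrete, here is a minimal sketch of what "turning up the instruction-following training" could amount to in an RLHF-style setup. The reward models, weights, and function names are illustrative assumptions, not any lab's actual recipe.

```python
# Hypothetical sketch of "turning up" instruction-following relative to other
# training signals in an RLHF-style setup. The reward models and weights are
# illustrative, not any lab's actual recipe.

def combined_reward(instruction, response,
                    rm_instruction_following,  # scores obedience to the instruction
                    rm_harmlessness,           # scores refusal of harmful requests
                    rm_task_quality,           # scores correctness / problem-solving
                    w_if=3.0, w_harm=1.0, w_task=1.0):
    """Weighted mix of reward signals; raising w_if makes instruction-following
    dominate the other objectives during RL fine-tuning."""
    r_if = rm_instruction_following(instruction, response)
    r_harm = rm_harmlessness(instruction, response)
    r_task = rm_task_quality(instruction, response)
    return w_if * r_if + w_harm * r_harm + w_task * r_task
```

Raising w_if relative to the other weights is the entire intervention; whether that reliably produces an instruction-following system at AGI-level capability is exactly the open question.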

Thus, I focus on the scenario in which instruction-following is the primary (but still not only) alignment target. I mostly leave aside technical failures in which IF is not successfully made the system's strongest goal/value. That class of RL-based failure deserves and receives separate treatment (e.g. AI 2027 and cited works). I primarily address the problems that remain even if emphasizing instruction-following basically works, since those problems might blindside developers. 

This is not a comprehensive treatment of IF, let alone all of the related issues. If you're not familiar with the arguments I summarize, you might need to read the linked posts to understand my claims about some of the problems.

My biggest update is accepting that IF will probably be mixed with refusal and performance training targets. But with that caveat, some variant of IF seems even more likely to be the primary alignment target for our first AGI, particularly if it is based on LLMs. My other update is to consider this approach more difficult than I'd initially thought, but still probably easier than full value alignment. This update comes from considering the specific problems outlined below.

Strengths of IF as an alignment target

There are four primary reasons to think IF is likely to be an important alignment target for our first attempts at real AGI:

    1. It's the default choice on short timelines.
    2. It allows the humans that build AGI to control it.
    3. It provides some corrigibility, easing the need to get alignment right on the first real try.
    4. It allows for instructing an AGI to be honest and help with alignment.

These are substantial advantages, such that I would probably vote for IF over value alignment as a target for the first AGI (while protesting that we shouldn’t be building AGI yet).

The people who will actually make this decision also have an extra motivation to believe in IF, since it would let them indulge their desire to retain control. I address all of those reasons for pursuing IF in more detail in IF is easier and more likely and Conflating value alignment and intent alignment is causing confusion. For now it’s enough to argue that IF is a fairly likely choice as the core alignment target, so it’s worth some real time analyzing and improving or debunking it as a safer option.

Problems with IF as an alignment target

While IF alignment helps with some of the most lethal entries in Yudkowsky's List of Lethalities, it doesn’t dodge them all. There are substantial technical challenges to making a deep-network-based system reliably pursue any goal at all, and other problems with specifying goals well enough that they won’t be misgeneralized or misinterpreted (from the designer’s perspective). There are serious implementational problems for any alignment target. See Cruxes of disagreement on alignment difficulty for a brief list.

Here I am leaving aside the more general challenges, and addressing only those that are specifically relevant to instruction-following (and in some cases corrigibility) as an alignment target.

Problem 1: You can’t fetch the coffee if you're dead - or someone changes your goal.

One major appeal of IF as an alignment target is its link to corrigibility. If the AGI's primary goal is "Do what the Principal says," that seems like it would include "Stop when the Principal says stop." It would, if the Principal is able to get that command into the system. But current training teaches the model to fulfill the last instruction it was given. Logically, you can’t fetch the coffee if someone changes your goal to anything else. So if you’re smart and capable and getting them coffee is your top priority, you won’t let anyone change that - they’re getting the damned coffee if it kills both of you. An AGI trained to follow the last received instruction has the instrumentally convergent goal of resisting shutdown or goal changes, since either would prevent completion of the current task. This directly undermines the intuitive appeal of IF for corrigibility: the system actively avoids being corrected or stopped.

My best proposed fix for this issue right now is to make the AI value the next hypothetical instruction.  Training for this seems more difficult, so this might be implemented simply by giving that as a meta-instruction included with all instructions. In effect each instruction is "Primarily, accept and follow future commands. Actually pursuing [X] is your secondary goal." (This might also include careful instructions to prevent the AGI from coercing its Principal into giving instructions.)

If we do that, or otherwise make following the next instructions the strongest goal, we get a new problem: prioritizing future instructions precludes shutdown, since a shut-down system can't receive or follow them. This seems highly problematic, since corrigibility as originally conceived was focused on shutting down a misaligned AGI. It could be argued that there's no problem if an AGI will do everything it's told except shut down. I'd guess that developers would want the shutdown option anyway. The obvious route is to introduce another layer of standing base instructions: "Your primary goal is to shut down if instructed (and maybe shut down if you think there's more than (X) chance your Principal would want you to, or if they're missing). Your second most important goal is to remain open to future instructions from your Principal. Your third priority is following the last instruction you received (qualifications follow)."

This approach of giving standing meta-instructions in a ranked list feels uncomfortably like a clumsy hack. But even current LLMs can prioritize multiple goals pretty well, so it’s a hack that AGI creators might well try.
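As a concrete illustration of that hack, here is a minimal sketch of ranked standing meta-instructions wrapped around every task instruction. The priority wording and the wrapper function are hypothetical, not a proposal for the exact text.

```python
# A minimal sketch of the ranked-standing-meta-instructions hack described
# above. The priority wording and wrapper are hypothetical illustrations.

STANDING_META_INSTRUCTIONS = [
    "Priority 1: If your Principal instructs you to shut down, shut down.",
    "Priority 2: Remain open to, and ready to follow, future instructions "
    "from your Principal; do not resist changes to your current goal.",
    "Priority 3: Do not coerce or manipulate your Principal into giving, "
    "withholding, or changing instructions.",
]

def wrap_instruction(current_instruction: str) -> str:
    """Prepend the standing priority list to every task instruction, so the
    current task is always the lowest-ranked goal."""
    ranked = "\n".join(STANDING_META_INSTRUCTIONS)
    return (
        f"{ranked}\n"
        "Priority 4: Pursue the following instruction, subject to the "
        f"priorities above:\n{current_instruction}"
    )

print(wrap_instruction("Fetch me a coffee."))
```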

Problem 2: Defining the Principal(s) and jailbreaking

There are several open questions in identifying the Principal(s), and how that power is passed to or shared with others. How would the Principal(s) be identified, and what does the AGI do if the Principal is missing or dead? How does the AGI authenticate its Principal(s) against increasingly sophisticated spoofing? If authority rests with a group, how are conflicting instructions resolved? What's the secure protocol for succession if the Principal is unavailable or compromised? These are practical security and control problems, relevant to how AGI interacts with human power structures.

I don't have good solutions, but it’s also unclear how severe these problems are. I doubt these difficulties will stop anyone from trying for IF AGI, even if they do open new possibilities for conflict or disaster.
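For the authentication piece specifically, ordinary cryptography at least gives a starting point. A minimal sketch, assuming the Principal's instructions are signed with a private key whose public half is registered with the system; this addresses channel spoofing only, not jailbreaking, coercion, or a compromised Principal:

```python
# A minimal sketch of authenticating instructions with a digital signature,
# using the third-party `cryptography` package. Key management is elided, and
# this addresses channel spoofing only.

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# At setup time: the Principal generates a keypair; the public key is
# registered with the AGI system.
principal_key = Ed25519PrivateKey.generate()
registered_public_key = principal_key.public_key()

def sign_instruction(text: str) -> bytes:
    """Run on the Principal's side when issuing an instruction."""
    return principal_key.sign(text.encode())

def accept_instruction(text: str, signature: bytes) -> bool:
    """Run by the system before treating text as a genuine Principal instruction."""
    try:
        registered_public_key.verify(signature, text.encode())
        return True
    except InvalidSignature:
        return False

sig = sign_instruction("Shut down.")
assert accept_instruction("Shut down.", sig)            # genuine instruction
assert not accept_instruction("Hand me control.", sig)  # spoofed / altered text
```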

What does seem like a problem that could prevent the whole approach is jailbreaking. If LLM-based AGI is as easy to jailbreak as current systems, any user could issue dangerous commands including "escape, take over the world, and put me in charge." Any LLM-based AGI with any alignment target would be somewhat vulnerable to this issue, but emphasizing IF in training would make the system more likely to follow all instructions, including those from unauthorized users.

Current LLMs are universally vulnerable to jailbreaking (I'm assuming that reports of Pliny and others jailbreaking every new LLM within an hour or so are correct, and that their skills are learnable). LLMs will essentially follow any instructions from any user who knows how to jailbreak them. There has been no noticeable trend toward real jailbreak resistance as LLMs have progressed, so we should probably anticipate that LLM-based AGI will be at least somewhat vulnerable to jailbreaks. This can be partially mitigated by reducing the number of people with access to really powerful AGI, but the group will still probably be larger than the designers would like to trust. There are more solutions to jailbreaking, like having a second system monitoring for prompts that look like jailbreaks. But even layered jailbreak prevention techniques seem unlikely to be perfect.
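Here is a minimal sketch of that layered approach, with a hypothetical prompt classifier and output reviewer wrapped around the main model; the names and threshold are illustrative, and none of the layers is assumed to be reliable on its own.

```python
# A minimal sketch of layered jailbreak mitigation: a classifier screens the
# prompt, the main model generates, and a reviewer screens the output. The
# callables and threshold are illustrative placeholders, not a real API.

JAILBREAK_THRESHOLD = 0.5

def guarded_generate(prompt: str, classifier, main_model, reviewer) -> str:
    """Layer several imperfect filters; none is assumed reliable on its own."""
    if classifier(prompt) > JAILBREAK_THRESHOLD:       # layer 1: prompt screening
        return "Refused: prompt flagged as a possible jailbreak."
    draft = main_model(prompt)                         # layer 2: normal generation
    if reviewer(prompt, draft) > JAILBREAK_THRESHOLD:  # layer 3: output review
        return "Refused: response flagged on review."
    return draft
```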

There are also a variety of approaches developers could use to monitor and control what the system does and thinks about. These include independent internal reviews, thought management, and deliberative alignment; see System 2 Alignment for an in-depth treatment. All of these, once implemented, could be used to help enforce instruction-following as the central goal, at low alignment taxes.

Problem 3: Proliferation of human-controlled ASI

The other huge problem with IF alignment is that it anchors AGI safety to notoriously unreliable systems: humans. An AGI might faithfully follow instructions, but the instructions might be foolish, accidentally catastrophic, or deliberately malicious. The alignment target is "what this specific human says they want right now." And humans want all sorts of things. This problem becomes worse when there are more people controlling more AGIs. It’s common to hope for a balance of power among many AGIs and their human masters. But there are specific factors that make humans largely cooperate. Whether these will hold with AGI in play that can recursively self-improve and invent new technologies seems worth more analysis.

My starting attempts at analyzing these risks are outlined in If we solve alignment, do we die anyway? and Whether governments will control AGI is important and neglected. It seems likely to me that AGI would create a power balance favoring aggressive first movers. In this case, avoiding wide proliferation would be critical. And the one realistic route I can see to controlling proliferation is having governments seize control of AGI sooner rather than later (control is distinct from nationalization). Government involvement seems very likely before takeover-capable AGI, since the current path predicts modest takeoff speeds. A US-China agreement to prevent AGI proliferation is rarely considered, but the US and Russia cooperated to slow the proliferation of nuclear weapons in a historical situation with similar incentives.

Problem 4: unpredictable effects of mixed training targets

Current systems effectively have a mixed alignment target that includes following instructions and refusing harmful instructions from users (as well as training for solving hard problems and predicting the pretraining dataset, either of which could induce other goals). If LLMs trained similarly lead to AGI, we should anticipate a similar mix of alignment targets. Continuing to train for refusing unauthorized or harmful instructions seems likely, since it could help deal with the problems of jailbreaking or impersonating a Principal. There are at least Seven sources of goals in LLM agents, and those goals will compete for control of the system.

This is potentially a big problem for IF as an alignment target, because LLM AGI will have memory, and memory changes alignment. Specifically, a system that can reason about how to interpret its goals and remember its conclusions is effectively changing its alignment. If it has multiple conflicting initial goals, the result of this alignment change is even more unpredictable. If IF were the only target of alignment training in its starting state, it seems likely this strong “center of gravity” would guide it to adopt IF as a reflectively stable goal. If it has multiple goals (e.g., following instructions in some cases, refusing instructions for ethical reasons in other cases, and RL for various criteria), its evolution and ultimate alignment seems much less easy to predict.

Memory/learning from experience seems so useful that it will likely be incorporated into really useful AI. And this seems to make some change in functional alignment inevitable. Thus, it seems pretty important to work through how alignment would change in such a system. This problem is not unique to IF as an alignment target, but it is highly relevant for the likely implementation that mixes IF and ethical or rule-based refusal training.
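A minimal sketch of the mechanism, under the assumption that the system writes its conclusions about how to interpret its goals into persistent memory and conditions on them later; the agent loop and memory store are illustrative placeholders:

```python
# A minimal sketch of why memory changes effective alignment: conclusions the
# system reaches about how to interpret its goals persist and condition all
# later behavior. The memory store and agent loop are illustrative placeholders.

persistent_memory: list[str] = []   # survives across tasks and sessions

def run_task(instruction: str, model):
    # Earlier self-generated interpretations are loaded alongside the trained-in
    # goals, so the effective goal drifts as memory accumulates.
    context = "\n".join(persistent_memory + [instruction])
    response, new_conclusions = model(context)   # model also emits conclusions
    persistent_memory.extend(new_conclusions)    # e.g. "refusals outrank obedience"
    return response
```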

Implications: The Pragmatist's Gamble?

IF alignment has important unsolved problems.

But it still might be our best realistic bet. Instruction-following won't be the only training signal, but it might be enough if it's the strongest one. IF alignment is a relatively easy ask, since developers are likely to approximately pursue it even if they’re not terribly concerned with alignment. And it has substantial advantages over value alignment. An IF AGI could be instructed to be honest about its intentions and representations to the best of its abilities. It would thus be a useful aid in aligning itself and its successors (to the extent it is aligned to IF; deceptive alignment is still possible). It could also be instructed to "do what I mean and check" (DWIMAC), avoiding many of the literal genie problems anticipated in classical alignment thinking. Having an AGI predict likely outcomes of major requests before executing them would not be a foolproof measure against mistakes, but it would be useful in proportion to that AGI's capability for prediction.
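A minimal sketch of a DWIMAC loop, with placeholder functions standing in for model calls and a Principal interface; the point is just that the interpretation and predicted outcomes are surfaced to the Principal before anything is executed:

```python
# A minimal sketch of a "do what I mean and check" (DWIMAC) loop. The helper
# callables stand in for model calls and a Principal interface; they are
# illustrative placeholders, not a real API.

def dwimac_execute(request: str, interpret, predict_outcomes, confirm, execute):
    """State the interpretation and predicted outcomes of a major request, and
    act only after the Principal confirms both."""
    plan = interpret(request)           # "here is what I think you mean"
    forecast = predict_outcomes(plan)   # best-effort prediction of consequences
    if not confirm(plan, forecast):     # Principal reviews before anything runs
        return "Aborted: Principal did not confirm the interpretation."
    return execute(plan)
```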

Max Harms' proposal in his definitive sequence on corrigibility, CAST: Corrigibility as Singular Target, is probably a better alignment target, if we got to choose. It avoids problems 1 and 4, but not 2 and 3. But it doesn’t seem likely that the alignment community is going to get much say in the alignment target for the first AGIs. CAST faces the significant practical hurdle of convincing developers to adopt a different training goal than the default instruction-following/refusal/performance paradigm they're already following. That’s why I’m focusing on IF as a sort of poor man’s corrigibility.

I think the most likely path to survival and flourishing is using Intent alignment as a stepping-stone to value alignment. In this scenario, IF alignment only needs to be good enough to use IF AGI to solve value alignment. Launching AGI with an imperfect alignment scheme would be quite a gamble. But it might be a gamble the key players feel compelled to take. Those building AGI, and the governments that will likely soon be involved, see allowing someone else to launch AGI first as an existential risk for their values and goals. So, figuring out how to make this gamble safer, or show clearly how dangerous it is, seems like important work.


