Alignment Proposal: Adversarially Robust Augmentation and Distillation

Published on May 25, 2025 12:58 PM GMT

Epistemic Status: Over years of reading alignment plans and studying agent foundations, this is my first serious attempt to formulate an alignment research program that I (Cole Wyeth) have not been able to find any critical flaws in. It is far from a complete solution, but I think it is a meaningful decomposition of the problem into modular pieces that can be addressed by technical means - that is, it seems to solve many of the philosophical barriers to AI alignment. I have attempted to make the necessary assumptions clear throughout. The main reason that I am excited about this plan is that the assumptions seem acceptable to both agent foundations researchers and ML engineers; that is, I do not believe there are any naive assumptions about the nature of intelligence OR any computationally intractable obstacles to implementation. This program (tentatively ARAD := Adversarially Robust Augmentation and Distillation) owes most of its core ideas to other researchers - in fact, I think it's probably a good sign that it seems superficially like a reinvention of several different existing ideas, while hopefully succeeding as a synthesis that overcomes their limitations. Superimitation seems like the closest idea to me, but (at least) suggests a significantly different implementation. ARAD also owes a lot to HCH and other forms of IDA, but those approaches seem highly unstable and doomed to me; ARAD attempts to stabilize the amplification step using ideas from Abram Demski (who helped me remove some epicycles and formulate the plan in its current form) and @Scott Garrabrant about alignment as "becoming smarter." It overcomes related issues I see with (Bayesian approaches to) logical uncertainty by taking advantage of high-level ideas reminiscent of Infra-Bayesianism. Still, this plan has not been subjected to serious red-teaming yet and is probably wrong.

("I" usually refers to Cole, "we" to Abram and Cole)

We have been considering whether alignment is reducible to increasing intelligence without changing values. It seems to be possible for a person to learn about normative decision theory and become a better decision maker without becoming unaligned with their future self. More generally, we usually trust that we would benefit from having more time to think. If we knew how this works, we would have made significant progress on the alignment problem.

Human intelligence can only be scaled so far on our current hardware, so we ultimately want to build an aligned intelligence that runs most of its cognition on a computer. This means alignment presumably requires a further problem to be resolved; at some point a transition from biological to computer hardware needs to take place.

I argue that overcoming these two types of problem is also sufficient for solving the AI alignment problem, and further that each has a potentially tractable technical path to success. In the context of AI alignment, the safe intelligence increase problem manifests as a principal accepting advice from a slightly more intelligent advisor. The biological -> computer transition problem should be handled with a form of imitation learning, which Michael K. Cohen and Marcus Hutter have argued is probably existentially safe. I will not repeat their argument - if this is a crux for you, I suggest reading the linked paper before continuing. They discuss some remaining obstacles here. If you are not convinced, most of our proposal can be performed without imitation learning by paying an additional "alignment tax." 

The rest of this essay takes the following form. First, I will outline at a high level how this alignment scheme can be implemented starting from an initial agent and scaling to a much higher intelligence level. Then, I will discuss the technical problems that need to be solved and why I believe they are tractable; this section should be of interest to skeptical agent foundations researchers. Finally, I will discuss the practical implementation details - in particular, how this scheme should work on the current paradigm. Skeptical "prosaic" alignment researchers (e.g. engineers working on alignment teams) may wish to skip ahead to this section before the technical section, or otherwise trust that this essay will eventually tie back to the current paradigm. 

Isn't this just debate?

The type of protocol I have in mind vaguely resembles debate in inspiration, but my technical approach in the next section seems to be significantly different. In particular, it does not (necessarily) include any debates between AIs. In fact, I think that allowing a human to read a debate between superintelligences is an insane idea. For that reason I have not read very much about debate, which actually means it is possible that there is some (perhaps poorly named) "debate" proposal equivalent to ours. 

High-level Implementation

In this section I will outline at a high level how this alignment scheme can be implemented. I will address some immediate obvious objections as they arise, but will mostly leave a discussion of the serious technical holes to the next section.

At the first step, we have an agent with endorsed values (for instance, a human, or possibly some kind of committee of humans, though that seems somewhat harder). This agent takes the role of principal.

Next, an agent slightly smarter than the principal is constructed.

The principal and advisor interact through a carefully constructed protocol. The protocol may simply be a set of rules that the principal follows to choose whether and how to consult the advisor's advice; alternatively, parts of the protocol may be automated. The protocol is based on a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor.

The advisor is optimized to provide usable advice to the principal.

The principal-advisor pair forms a (coalitional) agent which, if the previous steps succeed, can be understood as ~perfectly aligned with the principal. The actions of this agent are recorded, and distilled through imitation learning into a successor.

This process is iterated with the successor as principal. 
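To fix intuitions, here is a minimal sketch of this loop in Python. Everything in it is a placeholder invented for illustration (the names arad_iteration, distill, the verify callback, and the table-lookup "imitator"); the point is only the shape of the augment-then-distill cycle, not a proposed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

Policy = Callable[[str], str]          # maps a task/query to an action

@dataclass
class Transcript:
    """One recorded principal-advisor interaction, kept for distillation."""
    task: str
    advice: str
    accepted: bool
    action: str

def arad_iteration(principal: Policy, advisor: Policy,
                   verify: Callable[[str, str], bool],
                   tasks: List[str]) -> List[Transcript]:
    """Augmentation step: the principal consults the advisor through a restrictive
    protocol and acts on advice only when it can verify that the advice is usable."""
    transcripts = []
    for task in tasks:
        advice = advisor(task)
        accepted = verify(task, advice)                    # verification assumed cheaper than generation
        action = advice if accepted else principal(task)   # unverifiable advice is simply ignored
        transcripts.append(Transcript(task, advice, accepted, action))
    return transcripts

def distill(transcripts: List[Transcript]) -> Policy:
    """Distillation step: a crude stand-in for imitation learning on the coalition's behavior."""
    table = {t.task: t.action for t in transcripts}
    return lambda task: table.get(task, "defer")           # a real imitator would generalize

def arad(principal: Policy, advisors: List[Policy],
         verify: Callable[[str, str], bool], tasks: List[str]) -> Policy:
    """Iterate: the distilled coalition becomes the next principal."""
    for advisor in advisors:                               # each advisor only slightly smarter
        transcripts = arad_iteration(principal, advisor, verify, tasks)
        principal = distill(transcripts)
    return principal
```

In a real implementation, verify would be the (human-followed or partially automated) protocol itself, and distill would be a serious imitation learning pipeline rather than a lookup table.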

Why not just run a distilled copy at 10x speed?

Assuming that we (un-augmented humans) could create a perfect distilled copy of a human researcher that remains faithful when run for a long time, I would consider this a very promising path! However, it would pay a substantial "alignment tax" against the smartest models, which could probably run at the same speed while being smarter (though perhaps differentially smarter across capabilities). In fact, the alignment tax is so significant that it is not even clear this approach scales to superintelligence - we can only hope it yields a quick outsourced solution to the alignment problem. Overall, I worry that this path is too slow. There is also a chance that imitation learning is not as safe as I hope - in that case, we would also want to use the protocol when interacting with the distilled copies.

ARAD offers something more like a qualitative intelligence increase which comes before the critical hand-off to a distilled copy. See the implementation details for further discussion of the advantages.

Fortunately, imitation learning can be pursued in parallel to constructing the protocol, so it is not really necessary to choose between these two aspects of the plan. 

Technical Justification 

The construction of this protocol is an open technical problem, but I expect it is possible. Intuitively, I believe that I could receive advice from an equally smart person in a box running 1.5x faster than I am, and by approaching that advice with care, avoid being seriously misled by bad advice - verification is easier than generation, and I am free to ignore any advice that I can't usefully verify. At the same time, I could benefit from useful advice (in most cases, some trade is possible - the advisor is not incentivized to stay completely silent as long as we have some common interests). I believe that there are mathematical facts about agency which I could use to increase the safe intelligence ratio.

I am not claiming that receiving advice from a smarter agent is generically safe. I am claiming that a wise enough advice-receiver can safely consult a slightly smarter advisor. The degree of wisdom probably increases the maximum safe intelligence gap (on the log scale). 

Motivating Examples: 

A) Consider a decision problem in which you can either take a constant payout of 6 dollars or accept a gamble in which one of 10 numbers $y_1, \dots, y_{10}$ is selected with uniform probability. You receive a payout of 10 dollars if the (say) third digit of $f^{-1}(y_i)$ is 1, where $f$ is a one-way function. This gamble does not seem to be worth taking, but if an advisor provides inverses for all ten numbers and they all pay out, you should take it. However, if the advisor only provides inverses for some of the numbers, you may not want to accept the gamble even if those all pay out, because the advisor may have adversarially selected which inverses to reveal, and accepting may harm your performance.

B) Now imagine that $y_1, \dots, y_6$ will be inverted with $f$, which is modestly hard to invert, and $y_7, \dots, y_{10}$ will be inverted with $g$, which you can "prove" (under some cryptographic assumptions) is very hard to invert. If the advisor demonstrates that $y_1, \dots, y_6$ pay out, you should still accept the game because in expectation it is worth about 6.4 dollars (the advisor provably cannot tell whether the $g$-numbers pay out, so it could not have adversarially selected which inverses to reveal, and each of $y_7, \dots, y_{10}$ still pays out with prior probability 1/10).

Example A demonstrates that in adversarial situations, pessimism about unknown computational facts can be appropriate. Example B demonstrates that computational bounds on the advisor can be useful, because they allow "computational uncertainty" to rule out adversarial selection.
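A back-of-the-envelope computation makes the two decision rules concrete. The split between verified and unverified numbers below is purely illustrative (five revealed inverses for Example A, six $f$-numbers and four $g$-numbers for Example B), and gamble_value is a name invented for this sketch.

```python
CONSTANT_PAYOUT = 6.0   # dollars, the safe option
GAMBLE_PAYOUT = 10.0    # dollars, paid when the third digit of the preimage is 1
PRIOR_PAY = 0.1         # prior probability that a given number pays out
N = 10                  # numbers in the gamble

def gamble_value(verified: int, unverified_prob: float) -> float:
    """Expected value when `verified` numbers are known to pay out and the rest
    pay out with probability `unverified_prob` (0.0 = worst-case pessimism)."""
    return (verified * GAMBLE_PAYOUT
            + (N - verified) * unverified_prob * GAMBLE_PAYOUT) / N

# Example A: the advisor may have adversarially selected which inverses to reveal,
# so the unknown computational facts get worst-case (pessimistic) treatment.
print(gamble_value(5, 0.0))        # 5.0 < 6.0 -> keep the constant payout
# Example B: the unrevealed numbers are provably too hard for the advisor to invert,
# so adversarial selection is ruled out and the honest prior applies.
print(gamble_value(6, PRIOR_PAY))  # 6.4 > 6.0 -> accept the gamble
```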

Cryptographic Pessimism. We are calling the strategy of being pessimistic about potentially adversarial computational facts "cryptographic pessimism." It probably already has some associated concept in Infra-Bayesianism (Vanessa suggests metacognitive agency is related).

Pessimism or Bayesianism for Computational Uncertainty?

Common approaches to computational uncertainty such as the Modified Demski Prior, Garrabrant induction, and Bayesian Logical Induction have a Bayesian flavor/motivation. I am very interested in belief distributions that handle computational uncertainty in a coherent Bayesian way that caches out to pessimism exactly when appropriate. However, I am not convinced that the Bayesian virtue of coherence = self-consistency is necessarily an appropriate target for the type of dynamic reasoning that must be performed under computational uncertainty, and I have argued in the past that existing UDTs take consistency too far.

We would like to go beyond simple examples like these, and benefit from advice beyond rigorous mathematical proofs[1] (perhaps including heuristic arguments). The more advice that can be safely used, the greater the effective cognitive enhancement that can be safely achieved at a given intelligence gap.

How can an advisor enable safe communication with an even smarter advisor? This is not demonstrated by the simplified examples above. Ideally, we would like to take advantage of the limited intelligence of the slightly smarter advisor to do things we could not safely do with the much smarter advisor, which also expand the envelope of things we can do with the much smarter advisor. For instance, it is not generically possible to bootstrap by using the slightly smarter advisor to provide a complete proof that a certain type of advice is safe to take, because the much smarter advisor could also produce that proof and then provide the advice. Here are some more examples that illustrate the principle more clearly, from least to most ambitious.

1: Simply get practice taking advantage of a formal proof machine by interacting with the slightly smarter advisor. 

2: Use the slightly smarter advisor to look for security vulnerabilities in the software implementing the interaction protocol with the smarter advisor. For example, given a trusted proof verifier, prove that a more efficient proof verifier is secure (a minimal version of such a verifier-mediated interface is sketched after this list). We may also eventually wish to provide higher bandwidth interaction between advisors and principal(s), which could require harder security proofs. This may also include red-teaming the protocol and searching for side-channel attacks.

3: It should be possible for advisors only slightly smarter than humans to propose a protocol (under some constraints), and then themselves prove that protocol safe. I would not recommend interacting with superintelligent agents this way, since they may e.g. propose a protocol that is formally safe but expected to allow side-channel attacks.

4: We may be able to extract heuristic arguments from slightly smarter advisors. If these arguments are strong enough, they may cause us to trust certain protocols we cannot formally verify. This seems to be an open research question.

5: We may be able to accept normative rationality advice from slightly smarter advisors. By bootstrapping our rationality, we may be better prepared for many problems including designing the next protocol. Doing this properly seems to be bottlenecked on (@Wei Dai style) meta-philosophy.
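To make the restricted interface of items 1 and 2 (and footnote [1]) concrete, here is a sketch in which the principal only ever consumes a trusted verifier's verdict on statements it chose itself. The Statement and trusted_verify names, and the advisor.prove API, are assumptions invented for this sketch rather than a worked-out design.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Statement:
    """A formal statement chosen by the principal, never by the advisor."""
    text: str

def trusted_verify(statement: Statement, proof: str) -> bool:
    """Placeholder for a small, audited proof checker trusted by the principal."""
    raise NotImplementedError

def query_advisor(statement: Statement, advisor) -> Optional[bool]:
    """Restricted interface: the advisor may return a purported proof of the requested
    statement or of its negation, or nothing at all. The principal never reads the
    proof text as prose; only the verifier's verdict is consumed, so the advisor's
    main remaining degree of freedom is strategic failure (roughly one bit)."""
    response: Optional[Tuple[bool, str]] = advisor.prove(statement)  # hypothetical advisor API
    if response is None:
        return None                      # the advisor failed (or declined) to produce a proof
    claim, proof = response              # claim: True for the statement, False for its negation
    target = statement if claim else Statement(f"not ({statement.text})")
    if trusted_verify(target, proof):
        return claim
    return None                          # unverifiable output is discarded unread
```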

In practice, the protocol implementation will need to be somewhat approximate (it will not perfectly instantiate the mathematical model it is based on). A good model is usually robust to reasonable heuristic implementations. In this case, additional care must be taken because any departures from theory may be actively exploited by the advisor.

Isn't this just passing the buck to AI on alignment? Not really. I don't expect a complete, autonomously invented alignment plan to be in the set of advice that a wise principal accepts. We may want one of the successor agents to solve alignment, but this is not actually the unique win condition; for instance, a successor may instead (help to) perform a "pivotal act" that substantially delays or prevents further progress towards strong superintelligence for the foreseeable future. Or, if the alignment tax remains low enough, aligned successors may be able to maintain a multi-polar equilibrium with recklessly implemented superintelligences (though I think this is a long shot).

What if imitation learning is not faithful? I think this is a "skill issue." I am tempted to leave this problem to the top machine learning engineers at frontier labs, with occasional nudges to be very careful from myself and other agent foundations specialists. I would particularly like to see a lab focused specifically on safe imitation learning. However, I cannot rule out the possibility that there may be serious and intractable inner alignment obstacles, so theoretical investigation is also valuable here.

I will focus on the next objection, which assumes that distilled copies are faithful, and asks whether this process actually limits to something good.

What if the nested principal/advisor structure is inherently not safe? First, what does it limit to? An "inner" principal consults an advisor, and together they decide when to consult an even smarter advisor, and together they decide when to consult an even smarter advisor, and together they... (so on).
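Schematically, the limiting structure might be drawn as follows; the Layer and decide names and the crude escalation rule are invented for illustration, and the real decision of when to open an outer channel would itself be governed by the inner layers' protocol.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Layer:
    """One level of the nested structure: protocol-mediated access to this layer's advisor."""
    consult: Callable[[str], Optional[str]]     # returns verified advice, or None
    should_escalate: Callable[[str], bool]      # the inner coalition's decision rule
    outer: Optional["Layer"] = None             # the next, smarter layer (if any)

def decide(query: str, layer: Optional[Layer], base_policy: Callable[[str], str]) -> str:
    """The inner principal acts on its own policy unless the current layer yields
    verified advice; outer layers are consulted only when the inner coalition
    explicitly chooses to open that (more dangerous) channel."""
    if layer is None:
        return base_policy(query)
    advice = layer.consult(query)
    if advice is not None:
        return advice
    if layer.outer is not None and layer.should_escalate(query):
        return decide(query, layer.outer, base_policy)   # escalate, through the outer protocol
    return base_policy(query)
```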

I think this is safe, assuming the inner principal is very careful, very clever, and very "wise" in the sense of knowing and fully internalizing the mathematically rigorous protocol. And of course it goes without saying that the principal has to start out as a good person - but this requirement can be weakened by implementing this plan with many different principals in parallel (I don't particularly want to concentrate power in a singleton). We basically want to start with a bunch of brilliant warrior monks.

Potential misconceptions. I think knee-jerk horror about this scheme may arise from two incorrect intuitive pictures.

    One might imagine that the principal doesn't have time to carefully consider which advice to take from the much smarter (far-outer) advisors. The distillation step basically takes care of this - the principal ends up running very fast relative to the advisors. The smartest advisors are only consulted occasionally, and perhaps are only running when consulted. I think this is all a natural result of distilling such a hierarchical scheme, not an unnatural implementation detail, but there also seem to be reasonable mechanisms to directly enforce it.

    One might imagine that the very smart advisors may find "side channels" and escape or subvert the protocol. However, it's important to remember that the last layer in the protocol is constructed with the help of the second smartest successor, and should be expected to be much more secure than the first few layers. Also (for instance) the second smartest advisor is always consulted about how (and when) to interact with the smartest advisor. This means that the principal should never be foolish enough to read plaintext messages from the outer advisors. Any messages from far-out advisors should be carefully parsed into the expected form by secure programs written with the aid of less far-out advisors (a toy parser of this kind is sketched below), and may be (subjectively) very infrequent.
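As a toy version of "carefully parsed into the expected form by secure programs": suppose far-out advisors may only answer pre-registered multiple-choice queries. The wire format, field names, and bit budget below are all invented for illustration.

```python
import re
from typing import Optional

# Hypothetical wire format: far-out advisors may only answer pre-registered
# multiple-choice queries; anything else is dropped before a human sees it.
ALLOWED_PATTERN = re.compile(r"ANSWER (?P<query_id>[0-9]{1,6}) OPTION (?P<option>[A-D])")

def parse_outer_message(raw: bytes, expected_query_id: int) -> Optional[str]:
    """Parse a message from an outer advisor into the expected form.
    Free-form plaintext never reaches the principal; only a constrained,
    machine-checked field does."""
    try:
        text = raw.decode("ascii")               # reject anything outside ASCII
    except UnicodeDecodeError:
        return None
    match = ALLOWED_PATTERN.fullmatch(text)
    if match is None or int(match.group("query_id")) != expected_query_id:
        return None                              # wrong shape or wrong query: discard
    return match.group("option")                 # at most ~2 bits reach the principal
```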

Collusion between advisors. Inner advisors may attempt (causal or acausal) trade with outer advisors by attempting to manipulate the principal into communicating with outer advisors capable of hacking the principal's brain, including by suggesting convincing but insecure outer-layer protocols or forcing the principal into states where there seems to be no better option. This is not an insurmountable problem. It just means that advice on outer-layer protocols is particularly prone to adversarial attacks. This probably leads to some interesting (and hard) mathematics around properly reasoning under Vingean uncertainty about advisors which are themselves reasoning under Vingean uncertainty; however, I suspect there are highly conservative inner-layer protocols that avoid the problem by enforcing very strict requirements on the types of protocol design advice which will even be read, let alone implemented. Naturally, this costs a heavy tax on iteration rate.

It's still better not to build AGI anytime soon. I think there is some risk that the (distilled) principal goes insane, either from unfaithful distillation, the unnatural simulation conditions / isolation, or reading the wrong advice and having their brain hacked. I suggest that the base principal should ideally be a steely-eyed missile man. I estimate around a 25% chance of catastrophic failure even if everything is done right. However, this catastrophic failure seems reasonably likely to happen close to human level, and may be a recoverable situation.

Why you may want to work on this technical agenda. This is a theoretical research program that feels like it directly engages with AI safety bottlenecks. I see a very clear path to applying progress on this program to bootstrapping aligned agents. The theory seems fairly tractable, since basic results can probably be obtained by porting theorems of computational complexity to sequential decision theory. At least to me, the problem seems interesting for similar reasons as e.g. UDT and logical induction, but like much less of a nerd-snipe. This also means that some agent foundations researchers can probably contribute to this program simply by reorienting and applying the technical tools they are already developing.

Is this also a practical theory of becoming more rational for humans?

We came up with the ideas behind this research program by considering the close connection between the alignment problem and becoming more rational. However, it is intended as an alignment plan, not a practical theory of increasing (un-augmented) human rationality. The former is about robustly accepting adversarial advice and the latter is about self-improvement. Apply enough optimization pressure to either problem and the tails will inevitably come apart, so that sophisticated results about one will become increasingly irrelevant to the other.

With that caution in mind, I still believe that human self-improvement has some structural similarities to the "adversarially robust augmentation" part of ARAD. In particular, a human takes on the role of both principal and advisor when deciding whether to adopt some carefully considered rationality technique. Here baseline intuition can be thought of as the principal. I think this process is actually surprisingly adversarial - humans search over many ideas before finding the ones that seem most convincing to us, which is an optimization process that can exploit our biases. It's easy to fool yourself. I've previously pointed out a related phenomenon where bizarre counterexamples crop up a lot when doing mathematics because we construct them while trying to force proofs through. Generally, I think adversariality pops up in a lot of places, explaining why pessimistic decision rules are popular (in fact, this may explain pessimism as a corollary of a sufficiently rich / metacognitive Bayesianism, not a special exception).

Implementation Details

Constructing an agent slightly smarter than the principal. It is possible to undershoot and/or overshoot. In practice, early advisors may be humans with LLM access (because of differential capabilities progress, it may be hard to train an LLM which is slightly but strictly smarter than a human). Alternatively, we may simply allow the early advisors to lag behind the principal in some respects; this does not seem to introduce any new obstacles. There is a much greater danger of accidentally training the advisor well past the principal's level instead of only slightly past it. For this reason, rigorous capabilities evaluations are necessary. This can be complemented with improved theoretical understanding of learning (speed) in neural networks and interpretability tools.
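As one illustration of what such a capabilities evaluation gate might check, here is a sketch that allows the advisor to lag on some capabilities but rejects overshoot on any of them. The scores, capability names, and the 0.1 threshold are placeholders, not calibrated values.

```python
from typing import Dict

def overshoot_check(principal_scores: Dict[str, float],
                    advisor_scores: Dict[str, float],
                    max_gap: float = 0.1) -> bool:
    """Illustrative deployment gate: the candidate advisor may lag the principal on
    some capabilities, but must not exceed it by more than `max_gap` on any tracked
    capability. All thresholds here are stand-ins for real, calibrated evals."""
    for capability, advisor_score in advisor_scores.items():
        principal_score = principal_scores.get(capability)
        if principal_score is None:
            return False    # the advisor shows a capability the principal was never measured on
        if advisor_score - principal_score > max_gap:
            return False    # overshoot: too large a gap to advise safely
    return True
```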

Training the advisor to provide useful advice. The optimization target is the advice usable by the principal. For instance, in the process of constructing the next protocol, the human may ask the advisor for certain rigorous correctness proofs. The advisor can attempt to manipulate the human by failing to produce proofs (or disproofs) of strategically chosen statements. The initial protocol must take care of this problem by, for instance, requesting proofs for a carefully chosen sequence of statements. The advisor is optimized not for its generic success rate at producing proofs, but for its success rate at producing proofs that the human can safely use. This means that we should optimize the advisor by optimizing a lower bound on its usefulness. 
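One way to read "optimizing a lower bound on its usefulness" as a training signal: score the advisor only on requested statements for which it returns a proof that an independent verifier accepts. The function and parameter names below are hypothetical, and the proof setting stands in for advice more generally.

```python
from typing import Callable, List

def usable_advice_reward(advisor_answer: Callable[[str], str],
                         requested_statements: List[str],
                         verifier: Callable[[str, str], bool]) -> float:
    """Illustrative outer objective: the advisor is scored only on advice the
    principal can safely use, here the fraction of *requested* statements for
    which it returns a proof that an independent verifier accepts. Withholding
    or garbling proofs directly lowers this lower bound on usefulness."""
    if not requested_statements:
        return 0.0
    usable = sum(
        1 for statement in requested_statements
        if verifier(statement, advisor_answer(statement))
    )
    return usable / len(requested_statements)
```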

By strategically failing to produce proofs during the protocol, the advisor would be sandbagging. This is a capabilities elicitation problem. However, the threat probably arises earlier, during training - we would prefer that the advisor's learned behavior is already to search as vigorously as possible for proofs. If it chooses to withhold certain proofs (or restrict its search to a carefully chosen subset of proof-space) it would necessarily sacrifice performance on its outer optimization target (for example, Abram mentions gradient hacking). This means it is an inner alignment problem. In my opinion, this is also a capabilities problem. Machine learning engineers spend their whole careers trying to optimize outer objectives as effectively as possible, and techniques like dropout and perhaps weight decay can be viewed as fighting "inner optimizers" that do not maximally contribute to increasing performance by "escaping local minima." Assuming that the principal makes conservative use of the advisor and avoids serious worst-case harm, I think that useful advisors will be iteratively constructed through normal ML engineering.

With that said, an ambitious goal of this program would be to tailor the training method for advisors to incentivize safe advice, which is another road to alignment.   

  

 

  1. ^

    To remove potentially exploitable degrees of freedom, advisors should by default only be able to prove (or disprove) requested results, so that the only degree of freedom is failure to produce a proof (1 bit). In later stages, with theoretical justification, advisors may be allowed to choose which results to prove under less restrictive constraints. 



