Putting up Bumpers

 

This article discusses a strategy for addressing the risks of early artificial general intelligence (AGI) systems, called "Putting up Bumpers." The core of the strategy is to build multiple largely independent lines of defense that detect and correct potential misalignment in a model. The author argues that through pretraining, finetuning, multiple alignment audits, and continuous post-deployment monitoring, the risk of AGI causing severe harm can be substantially reduced even before the alignment problem is fully solved. The approach is like the bumpers in bowling: even a poorly aimed ball still reaches the pins.

🎯 Core strategy: "Putting up Bumpers" aims to establish multiple lines of defense that catch and correct misalignment in AGI systems in time, lowering risk even when the alignment problem is not fully solved.

⚙️ Implementation steps: pretraining; finetuning (using human preference data, Constitutional AI-style model-generated preference data, and more); multiple alignment audits (drawing on mechanistic interpretability, behavioral red-teaming, etc.); and continuous post-deployment monitoring.

⚠️ Key audits: running alignment audits repeatedly while the model is being built is critical. Methods such as mechanistic interpretability and prompting outside the human/assistant frame surface early signals of potential misalignment.

🔄 Iterative improvement: if signs of misalignment appear, rewind the finetuning process and make another attempt at alignment using what has been learned. Repeat until the system appears well aligned.

🧐 Continuous monitoring: after deployment, keep strict monitoring measures in place so that new signs of misalignment can be caught and addressed promptly, keeping the AGI system safe.

Published on April 23, 2025 4:05 PM GMT

tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.

To do this, we'd aim to build up many largely-independent lines of defense that allow us to catch and respond to early signs of misalignment. This would involve significantly improving and scaling up our use of tools like mechanistic interpretability audits, behavioral red-teaming, and early post-deployment monitoring. We believe that, even without further breakthroughs, this work can almost entirely mitigate the risk that we unwittingly put misaligned circa-human-expert-level agents in a position where they can cause severe harm.
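To make the value of largely independent layers concrete, here is a minimal Python sketch of the arithmetic behind layered defenses. The per-audit catch probabilities are made-up placeholders (the post gives no numbers); the point is only that if the layers fail roughly independently, the chance that misalignment evades all of them shrinks multiplicatively.

```python
# Illustrative arithmetic for layered, largely-independent defenses.
# The catch probabilities below are hypothetical placeholders, not measurements.
audit_catch_probability = {
    "mechanistic_interpretability_audit": 0.5,
    "behavioral_red_teaming": 0.6,
    "post_deployment_monitoring": 0.7,
}

p_all_layers_miss = 1.0
for layer, p_catch in audit_catch_probability.items():
    p_all_layers_miss *= 1.0 - p_catch  # each roughly independent layer must miss

print(f"P(misalignment evades every layer) = {p_all_layers_miss:.2f}")  # 0.5 * 0.4 * 0.3 = 0.06
```

The multiplicative shrinkage is exactly what breaks down if the layers share a blind spot, which is why the post stresses that the lines of defense should be largely independent.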

On this view, if we’re dealing with an early AGI system that has human-expert-level capabilities in at least some key domains, our approach to alignment might look something like this:

1. Pretraining: We start with a pretrained base model.

2. Finetuning: We attempt to fine-tune that model into a helpful, harmless, and honest agent using some combination of human preference data, Constitutional AI-style model-generated preference data, outcome rewards, and present-day forms of scalable oversight.

3. Audits as our Primary Bumpers: At several points during this process, we perform alignment audits on the system under construction, using many methods in parallel. We'd draw on mechanistic interpretability, prompting outside the human/assistant frame, etc.

4. Hitting the Bumpers: If we see signs of misalignment (perhaps warning signs for generalized reward-tampering or alignment-faking), we attempt to quickly, approximately identify the cause.

5. Bouncing Off: We rewind our finetuning process as far as is needed to make another attempt at aligning the model, taking advantage of what we've learned in the previous step.

6. Repeat: We repeat this process until we've developed a complete system that appears, as best we can tell, to be well aligned. Or, if we are repeatedly failing in consistent ways, we change plans and try to articulate as best we can why alignment doesn't seem tractable.

7. Post-Deployment Monitoring as Secondary Bumpers: We then begin rolling out the system in earnest, both for internal R&D and external use, but keep substantial monitoring measures in place that provide additional opportunities to catch misalignment. If we observe substantial warning signs for misalignment, we return to step 4.
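As a reading aid, here is a schematic Python sketch of the loop above. Every helper (finetune, run_alignment_audits, diagnose, deploy_with_monitoring) is a toy stand-in invented so the control flow runs end to end; none of this is Anthropic's actual training or auditing pipeline.

```python
import random

# Toy stand-ins for the processes in steps 1-7 above; they exist only so the
# control flow runs end to end and do not model any real pipeline.
def finetune(base_model, lessons):
    return {"base": base_model, "lessons": list(lessons)}

def run_alignment_audits(model):
    # Pretend each lesson learned from a previous failure reduces the chance
    # that the audits surface warning signs on the next attempt.
    p_warning = 0.8 / (1 + len(model["lessons"]))
    return ["warning sign"] if random.random() < p_warning else []

def diagnose(warnings):
    return f"approximate cause of: {warnings[0]}"

def deploy_with_monitoring(model):
    print("Deployed, with post-deployment monitoring kept in place.")

def bumpers_loop(base_model, max_attempts=10):
    """Schematic version of the bumpers loop: finetune, audit, bounce off, repeat."""
    lessons = []
    for _ in range(max_attempts):
        model = finetune(base_model, lessons)      # steps 1-2: pretrained base + finetuning
        warnings = run_alignment_audits(model)     # step 3: audits as primary bumpers
        if not warnings:
            deploy_with_monitoring(model)          # step 7: secondary bumpers after rollout
            return model
        lessons.append(diagnose(warnings))         # step 4: hitting the bumpers
        # step 5: bouncing off -- rewind and retry, carrying the lessons forward
    # step 6, failure branch: consistent failures => argue why alignment looks intractable
    raise RuntimeError("Repeated consistent failures; alignment does not look tractable here.")

bumpers_loop("pretrained-base-model")
```

The design point the post emphasizes is that step 7 feeds back into step 4: deployment is not a terminal state but another bumper, and warning signs observed in the field send the process back into the loop.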

The hypothesis behind the Bumpers approach is that, on short timelines (i) this process is fairly likely to converge, after a reasonably small number of iterations, at an end state in which we are no longer observing warning signs for misalignment and (ii) if it converges, the resulting model is very likely to be aligned in the relevant sense.

We are not certain of this hypothesis, but we think it’s important to consider. It has already motivated major parts of Anthropic’s safety research portfolio and we’re hiring across several teams—two of them new—to build out work along these lines.

(Continued in linked post. Caveat: I'm jumping into a lot of direct work on this and may be very slow to engage with comments.)


