The Sweet Lesson: AI Safety Should Scale With Compute

This article examines a key question in AI safety research: how increasing compute can be used to make AI systems safer. It focuses on several research directions, including deliberative alignment, AI control, debate, Bengio's Scientist AI, and Anthropic- and ARC-style interpretability, all of which aim to improve our understanding and control of AI behavior by spending more inference-time compute or by improving the underlying optimization. The article stresses the value of combining theoretical analysis with empirical validation, reflects on what the idealized limits of these procedures look like, and takes as its ultimate goal the construction of safer, more reliable AI systems.

💡 Deliberative alignment: combine chain-of-thought reasoning with constitutional-style AI, using inference-time compute to improve safety. For example, Guan et al. strengthen safety by having the model consult a constitution during its reasoning process.

🕹️ AI control: design adversarial control protocols in which red teams and blue teams play longer-running games to obtain more reliable estimates of whether scheming could succeed during deployment. The adversarial setup is what improves the safety evaluation.

🗣️ Debate: design debate protocols in which longer, deeper debates between AI assistants increase confidence in their honesty or other desired properties. The debate protocols proposed by Irving et al. are intended to encourage honesty and reliability.

🧪 Bengio's Scientist AI: develop safety guardrails whose estimates of catastrophic risk become more reliable as inference-time compute increases. In the approach proposed by Bengio et al., predictions converge toward the correct answer as more compute is spent.

🔍 Interpretability: develop interpretability tools such as SAEs, which first learn the SAE features that matter most for minimizing reconstruction loss. Work by Templeton et al. and Gross et al. focuses on making models more interpretable so that their behavior can be better understood and controlled (a minimal sketch of the SAE objective appears after this list).
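
To make the interpretability point concrete, here is a minimal sketch of a sparse autoencoder trained against a reconstruction loss plus an L1 sparsity penalty. It is a toy stand-in rather than the actual setups of Templeton et al. or Gross et al.; the module names, dimensions, and the l1_coeff value are illustrative assumptions.

```python
# Minimal sparse-autoencoder (SAE) sketch: reconstruction loss + L1 sparsity.
# Illustrative only; not the Templeton et al. / Gross et al. implementations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction term: how well the learned features explain the activations.
    recon_loss = (reconstruction - activations).pow(2).mean()
    # Sparsity term: encourages each activation to be explained by few features.
    sparsity_loss = features.abs().mean()
    return recon_loss + l1_coeff * sparsity_loss

# Toy usage: random tensors stand in for a real network's activations.
sae = SparseAutoencoder(d_model=64, d_features=512)
acts = torch.randn(32, 64)
feats, recon = sae(acts)
loss = sae_loss(acts, feats, recon)
loss.backward()
```

The point relevant to the post is that the objective is explicit: the features that matter most for driving down reconstruction loss are the ones learned first as more training compute is spent.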

Published on May 5, 2025 7:03 PM GMT

A corollary of Sutton's Bitter Lesson is that solutions to AI safety should scale with compute. Let's consider a few examples of research directions that aim at this property, including deliberative alignment, AI control, debate, Bengio's Scientist AI, and Anthropic- and ARC-style interpretability:

[O]ur proposed method has the advantage that, with more and more compute, it converges to the correct prediction . . . . In other words, more computation means better and more trustworthy answers[.] — Bengio et al. (2025)
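
As a toy rendering of the "more computation means better and more trustworthy answers" property, the snippet below estimates a risk probability by Monte Carlo sampling, so the estimate's error shrinks roughly as 1/sqrt(budget). This is not Bengio et al.'s amortized Bayesian oracle (see footnote 1); the true_risk value and the sampling setup are invented purely for illustration.

```python
# Toy illustration of an estimator that converges as compute grows.
# NOT Bengio et al.'s method; just a Monte Carlo stand-in for the
# "more compute -> better, more trustworthy answer" property.
import random

def risk_estimate(num_samples: int, true_risk: float = 0.03, seed: int = 0) -> float:
    """Estimate P(harm) from sampled rollouts; error shrinks ~ 1/sqrt(num_samples)."""
    rng = random.Random(seed)
    harmful = sum(rng.random() < true_risk for _ in range(num_samples))
    return harmful / num_samples

for budget in [100, 10_000, 1_000_000]:
    est = risk_estimate(budget)
    print(f"compute budget = {budget:>9,} samples -> estimated risk = {est:.4f}")
```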

In ARC's current thinking, the plan is not to have a clear line of separation between things we need to explain and things we don't. Instead, the loss function of explanation quality should capture how important various things are to explain, and the explanation-finding algorithm is given a certain compute budget to build up the explanation of the model behavior by bits and pieces. — Matolcsi (2025)
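
One way to caricature the compute-budgeted, piece-by-piece picture in the quote above is as an anytime greedy loop: keep adding whichever explanation piece buys the most importance per unit of compute until the budget runs out. The sketch below is my own toy rendering with invented pieces, importances, and costs, not ARC's actual loss function or search algorithm.

```python
# Toy "anytime" explanation builder: spend a compute budget on the pieces that
# most reduce an importance-weighted explanation loss. Invented setup, not ARC's.
from dataclasses import dataclass

@dataclass
class Piece:
    name: str
    importance: float  # how important this behavior is to explain
    cost: int          # compute needed to incorporate it

def build_explanation(pieces: list[Piece], budget: int) -> list[str]:
    chosen = []
    remaining = budget
    # Greedy: best importance-per-unit-compute first.
    for piece in sorted(pieces, key=lambda p: p.importance / p.cost, reverse=True):
        if piece.cost <= remaining:
            chosen.append(piece.name)
            remaining -= piece.cost
    return chosen

candidates = [
    Piece("refusal circuit", importance=0.9, cost=40),
    Piece("date formatting", importance=0.1, cost=5),
    Piece("deceptive planning head", importance=0.95, cost=70),
]
print(build_explanation(candidates, budget=80))
```

A larger budget lets the loop pick up the expensive but important pieces it had to skip, which is the sense in which explanation quality scales with compute.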

For these procedures to work, we want the ideal limits of these procedures to be safe. In a recent talk, Irving suggests we should understand the role of theory as being about analyzing these limiting properties (assuming sufficient resources like compute, data, and potentially simplifying assumptions) and the role of empirics as being about checking if the conditions required by the theory seem to hold in practice (e.g., has learning converged? Are the assumptions reasonable?).[2]
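
On the empirical side of this division of labor, a question like "has learning converged?" can be operationalized as a mechanical check on training metrics. The function below is one generic example of such a check; the window size and tolerance are chosen arbitrarily for illustration.

```python
# Example of an empirical check for one of the theory's preconditions:
# "has learning converged?" read here as "has the loss stopped improving?"
import math

def has_converged(loss_history: list[float], window: int = 100, tol: float = 1e-3) -> bool:
    """True if the mean loss over the last window improved by less than tol
    relative to the window before it."""
    if len(loss_history) < 2 * window:
        return False
    prev = sum(loss_history[-2 * window:-window]) / window
    last = sum(loss_history[-window:]) / window
    return (prev - last) < tol

# Toy usage with a synthetic, plateauing loss curve.
losses = [math.exp(-0.01 * t) for t in range(1000)]
print(has_converged(losses))
```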

From this perspective, the role of theory is to answer questions about what these procedures converge to in their idealized limits, given sufficient resources.

Taking a step back, obtaining understanding and guarantees about idealized limits has always been the aim of theoretical approaches to AI safety.

The main update is that now we roughly know the kinds of systems that AGI might consist of, and this enables a different kind of theory strategy: instead of studying the ideals that emerge from a list of desiderata, you can study the limiting/convergence properties of the protocols we are currently using to develop and align AGI.

In practice, we probably need both approaches. As we continue to scale up compute, data, model size, and the RL feedback loop, we might find ourselves suddenly very close to the limits, wishing we had known better what lay in store.

  1. ^

    Bengio's proposal involves a very particular technical vision for using GFlowNets to implement an (approximate, amortized) Bayesian oracle that he believes will scale competitively (compared to the inference costs of the agent being monitored).

  2. ^

    Irving's talk is about scalable oversight, but the general framing applies much more generally.


