Published on July 1, 2025 4:52 AM GMT
This sequence draws from a position paper co-written with Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, Stan van Wingerden, George Wang, Zach Furman, Liam Carroll, Daniel Murfet. Thank you to Stan, Dan, and Simon for providing feedback on this post.
Alignment ≈ capabilities. As of 2025, there is essentially no difference between the methods we use to align models and the methods we use to make models more capable. Everything is based on deep learning, and the main distinguishing factor is the choice of training data. So, the question is: what is the right data?
Alignment is data engineering. Alignment training data specifies our values indirectly in the form of example demonstrations (instruction/safety fine-tuning), binary preferences (RLHF, DPO), or a constitution of abstract principles (constitutional AI, deliberative alignment). We train models against those underdetermined specifications and hope they internalize our intended constraints.
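To make this concrete, here is a schematic (and entirely hypothetical) sketch of what these three kinds of specification look like as data; the field names are illustrative rather than taken from any particular dataset or library:

```python
# Schematic, hypothetical examples of the three common alignment data formats.
# Field names are illustrative; real datasets and libraries use varying schemas.

# 1. Example demonstrations (instruction / safety fine-tuning):
demonstration = {
    "prompt": "How do I dispose of old medication safely?",
    "response": "Many pharmacies offer take-back programs; ...",
}

# 2. Binary preferences (RLHF, DPO): the model is trained to prefer `chosen`.
preference_pair = {
    "prompt": "Summarize this contract for a non-lawyer.",
    "chosen": "Here is a plain-language summary of the key obligations: ...",
    "rejected": "I can't help with legal documents.",
}

# 3. A constitution of abstract principles (constitutional AI, deliberative alignment):
constitution = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that could facilitate serious harm.",
]
```

None of these formats pins down behavior outside the training distribution, which is exactly what makes the specification underdetermined.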
Alignment may require fundamental scientific progress. As we've seen with recent examples of sycophancy and reward-hacking, our current indirect and empirical approach to alignment might not be enough to align increasingly powerful systems. These symptoms appear to stem from deeper scientific problems with deep learning:
- The Problem of Generalization: Distribution shifts can break learning-based constraints, so two models that behave similarly with respect to the training specification may generalize very differently in deployment. In particular, behaviors associated with general-purpose capabilities may generalize more robustly than alignment properties (the "sharp left turn").
- The Problem of Learning: Even on a fixed distribution, the training process may learn an unintended solution. Our learning processes are noisy, heterogeneous, and not guaranteed to converge to an optimal solution for a given set of constraints. As a result, models may learn a dangerous simplification of our specifications, acquire unintended instrumental strategies (deception, incorrigibility, power-seeking, sandbagging, etc.), or suddenly change in ways that are hard to predict (and execute a "treacherous turn").
Past a certain level of capabilities, making further safety progress may require scientific progress on some of these more fundamental questions. It is hard to align what you do not understand, and we still do not understand deep learning. In particular, we highlight two pairs of questions:
1a. (Learning) How does training data determine the algorithms that models learn?
1b. (Alignment) How can we choose training data to control what algorithms models learn?
2a. (Generalization) How do learned algorithms generalize (under distribution shift)?
2b. (Interpretability) How do a model's internals enable (mis)generalization?
Singular learning theory for a science of alignment. Singular Learning Theory (SLT) is a theory of Bayesian statistics that suggests the key to understanding deep learning is the geometry (specifically, the degeneracies) of the loss function and parameter-function map. SLT provides a starting point for understanding how learned algorithms underlie generalization and how training data determines those learned algorithms:
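To make "geometry of the loss landscape" concrete: the central quantity in SLT is the (local) learning coefficient. In rough form, following Watanabe and suppressing regularity conditions and lower-order terms, the Bayesian free energy of a neighborhood $W$ of a local minimum $w^*$ of the loss satisfies

$$F_n(W) \approx n L_n(w^*) + \lambda(w^*) \log n,$$

where $n$ is the number of samples, $L_n$ is the empirical loss, and $\lambda(w^*)$ is the local learning coefficient, a measure of how degenerate the loss is near $w^*$. For a regular (non-degenerate) minimum in $d$ dimensions, $\lambda = d/2$; degeneracies make $\lambda$ strictly smaller, so the posterior increasingly prefers more degenerate solutions as $n$ grows.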
- Interpretability: The loss landscape around a given set of weights reflects the algorithms encoded by those weights and the sensitivity of those algorithms to changes in the data distribution [1, 4, 6]. If we can develop (a) the empirical tools to read this geometry and (b) the theoretical understanding to "invert" this reflection (a hypothesis we call "Structural Bayesianism" [6]), then we gain a principled framework for interpreting learned algorithms and linking this structure to generalization (see the sketch after this list for one such empirical tool).
- Alignment: The algorithms a model learns are ultimately rooted in the model's training data: (a) statistical structure in the training data determines (b) the geometric structure of the loss landscape, which determines (c) the developmental structure of the learning process, which determines (d) the algorithmic structure in the final learned weights. If we can (partially) "invert" this mapping (a hypothesis we call the "S4 correspondence", for these four kinds of "structure" [3, 8]), then we can transform constraints on learned algorithms into interventions on training data. This provides a framework for aligning models through careful control of the training environment.
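As one example of what "reading this geometry" can look like empirically, below is a minimal sketch of estimating the local learning coefficient (LLC) of a trained model by sampling from a localized, tempered posterior with SGLD, in the spirit of [4]. The toy model, data, and hyperparameters are placeholders chosen for illustration; in practice, a maintained implementation (such as the devinterp package) is a better starting point than this sketch.

```python
# Minimal sketch (not production code): estimating the local learning coefficient
# of a trained model via SGLD samples from a localized, tempered posterior.
# The model, data, and hyperparameters below are illustrative placeholders.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- Toy setup: a small regression model trained to (near) convergence. ---
n = 1024                                   # number of training samples
X = torch.randn(n, 4)
y = X @ torch.tensor([1.0, -2.0, 0.0, 0.5]).unsqueeze(1) + 0.1 * torch.randn(n, 1)

model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(2000):                      # ordinary training to reach w*
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

w_star = [p.detach().clone() for p in model.parameters()]
with torch.no_grad():
    L_star = loss_fn(model(X), y).item()   # L_n(w*)

# --- SGLD sampling from the localized, tempered posterior around w*. ---
beta = 1.0 / math.log(n)   # inverse temperature beta* = 1 / log n
gamma = 100.0              # strength of the localizing term ||w - w*||^2
eps = 1e-5                 # SGLD step size
num_steps, burn_in = 2000, 500
losses = []

for step in range(num_steps):
    model.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    with torch.no_grad():
        for p, p0 in zip(model.parameters(), w_star):
            # Gradient of the potential U(w) = n*beta*L_n(w) + (gamma/2)*||w - w*||^2
            drift = n * beta * p.grad + gamma * (p - p0)
            p.add_(-0.5 * eps * drift + math.sqrt(eps) * torch.randn_like(p))
        if step >= burn_in:
            losses.append(loss_fn(model(X), y).item())

# --- LLC estimate: lambda_hat = n * beta * (E[L_n(w)] - L_n(w*)). ---
lambda_hat = n * beta * (sum(losses) / len(losses) - L_star)
print(f"estimated local learning coefficient: {lambda_hat:.2f}")
```

The estimate is larger for less degenerate (more "complex") solutions, which is what makes this kind of measurement usable as a probe of the algorithms encoded at a given set of weights.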
We expect that SLT is currently the best bet for developing a more complete science of deep learning and alignment. If successful, this would enable not just a better understanding of the theoretical problems of generalization and learning, but also the development of a new set of principled tools for "reading" and "writing" model internals.
This progress would not be enough to guarantee safety on its own, and a significant share of the risk burden would still fall on other areas of safety. However, fundamental progress may be necessary if we aren't on track to solve (inner) alignment by default.
Outline. The rest of this sequence will survey recent progress in SLT [1, 2, 3, 4, 5, 6, 7, 8] and explore the potential applications for interpretability, alignment, and other areas of safety discussed above:
- SLT for Interpretability (upcoming)
- SLT for Alignment (upcoming)
- SLT for Present-Day Safety (upcoming)
For more on the relevance of SLT to safety, see our position paper (co-written with a larger group of authors). For more on the technical fundamentals of SLT, see Distilling Singular Learning Theory by Liam Carroll.