SLT for AI Safety

This post examines the relationship between AI alignment and model capabilities, emphasizing the central role of training data in shaping model behavior. It argues that current alignment methods amount to data engineering, but that as models grow more capable, deeper scientific progress may be required. The post focuses on the potential of Singular Learning Theory (SLT) for understanding deep learning and improving alignment, arguing that SLT offers a framework for interpreting model internals, predicting generalization, and aligning models by controlling the training environment.

💡 **Data-driven alignment and capabilities:** As of 2025, the methods used to align models are essentially the same as the methods used to make them more capable. Both are based on deep learning; the main distinguishing factor is the choice of training data.

⚙️ **Alignment is data engineering:** Current alignment methods specify values indirectly, through example demonstrations, preference feedback, or abstract principles. Models are trained against these underdetermined specifications in the hope that they internalize the intended constraints.

⚠️ **Alignment may require scientific breakthroughs:** As models become more capable, today's indirect, empirical alignment methods may not be enough to ensure safety. The problems of generalization and of learning are underlying challenges in deep learning that can lead to misgeneralized behavior or sudden, unexpected changes.

🧐 **The potential of Singular Learning Theory (SLT):** SLT is a theory of Bayesian statistics which holds that the key to understanding deep learning is the geometry of the loss function and the parameter-function map. It provides a starting point for understanding learned algorithms, generalization, and how training data determines those algorithms.

🔬 **SLT for interpretability and alignment:** If the geometry of the loss landscape can be read and linked to generalization, the result is a framework for interpreting learned algorithms and for aligning models by controlling the training environment. SLT thus offers a principled foundation for advancing interpretability and alignment.

Published on July 1, 2025 4:52 AM GMT

This sequence draws from a position paper co-written with Simon Pepin Lehalleur, Jesse Hoogland, Matthew Farrugia-Roberts, Susan Wei, Alexander Gietelink Oldenziel, Stan van Wingerden, George Wang, Zach Furman, Liam Carroll, Daniel Murfet. Thank you to Stan, Dan, and Simon for providing feedback on this post. 

Alignment ≈ Capabilities. As of 2025, there is essentially no difference between the methods we use to align models and the methods we use to make models more capable. Everything is based on deep learning, and the main distinguishing factor is the choice of training data. So, the question is: what is the right data?

Figure 1: Data differentiates alignment from capabilities. Deep learning involves three basic inputs: (1) the architecture (+ loss function), (2) the optimizer, and (3) the training data. Of these, the training data is the main variable that distinguishes alignment from capabilities. 
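As a minimal sketch of this point (dataset names, loss choice, and hyperparameters are placeholders, not from the post), the same generic training loop can serve both purposes, with only the data swapped:

```python
# Illustrative sketch: the same deep-learning loop serves "capabilities" and
# "alignment" training -- the dataset is the only input that differs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def train(model: nn.Module, data: DataLoader, epochs: int = 1, lr: float = 1e-4) -> nn.Module:
    """One generic training loop: architecture, loss, and optimizer are fixed."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # e.g. next-token or classification loss
    for _ in range(epochs):
        for x, y in data:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Hypothetical usage -- the two runs differ only in `data`:
# model = train(model, DataLoader(web_text_corpus, batch_size=32))        # capabilities
# model = train(model, DataLoader(safety_demonstrations, batch_size=32))  # alignment
```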

Alignment is data engineering. Alignment training data specifies our values indirectly in the form of example demonstrations (instruction/safety fine-tuning), binary preferences (RLHF, DPO), or a constitution of abstract principles (constitutional AI, deliberative alignment). We train models against those underdetermined specifications and hope they internalize our intended constraints. 
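One concrete instance of such an indirect specification is a DPO-style preference loss. The sketch below assumes sequence-level log-probabilities for a chosen and a rejected response under the policy and a frozen reference model; variable names and the β value are illustrative:

```python
# Sketch of a DPO-style preference objective: the model is only told which of
# two responses a labeler preferred, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Binary preferences as an underdetermined specification of values."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Nothing here states *what* we value; the objective only rewards ranking
    # the preferred response above the rejected one.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```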

Alignment may require fundamental scientific progress. As we've seen with recent examples of sycophancy and reward-hacking, our current indirect and empirical approach to alignment might not be enough to align increasingly powerful systems. These symptoms appear to stem from deeper scientific problems with deep learning:

1. **The Problem of Generalization:** Distribution shifts can break learning-based constraints, so two models that behave similarly with respect to the training specification may generalize very differently in deployment (a toy illustration follows this list). In particular, behaviors associated with general-purpose capabilities may generalize more robustly than alignment properties (the "sharp left turn").
2. **The Problem of Learning:** Even on a fixed distribution, the training process may learn an unintended solution. Our learning processes are noisy, heterogeneous, and not guaranteed to converge to an optimal solution for a given set of constraints. As a result, models may learn a dangerous simplification of our specifications, acquire unintended instrumental strategies (deception, incorrigibility, power-seeking, sandbagging, etc.), or suddenly change in ways that are hard to predict (and execute a "treacherous turn").
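The toy example below (a deliberately simple illustration, not from the post) shows the generalization problem in miniature: two fits that are nearly indistinguishable on the training distribution can behave very differently once the inputs shift.

```python
# Two models that match the training specification equally well but generalize
# very differently under distribution shift.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=200)   # training distribution
y_train = np.sin(np.pi * x_train)            # "intended" behavior

fits = {
    "degree-5": np.polynomial.Polynomial.fit(x_train, y_train, deg=5),
    "degree-9": np.polynomial.Polynomial.fit(x_train, y_train, deg=9),
}

x_in = rng.uniform(-1.0, 1.0, size=200)      # in-distribution test
x_out = rng.uniform(3.0, 4.0, size=200)      # shifted test

for name, f in fits.items():
    mse_in = np.mean((f(x_in) - np.sin(np.pi * x_in)) ** 2)
    mse_out = np.mean((f(x_out) - np.sin(np.pi * x_out)) ** 2)
    print(f"{name}: in-distribution MSE {mse_in:.3g}, shifted MSE {mse_out:.3g}")
# Both fits look fine on [-1, 1]; only the shifted evaluation reveals how
# differently they extrapolate.
```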

Past a certain level of capabilities, making further safety progress may require making scientific progress on some of these more fundamental questions. It is hard to align what you do not understand, and we still do not understand deep learning. 

Figure 2: Key questions for a science of AI safety. Many practical questions in AI safety are grounded in fundamental scientific questions about deep learning: 
1a. (Learning) How does training data determine the algorithms that models learn?
1b. (Alignment) How can we choose training data to control what algorithms models learn?
2a. (Generalization) How do learned algorithms generalize (under distribution shift)?
2b. (Interpretability) How do a model's internals enable (mis)generalization?

Singular learning theory for a science of alignment. Singular Learning Theory (SLT) is a theory of Bayesian statistics that suggests the key to understanding deep learning is the geometry (specifically, the degeneracies) of the loss function and parameter-function map. SLT provides a starting point for understanding how learned algorithms underlie generalization and how training data determines those learned algorithms:

- **Interpretability:** The loss landscape around a given set of weights reflects the algorithms encoded by that set of weights and the sensitivity of those algorithms to changes in the data distribution [1, 4, 6]. If we can develop (a) the empirical tools to read this geometry and (b) the theoretical understanding to "invert" this reflection (a hypothesis we call "Structural Bayesianism" [6]), then we gain a principled framework for interpreting learned algorithms and linking this structure to generalization (see the sketch of an LLC estimator after this list).
- **Alignment:** The algorithms a model learns are ultimately rooted in the model's training data: (a) statistical structure in the training data determines (b) geometric structure in the loss landscape, which determines (c) developmental structure in the learning process, which determines (d) algorithmic structure in the final weights a model has learned. If we can (partially) "invert" this mapping (a hypothesis we call the "S4 correspondence", for the four kinds of "structure" [3, 8]), then we can transform constraints on learned algorithms into interventions on training data. This provides a framework for aligning models through careful control of the training environment.
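As one concrete example of "empirical tools to read this geometry", here is a deliberately simplified, from-scratch sketch of estimating the local learning coefficient (LLC) with SGLD sampling around trained weights, in the style of recent SLT work; the full-batch loss, fixed hyperparameters, and lack of convergence diagnostics are simplifications for illustration:

```python
# Sketch: estimate the local learning coefficient (LLC) at trained weights w*
# by sampling a localized, tempered posterior with SGLD and measuring how much
# the loss rises on average within the local basin.
import copy
import math
import torch
import torch.nn as nn

def estimate_llc(model: nn.Module, loss_fn, data: torch.Tensor, targets: torch.Tensor,
                 n_steps: int = 2000, burn_in: int = 500,
                 eps: float = 1e-4, gamma: float = 100.0) -> float:
    """lambda_hat = n * beta * (E_w[L_n(w)] - L_n(w*)), with beta = 1 / log(n).

    The expectation is over SGLD samples from exp(-n * beta * L_n(w)), localized
    around w* by a quadratic term gamma/2 * ||w - w*||^2.
    """
    n = len(data)
    beta = 1.0 / math.log(n)
    w_star = [p.detach().clone() for p in model.parameters()]
    with torch.no_grad():
        init_loss = loss_fn(model(data), targets).item()

    sampler = copy.deepcopy(model)
    draws = []
    for step in range(n_steps):
        loss = loss_fn(sampler(data), targets)   # full-batch loss, for simplicity
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                drift = n * beta * p.grad + gamma * (p - p0)
                p -= 0.5 * eps * drift                     # gradient step
                p += math.sqrt(eps) * torch.randn_like(p)  # SGLD noise
        if step >= burn_in:
            draws.append(loss.item())
    return n * beta * (sum(draws) / len(draws) - init_loss)
```

In practice, estimators like this are run with minibatch gradients, multiple chains, and careful hyperparameter calibration; the point of the sketch is only the shape of the quantity being measured, namely how sharply the loss increases, on average, as the weights are perturbed within the local basin.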

We expect that SLT is currently the best bet for developing a more complete science of deep learning and alignment. If successful, this would enable not just a better understanding of the theoretical problems of generalization and learning, but also the development of a new set of principled tools for "reading" and "writing" model internals. 

This progress would not be enough to guarantee safety on its own, and other areas of safety would still have to carry a substantial share of the risk burden. However, fundamental progress may be necessary if we aren't on track to solve (inner) alignment by default.

SLT for AI safety. The loss landscape is where the training data, architecture, and optimizer interact. We expect that understanding the geometry of this landscape is equivalent to understanding models' internal structure, how that structure generalizes, and how to reliably control which structures arise. This would provide a rigorous framework for advancing interpretability and alignment.

Outline. The rest of this sequence will survey recent progress in SLT [1, 2, 3, 4, 5, 6, 7, 8] and explore potential applications to interpretability, alignment, and the other areas of safety discussed above.

For more on the relevance of SLT to safety, see our position paper (co-written with a larger group of authors). For more on the technical fundamentals of SLT, see Distilling Singular Learning Theory by Liam Carroll.  



