Opportunity Space: Renormalization for AI Safety 

This post explores treating neural networks (NNs) as complex statistical systems and using renormalization techniques to improve their interpretability. It analyzes NNs from three angles: the data-generating process, NN activations, and the learning process. It then sets out three core aims for the research it will support: determining effective renormalization methods, clarifying the degree of formalism needed to apply field-theoretic principles, and identifying the factors critical to advancing NN interpretability. The post also details two research programmes, focused on implicit and explicit renormalization techniques respectively, which aim to explore scale, feature extraction, and causal decomposition in NNs and ultimately improve the interpretability and safety of AI systems.

🧠 Neural networks as statistical systems: The post proposes viewing different aspects of a neural network, including the data-generating process, NN activations, and the learning process, as complex statistical systems. This perspective provides the theoretical basis for applying renormalization techniques from physics.

🔍 Core research aims: The post sets out three core research aims, namely determining effective renormalization methods, clarifying the degree of formalism needed to apply field-theoretic principles, and identifying the factors critical to advancing NN interpretability, giving direction to subsequent research.

⚙️ Implicit renormalization research: This programme explores 'natural' notions of scale and renormalization in NNs and aims to build them into a coherent renormalization framework, particularly for AI safety. Its focus includes identifying scales in NNs, understanding the strengths and weaknesses of different approaches, and studying the relationship between critical behavior and universality.

💡 Explicit renormalization research: This programme aims to develop unsupervised explicit renormalization techniques for identifying features in NNs. The goal is feature-extraction methods that outperform existing interpretability tools (such as SAEs) and thereby yield better explanations of AI systems.

🔬 Research programmes and applications: The post details two research programmes, focused on implicit and explicit renormalization respectively, and discusses their application in settings such as multi-agent systems and large reasoning models.

Published on March 31, 2025 8:55 PM GMT

This opportunity space was developed with Dmitry Vaintrob and Lucas Teixeira as part of PIBBSS' horizon scanning initiative. A detailed roadmap can be found here.

Background

The basic premise is to view different aspects of a neural network (NN) as a complex statistical system, as follows:

The data-generating process

An NN is trained on empirically obtained labeled datasets drawn from some "ground truth" distribution about which it must learn to make predictions. By modeling the data-generating process as a physical system (e.g., a Gaussian field theory), we can use renormalization to reason about the ontology of the dataset independently of the NN representation.
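
As a deliberately toy illustration of this framing, the sketch below samples a "dataset" from a stationary Gaussian field with an assumed power-law spectrum and applies one real-space coarse-graining step by block-averaging. The lattice size, spectral exponent, and block size are illustrative assumptions rather than choices made in this post; the point is only that, for a Gaussian data-generating process, the two-point statistics tracked here carry the full content of the theory along the coarse-graining flow.

```python
# Toy "data-generating process": a stationary 1D Gaussian field with an assumed
# power-law spectrum, plus one real-space coarse-graining (block-averaging) step.
# Lattice size, exponent, and block size are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
N = 1024        # lattice sites per sample
alpha = 1.5     # assumed spectral exponent, S(k) ~ |k|^(-alpha)

def sample_gaussian_field(n_samples):
    """Draw samples of the Gaussian field by coloring white noise in Fourier space."""
    k = np.fft.rfftfreq(N)
    k[0] = np.inf                                   # drop the zero mode
    amp = k ** (-alpha / 2.0)
    noise = rng.normal(size=(n_samples, k.size)) + 1j * rng.normal(size=(n_samples, k.size))
    return np.fft.irfft(amp * noise, n=N, axis=1)

def block_average(x, block=4):
    """One coarse-graining step: average non-overlapping blocks of sites."""
    return x.reshape(x.shape[0], -1, block).mean(axis=2)

fields = sample_gaussian_field(2000)
coarse = block_average(fields)

# For a Gaussian theory the two-point function is the whole story, so comparing
# second moments before and after coarse-graining tracks the flow directly.
print("fine-grained variance:  ", float(fields.var()))
print("coarse-grained variance:", float(coarse.var()))
```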

NN activations

We can view the NN itself as a statistical system, which transforms an input to an output via a series of intermediate activations. While a trained neural net is deterministic, interpretability seeks suitably "simple", "sparse", or otherwise "interpretable" explanations by coarse-graining the full information in the NN's activations into a smaller number of "nicer" summary variables, or features. This process loses information, sacrificing determinism for 'nice' statistical properties. We can then try to implement renormalization by introducing notions of locality and coarse-graining on the neural net inputs and activations and looking for relevant features via an associated RG flow. While interpreting a trained model via renormalization is an explicit process, we note that it is done implicitly in diffusion models (e.g. Wang et al., 2023), which sequentially extract coarse-to-fine features at various noise scales.
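
To make the coarse-graining picture concrete, here is a minimal sketch. Everything in it is an illustrative assumption (a random toy MLP, a PCA-based compression, a linear readout), not a method proposed in this post. It compresses a layer of activations into a few summary variables and asks how much of the network's deterministic output those summaries still explain, which is one naive way to quantify the information sacrificed by a given coarse-graining.

```python
# Coarse-grain the hidden activations of a toy deterministic MLP into a few
# summary variables (top principal components) and measure how much of the
# network's output they still explain. All specifics here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n = 32, 256, 8, 5000

W1 = rng.normal(scale=d_in ** -0.5, size=(d_in, d_hidden))
W2 = rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d_out))

X = rng.normal(size=(n, d_in))
H = np.maximum(X @ W1, 0.0)          # fine-grained description: all hidden activations
Y = H @ W2                           # deterministic network output
Yc = Y - Y.mean(axis=0)

def coarse_grain(h, k):
    """Summarize activations by their projection onto the top-k principal components."""
    hc = h - h.mean(axis=0)
    _, _, vt = np.linalg.svd(hc, full_matrices=False)
    return hc @ vt[:k].T

for k in (4, 16, 64):
    summary = coarse_grain(H, k)
    beta, *_ = np.linalg.lstsq(summary, Yc, rcond=None)   # linear readout from summaries
    resid = Yc - summary @ beta
    r2 = 1.0 - resid.var() / Yc.var()
    print(f"{k:3d} summary variables explain R^2 = {r2:.3f} of the output")
```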

The learning process

Because the network is randomly initialized in the weight landscape, learning is an inherently stochastic process: controlled by the architecture, it takes an input data distribution and produces a function implemented by the weights.

Of all the processes discussed here, this one looks most like a "physical" statistical theory; in certain limits of simple systems (Lee et al., 2023), this process is very well described by either a free (vacuum) statistical field theory or a computationally tractable perturbative theory controlled by Feynman diagram expansions. Though these approximations fail in more realistic, general cases (Lippl & Stachenfeld; Perin & Deny), we nevertheless expect them to hold locally in certain contexts and for a suitable notion of locality. This is analogous to linear features and directions: experiments on activation addition show that, while interesting neural nets are not linear, they behave "kind of linearly" in many contexts (Turner et al.; Shoenholz et al., 2016; Lavie et al., 2024; Cohen et al., 2019).
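
As a small worked example of the free-theory limit referenced above (hedged: the one-hidden-layer ReLU ensemble, initialization scales, and probe inputs are our own illustrative choices, not ones made in the cited work), the sketch below compares the empirical two-point function of randomly initialized networks with the standard analytic arc-cosine NNGP kernel, which is exactly the data specifying the corresponding free Gaussian theory.

```python
# Compare the ensemble two-point function of one-hidden-layer ReLU networks at
# initialization with the analytic arc-cosine NNGP kernel that defines the
# corresponding free Gaussian theory. Widths, scales, and inputs are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_in = 8
x1, x2 = rng.normal(size=d_in), rng.normal(size=d_in)

def analytic_kernel(x, y):
    """Infinite-width NNGP kernel for ReLU nets with variance-1/fan_in weights."""
    sx, sy = np.linalg.norm(x) / np.sqrt(d_in), np.linalg.norm(y) / np.sqrt(d_in)
    cos_t = np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return sx * sy * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)

def empirical_kernel(x, y, width, n_nets=20_000):
    """Monte Carlo estimate of E[f(x) f(y)] over random initializations."""
    W1 = rng.normal(scale=d_in ** -0.5, size=(n_nets, d_in, width))
    w2 = rng.normal(scale=width ** -0.5, size=(n_nets, width))
    fx = np.einsum("nj,nj->n", np.maximum(np.einsum("i,nij->nj", x, W1), 0.0), w2)
    fy = np.einsum("nj,nj->n", np.maximum(np.einsum("i,nij->nj", y, W1), 0.0), w2)
    return float((fx * fy).mean())

# For this one-hidden-layer ensemble the two-point function matches the Gaussian
# prediction at any width (up to Monte Carlo noise); finite-width "interacting"
# corrections show up only in higher moments.
print("analytic free-theory kernel:", round(analytic_kernel(x1, x2), 4))
for width in (2, 16, 128):
    print(f"empirical kernel, width {width:3d}:", round(empirical_kernel(x1, x2, width), 4))
```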

Open Programmes

We will support research projects in line with three core aims: 

    To determine which renormalization methods, assumptions, and techniques, including numerical RG, functional RG, and real-space RG, are most effective in various neural network contexts. We expect to need different implementations for implicit and explicit renormalization, but also for different ‘phases’ of training and inference. We value projects that are clear about the epistemic ‘baggage’ associated with techniques from different physics disciplines, including the value they add and the limitations they bring.
    To clarify the degree of formalism needed to apply field-theoretic principles and renormalization techniques to AI systems. This includes rigorous definitions of "theoretical tethers" and the development of a coherent renormalizability framework grounded in fixed points, critical phenomena, universality, and Gaussian limits.
    To identify ‘difference-making factors’ critical to making renormalization useful for advancing NN interpretability and aligning theoretical models with empirical observations.

Projects should also be in scope of one of the following programmes. 

Programme 1: Model organisms of implicit renormalization: Relating and comparing different notions of scale and abstraction

Previous work (Roberts et al. 2021; Berman et al. 2023; Erbin et al. 2022; Halverson et al. 2020; Ringel et al. 2025) hints at ‘natural’ notions of scale and renormalization in NNs, but, as in physics, there is no one ‘right’ way to operationalize the array of tools and techniques we have at hand. This programme aims to probe the respective regimes of validity of different approaches so that they can be built into a coherent renormalization framework for AI safety. By engineering situations in which renormalization has a ‘ground truth’ interpretation, we seek comprehensive theoretical and empirical descriptions of NN training and inference. Where current theories fall short, we aim to identify the model-natural (rather than physical) information needed for renormalization to provide a robust, reliable explanation of AI systems. In addition to physics, this research may also draw on insights from fields like neuroscience and biology to inform our understanding of scaling behavior in AI systems. We would also be excited to support the development of an implicit renormalization framework.

A non-exhaustive list of topics that projects in this programme may address:

    Identifying ‘natural’ scales and corresponding measures of ‘closeness’ in toy NNs, and the relationship between network scales and the emergence of features or capabilities. For example, you may operationalize the connection between scale and:
      Token position and “information space geometry” in sequence-generating models (e.g. Marzen et al.; Shai et al.).
      Noise scales in diffusion models (Sclocchi et al. 2024).
      Extending multi-scale formulations of RG (Rubin et al.).
      Reinforcement Learning (RL) paradigms or multi-agent systems.
    Understanding how relevant features (for example, in the sense of Gordon et al.) build up to specific capabilities.
    Creating novel architectures with a more interpretable notion of scale and coarse-graining.
    Exploring similarities and respective strengths and weaknesses of different approaches in these settings, for example Wilsonian RG vs. real-space RG or variational RG (Mehta et al.).
    Identifying and characterizing critical behavior and how it relates to universality in NNs. For example:
      Universality across architectures (e.g., LLMs, diffusion models), datasets, or different feature representations.
      The difference between critical behavior during training (Bukva et al.; Fischer et al.) vs. inference (Aguilera et al.; Howard et al.).
      Critical points as (possibly idealized) ‘tethers’ along an RG flow (e.g., Gaussian or non-Gaussian fixed points, as in Erdminger et al. and Demirtas et al.).
    Critically evaluating the assumption of Gaussian behavior, for example:
      In network dynamics, building on work like Berman et al.
      Its relation to the Linear Representation Hypothesis. In other words: are NNs ‘kind of Gaussian’ in the same way that they are ‘kind of linear’?
    Testing ‘separation of scales’ as conditional causal independence between fine-grained and coarse-grained phenomena (see the sketch after this list for one naive operationalization).
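
The sketch referenced in the last bullet: a purely synthetic test of ‘separation of scales’ as conditional independence. A block of fine-grained variables is summarized by its mean (the coarse variable); an observable that depends only on that coarse variable correlates with individual fine modes, but the partial correlation vanishes once the coarse summary is conditioned on. The two-scale toy system and the linear partial-correlation test are illustrative assumptions; real NN settings would need far more care.

```python
# Toy test of 'separation of scales' as conditional independence: an observable
# that depends only on the coarse-grained summary (the block mean) decorrelates
# from individual fine-grained modes once the summary is conditioned on.
# The synthetic system and the linear partial-correlation test are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, block = 20000, 8

fine = rng.normal(size=(n, block))               # fine-grained degrees of freedom
coarse = fine.mean(axis=1, keepdims=True)        # coarse-grained summary
observable = np.tanh(2.0 * coarse[:, 0]) + 0.1 * rng.normal(size=n)

def partial_corr(y, x, z):
    """Correlation of y and x after linearly regressing z out of both."""
    z1 = np.column_stack([np.ones(len(z)), z])
    ry = y - z1 @ np.linalg.lstsq(z1, y, rcond=None)[0]
    rx = x - z1 @ np.linalg.lstsq(z1, x, rcond=None)[0]
    return float(np.corrcoef(ry, rx)[0, 1])

print("corr(observable, one fine mode):  ",
      round(float(np.corrcoef(observable, fine[:, 0])[0, 1]), 3))
print("partial corr given coarse summary:",
      round(partial_corr(observable, fine[:, 0], coarse), 3))
```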

You may be a good fit for this programme if you have: 

Programme 2: Development of unsupervised Explicit Renormalization techniques to identify features in NNs

Inspired by existing work (Fischer et al. 2024; Berman et al. 2023), we think that explicit renormalization can be used to find features which are nonlinear, principled, and causally decoupled with respect to the computation. The ultimate goal of this programme is to operationalize renormalization for optimally interpreting the statistical system representing the AI's reality. Resulting techniques should be capable of finding unsupervised features that perform better than state-of-the-art interpretability tools like SAEs (Anders et al.).

The general problem with causal decomposition remains the extraction of principled features. While SAEs are a particularly interpretable and practical unsupervised technique for obtaining interesting (linear) features of activations, they are not optimized to provide a complete decomposition into causal primitives of computation: SAEs do not give the "correct" features in general (Mendel 2024). When interpreting more sophisticated behaviors than simple grammar circuits, SAEs by themselves are simply not enough to give a strong interpretation, including a causal decomposition (Leask et al., 2025).

However, SAEs may be, for explicit renormalization, what early renormalization was for RG: an ad hoc approach to ‘engineer’ away unphysical divergences that nevertheless laid the foundation for a theoretical formalism. We see them, and the associated Linear Representation Hypothesis (LRH), as a first-order ansatz from which to build.
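
For readers who want the baseline in front of them, here is a minimal sparse autoencoder in the generic SAE mould discussed above. It is a sketch under stated assumptions (synthetic activations built from sparse combinations of random directions, an L1 penalty, arbitrary dimensions) and not the specific tooling of the cited works; it is the kind of first-order ansatz that explicit-renormalization techniques in this programme would aim to improve upon.

```python
# Minimal sparse autoencoder (SAE) baseline on synthetic activations that are
# sparse combinations of random directions. Dimensions, sparsity level, and the
# L1 coefficient are illustrative assumptions, not settings from the cited work.
import torch

torch.manual_seed(0)
d_model, d_dict, n_true, n = 64, 256, 128, 50_000

# Synthetic "activations": each sample activates a few ground-truth directions.
true_feats = torch.nn.functional.normalize(torch.randn(n_true, d_model), dim=1)
codes = (torch.rand(n, n_true) < 0.03).float() * torch.rand(n, n_true)
acts = codes @ true_feats

enc = torch.nn.Linear(d_model, d_dict)
dec = torch.nn.Linear(d_dict, d_model, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
l1_coeff = 3e-3

for step in range(2000):
    batch = acts[torch.randint(0, n, (1024,))]
    latent = torch.relu(enc(batch))                      # sparse (nonnegative) code
    recon = dec(latent)
    loss = (recon - batch).pow(2).mean() + l1_coeff * latent.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    latent = torch.relu(enc(acts[:2048]))
    mse = (dec(latent) - acts[:2048]).pow(2).mean().item()
    active = (latent > 1e-4).float().sum(dim=1).mean().item()
    print(f"reconstruction MSE: {mse:.5f}; mean active latents per sample: {active:.1f}")
```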

Projects in this programme may: 

    Address whether an idealized theoretical framework (like the NN-QFT correspondence) is a sufficient heuristic for guiding operational techniques for feature abstraction.
    Measure the comparative advantage of this approach over SAEs.
    Think critically about how candidate feature scales are related to the notion of ‘human interpretability’.
      For example, is feature splitting related to a scale of abstraction that reflects an (un)principled method of semantic labeling, or can it be related to a model-natural notion of scale?
    Relate explicit renormalization to implicit renormalization by developing good metrics for distinguishing between the two flows.
    Extend this framework to interpret multi-agent systems or Large Reasoning Models (LRMs).

You might be a good fit for this programme if you have: 

For the Future

These programmes build on the first two, so their direction will be set at a later time. For now, we present a rough scope to set our intentions for future work.

Programme 3: Leveraging insights gained from implicit renormalization for ‘principled safety’ of AI systems 

Inspired by the separation of scales between different effective field theories along an RG flow, this programme seeks to provide both theoretical justification and empirical validation for a causal separation of scales in neural networks. This work depends on our ability to show that, under appropriate conditions, fine-grained behaviors can be conditionally isolated given coarser scales, thereby enhancing our capacity to design AI systems with principled safety guarantees. It also depends on a better operationalization of AI safety concepts like ‘deception’, to understand whether complex features such as deception are in fact separated from finer-grained features at some scale. Another aim is to find candidate operationalizations for identifying alternative RG flows with ‘safer’ critical points (with carefully defined metrics for measuring this).

Programme 4: Applying field theory ‘in the wild’

This programme puts what we have learned in the first three programmes to work, testing our framework against empirical evidence of scale, renormalization, and other field theory (FT) techniques as they naturally occur in large-scale, real-world systems. The aim is to build more general evidence for, and further theoretical development of, implicit renormalization in SOTA environments. A core goal would be to use the results to develop an analog of the NN-QFT correspondence far from Gaussianity.
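
One concrete diagnostic in this direction (a sketch: the narrow random ReLU ensemble and probe inputs are chosen purely for illustration) is to compare an empirical four-point function of network outputs with the Wick prediction built from the two-point function. The gap between them measures how far the system is from a free Gaussian description, and it is the kind of quantity a non-Gaussian NN-QFT analogue would need to control.

```python
# Diagnostic of non-Gaussianity: compare an empirical four-point function of
# network outputs with the Wick (free-theory) prediction built from the
# two-point function. The narrow random ReLU ensemble is an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
d_in, width, n_nets = 8, 4, 100_000
xs = rng.normal(size=(4, d_in))                  # four fixed probe inputs

W1 = rng.normal(scale=d_in ** -0.5, size=(n_nets, d_in, width))
w2 = rng.normal(scale=width ** -0.5, size=(n_nets, width))
h = np.maximum(np.einsum("ai,nij->naj", xs, W1), 0.0)
f = np.einsum("naj,nj->na", h, w2)               # outputs f_a = f(x_a), shape (n_nets, 4)
f -= f.mean(axis=0)

G2 = f.T @ f / n_nets                            # empirical two-point function
a, b, c, d = 0, 1, 2, 3
G4 = np.mean(f[:, a] * f[:, b] * f[:, c] * f[:, d])
wick = G2[a, b] * G2[c, d] + G2[a, c] * G2[b, d] + G2[a, d] * G2[b, c]

print("empirical 4-point function:   ", round(float(G4), 4))
print("Wick (Gaussian) prediction:   ", round(float(wick), 4))
print("connected (non-Gaussian) part:", round(float(G4 - wick), 4))
```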


