Zoom Out: Distributions in Semantic Spaces

This article proposes a new paradigm for mechanistic interpretability (MI), distinct from the traditional approach of explaining networks in terms of neuron connections or circuits. In this paradigm, the inputs and outputs of each network layer are treated as semantic spaces, each containing distributions. The article describes the semantics of the input and output spaces and explains how latent semantic spaces act as a bridge between the two. The author argues that networks work via semantic mappings: each layer transforms its input semantic space, mapping it into a new output semantic space. This paradigm offers a new perspective on concepts, features, and meaning: concepts are boundaries around particular locations in a semantic space, features are aspects of latent semantic spaces that help us understand how semantic spaces are transformed, and meaning lives in the semantics of the semantic space we care about.

🌐 The inputs and outputs of each network layer are treated as semantic spaces containing distributions, rather than as individual signals. The input space typically contains distributions of data samples, such as the distributions of cat and dog images, while the output space contains the semantic distributions used for classification or labeling.

📊 Latent semantic spaces can be understood as intermediate steps between the input and output semantic spaces; they may involve "untangling" distributions or other semantic transformations, showing how the network processes and transforms information.

🎯 The network works via semantic mappings: each layer applies an affine transformation (rotating, sliding, scaling, and skewing) to its input semantic space and then "folds", "bends", "collapses", or "squashes" the parts that have been moved into the negative orthant. Any sample from the input distribution is mapped to the corresponding location in the new squashed distribution, which forms a new semantic space.

🔍 This paradigm reframes concepts, features, and meaning: a concept is a boundary, or set of boundaries, around a particular location in a semantic space; a feature is an aspect of a latent semantic space that helps us understand how semantic spaces are transformed; and meaning lives in the semantics of the semantic space we care about, such as the distribution of cat images and its subsets.

Published on August 6, 2025 12:01 AM GMT

(This article is edited and expanded from a comment I made to someone in the newly started BAIF Slack community. Thanks for the inspiration 🙏)

Introduction

In this article I present an alternative paradigm for Mechanistic Interpretability (MI). This paradigm may turn out to be better or worse, or it may naturally combine with the standard paradigm I often see implicitly extended from Chris Olah's "Zoom-In".

I've talked about this concept before, in various places. Someday I may collect them and try to present a strong case including a survey of paradigms in MI literature. For now, here is a relatively short introduction to the concept assuming some familiarity with ML and MI.

Afaik, Chris Olah originally introduced the concepts of "features" and "circuits" in "Zoom-In" as a suggestion for a direction of exploration, not as a certainty. It worked very well for thinking about things like "circle" and "texture" detectors, which I think is a natural, but incorrect, way of understanding what is going on.

New Mechanistic Interpretability Paradigm?

I have been developing an alternate paradigm I'm not currently sure anyone else is talking about.

It is now common to think of the collective inputs or outputs of network layers as vectors rather than individual signals. The concept that I am uncertain anyone is focusing on is that each vector is representative of a semantic space in which distributions live.

Input Space

For example, in a cat-dog-labeling net, the input space is images, and there are two distributions living in this space. The cat-distribution is all possible images that are of cats. We can make some claims about that distribution, such as the idea that it is continuous and connected. The same is true of the dog-distribution, but additionally, the dog-distribution may be connected to the cat-distribution in several places, through the set of images that are ambiguous: maybe a cat, maybe a dog. There is also implicitly a distribution of images that are neither dogs nor cats, but this can be ignored in simple examples.
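As a toy illustration of these claims, here is a minimal numpy sketch that uses two overlapping 2-D Gaussian blobs as stand-ins for the (much higher-dimensional) cat- and dog-distributions; the locations, spreads, and the "ambiguous strip" threshold are arbitrary choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the cat- and dog-distributions: two overlapping blobs in a
# 2-D "image space". Real image distributions are far higher dimensional,
# but the picture is the same: each class occupies a connected region, and
# the regions can overlap where images are ambiguous.
cat_images = rng.normal(loc=[-1.0, 0.0], scale=0.7, size=(500, 2))
dog_images = rng.normal(loc=[+1.0, 0.0], scale=0.7, size=(500, 2))

# Samples falling between the two means are the "maybe a cat, maybe a dog"
# region where the two distributions connect.
ambiguous_cats = cat_images[np.abs(cat_images[:, 0]) < 0.2]
print(f"{len(ambiguous_cats)} of the 'cat' samples sit in the ambiguous strip")
```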

Output Space

The output space has very different semantics. It is meant to label images as either a cat or a dog, so it may be a 1-dim space where being near (1,) corresponds to "highly cat-like", being near (0,) corresponds to "highly dog-like", and anything in between could be "ambiguous" or "neither cat nor dog". If there is also training for "not cat or dog", the space might be 2-dim, with the point (1,0) meaning cat-like and (0,1) meaning dog-like. Then "neither" would be (0,0) and "ambiguous" would be (0.5, 0.5). This kind of semantic space seems somehow intuitive to me. If you gave me a set of pictures to post on a cork-board, I feel like I could do a fair approximation of this.
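To make that concrete, here is a minimal sketch of reading points in the 2-dim output space; the thresholds are arbitrary choices for illustration, not anything the network itself defines.

```python
import numpy as np

def interpret_output(y: np.ndarray) -> str:
    """Read a point in the 2-dim output space as a label.

    Convention from the text: (1, 0) ~ cat-like, (0, 1) ~ dog-like,
    (0, 0) ~ neither, (0.5, 0.5) ~ ambiguous. The 0.3/0.7 thresholds
    are illustrative, not part of any trained network.
    """
    cat, dog = y
    if cat > 0.7 and dog < 0.3:
        return "highly cat-like"
    if dog > 0.7 and cat < 0.3:
        return "highly dog-like"
    if cat < 0.3 and dog < 0.3:
        return "neither cat nor dog"
    return "ambiguous"

for point in [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (0.5, 0.5)]:
    print(point, "->", interpret_output(np.array(point)))
```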

Details of Semantics

It seems noteworthy that the actual input and output semantic spaces as understood by the network may differ from these semantic understandings, which are based on the training dataset, the possible distributions it could imply, and the dynamics of the training. For example, if the training puts no constraint on mapping inputs to locations like (10,0) or (-1,0) in the output, then what would the semantics of those locations be? Would (10,0) correspond to something like "10 times as cat-like"? My intuition is that there would instead be a messy distribution extending out in the (1,0) direction, and that distribution would be determined by the shape of the input distribution, the network architecture, and the training dynamics. In other words, it would be a result of whatever was the easiest way to separate the parts of the input distribution that need to be separated. The same applies to the (-1,0) direction. I do not expect this to have semantics meaning "the opposite of cat-like". There may be something similar, especially if unsupervised rather than labeled methods are used, but it still has to do with the semantic distribution, not semantic directions.

Latent Semantic Spaces

Each of the latent spaces of the network could be understood as some step between the semantics of those two spaces. That might be an oversimplification; for example, there may be movement into a seemingly unrelated semantic space, either to "untangle" distributions or for some other reason. Even if it's not an oversimplification, there is a lot to be understood about what it means to step from the semantics of image-space to label-space.
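One practical way to poke at these latent semantic spaces is to capture each layer's activations and treat them as samples of whatever distributions live there. Here is a minimal PyTorch sketch, using a made-up toy architecture and random stand-in inputs:

```python
import torch
import torch.nn as nn

# A made-up toy cat/dog net; the layer sizes are arbitrary.
net = nn.Sequential(
    nn.Linear(2, 16), nn.ReLU(),   # first latent semantic space
    nn.Linear(16, 16), nn.ReLU(),  # second latent semantic space
    nn.Linear(16, 2),              # output semantic space
)

latents = {}

def save_activation(name):
    def hook(module, inputs, output):
        latents[name] = output.detach()
    return hook

# Hook the ReLU outputs: each captured tensor is a sample of the
# distribution living in that latent semantic space.
net[1].register_forward_hook(save_activation("latent_0"))
net[3].register_forward_hook(save_activation("latent_1"))

x = torch.randn(256, 2)  # stand-in samples from the input distribution
_ = net(x)
for name, acts in latents.items():
    print(name, tuple(acts.shape), "fraction squashed to 0:",
          (acts == 0).float().mean().item())
```

Comparing how the cat-samples and dog-samples are arranged in each captured latent is one way of asking what semantics that intermediate space actually carries.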

Semantic Mappings

With this paradigm, the network is not understood in terms of neuron connections or circuits at all, but instead as "semantic mappings". Each layer of the network is an affine transformation (rotating, sliding, scaling, and skewing) that prepares the space for the activation function, which "folds", "bends", "collapses", or "squashes" the parts of the space that have been moved into the negative orthant (the generalization of a quadrant to n dimensions). This also squashes any distributions that existed in that part of the space. In other words, any input sampled from the distribution of possible inputs gets mapped to the corresponding location in the new squashed distribution. The result is a new, slightly transformed semantic space as the output of the layer.
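Here is a minimal numpy sketch of a single such semantic mapping, with arbitrary illustrative weights: an affine transform followed by a ReLU that squashes whatever has been moved into the negative orthant.

```python
import numpy as np

rng = np.random.default_rng(1)

# One layer as a semantic mapping: an affine transform (rotate, slide,
# scale, skew) followed by a ReLU that collapses anything that has been
# moved into the negative orthant onto the boundary of the positive one.
W = np.array([[0.9, -0.4],
              [0.3,  1.1]])   # illustrative weights
b = np.array([-0.5, 0.2])     # illustrative bias

x = rng.normal(size=(1000, 2))        # samples from some input distribution
affine = x @ W.T + b                  # distribution rotated/slid/scaled/skewed
out = np.maximum(affine, 0.0)         # negative coordinates squashed to zero

frac_folded = np.mean(np.any(affine < 0, axis=1))
print(f"{frac_folded:.0%} of the samples had at least one coordinate folded to 0")
```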

The goal given to the network during its training is to find the semantic mappings that transform the input semantic space (defined by the input dataset) into the output semantic space (defined by the labels of the dataset). What I think we have found, empirically, with neural networks is that they are up to this task. They can, through a sequence of squashings, transform semantic spaces into very different-looking semantic spaces.
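As a minimal sketch of that search, here is a toy PyTorch training loop in which 2-D blobs stand in for the input semantic space and (1,0)/(0,1) targets define the output semantic space; the architecture, loss, and hyperparameters are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy input semantic space: two overlapping blobs for "cat" and "dog".
cats = torch.randn(500, 2) * 0.7 + torch.tensor([-1.0, 0.0])
dogs = torch.randn(500, 2) * 0.7 + torch.tensor([+1.0, 0.0])
x = torch.cat([cats, dogs])

# Target output semantic space: (1, 0) for cat-like, (0, 1) for dog-like.
y = torch.cat([torch.tensor([[1.0, 0.0]]).repeat(500, 1),
               torch.tensor([[0.0, 1.0]]).repeat(500, 1)])

net = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

# Training searches for the sequence of squashings that carries the input
# distributions onto the target output distributions.
for step in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```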


So from within this paradigm, the answers to the questions "what is a concept?", "what is a feature?", and "where does meaning live?" look something like this: a concept is a boundary, or set of boundaries, around a particular location in a semantic space; a feature is an aspect of a latent semantic space that helps us understand how one semantic space is transformed into the next; and meaning lives in the semantics of whichever semantic space we care about, such as the distribution of cat images and its subsets.


If you finished reading this, thanks!

Let me know what you think, and if you know of any work that seems related, please send me a link.


