What are polysemantic neurons?

This article examines the phenomenon of polysemantic neurons in neural networks, i.e., single neurons that respond to multiple unrelated features. Unlike monosemantic neurons, which activate for only one feature, polysemantic neurons arise from "superposition": the network packs several features into a limited number of neurons, increasing its capacity. This, however, makes the network's internal mechanisms harder to understand, because a single neuron governs the connections among multiple features. In 2023, Anthropic proposed decomposing polysemantic neurons into more interpretable monosemantic ones using sparse autoencoders, offering a new angle on the problem, though the effectiveness of this approach remains debated.

🧠 A polysemantic neuron is a single neuron that responds strongly to multiple unrelated features, for example activating on cat faces, cat legs, and the fronts of cars.

💡 Polysemantic neurons arise from "superposition": to fit more features into a limited number of neurons, the network combines several features in a single neuron, increasing its capacity.

🚧 Polysemantic neurons make the internal mechanisms of neural networks harder to understand: because each one encodes multiple features, a single neuron's behavior affects the connections among many features, limiting our ability to reason about the network.

🔬 Although Anthropic has proposed decomposing polysemantic neurons into monosemantic ones using sparse autoencoders, the effectiveness of this method still needs further validation, and it remains an active research area.

Published on January 8, 2025 7:35 AM GMT

This is an article in the featured articles series from AISafety.info. AISafety.info writes AI safety intro content. We'd appreciate any feedback.

The most up-to-date version of this article is on our website, along with 300+ other articles on AI existential safety.

For a “monosemantic” neuron, there’s a single feature that determines whether or not it activates strongly. If a neuron responds only to grandmothers, we might call it a grandmother neuron. For a “polysemantic” neuron, in contrast, there are multiple features that can cause it to activate strongly.

As an example, this image shows feature visualizations of a neuron that activates when it sees either a cat face, cat legs, or the front of a car. As far as anyone can tell, this neuron is not responding to both cats and cars because cars and cats share some underlying feature. Rather, the neuron just happened to get two unrelated features attached to it.

How do we know that the neurons are not encoding some sort of shared similarity?

Suppose a polysemantic neuron is picking out some feature shared by cars and cats. Say the neuron is representing “sleekness”. Then we’d expect images of other “sleek” things, like a snake or a ferret, to activate the neuron. So if we generate lots of different images that highly activate our neuron, and find that they do contain snakes and ferrets, that’s evidence for the neuron picking up on a unified concept of sleekness. Researchers have run experiments like this on neurons like this one and found that, no, they just activate on cats and cars — just as the “polysemantic” hypothesis would lead us to expect.
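
As a rough illustration of this kind of check, here is a minimal Python sketch. The model, dataset, and numbers are all simulated stand-ins rather than anything from a real experiment: we tally what the top-activating inputs for a neuron actually depict and see whether the supposedly "sleek" categories ever show up.

```python
import numpy as np

# Simulated stand-in for a real experiment: `activations` holds one neuron's
# response to each image in a dataset, `labels` names what each image shows.
rng = np.random.default_rng(0)
labels = np.array(["cat", "car", "snake", "ferret", "dog"] * 200)
activations = rng.normal(size=labels.shape[0])
# Make the neuron respond strongly to cats and cars only (the polysemantic case).
activations[np.isin(labels, ["cat", "car"])] += 3.0

# What do the top-activating images actually contain?
top_k = np.argsort(activations)[-50:]
categories, counts = np.unique(labels[top_k], return_counts=True)
print(dict(zip(categories, counts)))
# If snakes and ferrets never appear among the top activators, the shared
# "sleekness" hypothesis loses support; the neuron looks genuinely polysemantic.
```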

Why do polysemantic neurons form?

Polysemantic neurons seem to result from a phenomenon known as “superposition”. Superposition means a neural net combines multiple features in individual neurons, so that it can pack more features into the limited number of neurons it has available, or use fewer neurons, conserving them for more important tasks. In fact, if we only care about packing as many features into n neurons as we can, then using polysemantic neurons lets us pack roughly as many as exp(C * n) features, where C is a constant depending on how much overlap between concepts you allow.[1] In contrast, using monosemantic neurons would only let you pack in n features.
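
To get a feel for why superposition buys so much extra capacity, here is a small numerical sketch (the dimensions are illustrative and not tied to any real network). Random directions in an n-dimensional activation space are nearly orthogonal to one another, which is the geometric fact behind the exponential estimate above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 100, 2000  # far more features than neurons

# Random unit vectors in R^n are nearly orthogonal with high probability,
# so many feature directions can share the same set of neurons ("superposition").
features = rng.normal(size=(n_features, n_neurons))
features /= np.linalg.norm(features, axis=1, keepdims=True)

overlaps = features @ features.T
np.fill_diagonal(overlaps, 0.0)
print("max |cosine| between distinct feature directions:", np.abs(overlaps).max())
# With strictly orthogonal (monosemantic) directions we could store at most
# 100 features; near-orthogonality lets 20x as many coexist here, at the cost
# of modest interference between them (the overlaps stay well below 1).
```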

What are the consequences of polysemantic neurons arising in networks?

Polysemantic neurons are a major challenge for the “circuits” agenda because they limit our ability to reason about neural networks. Because they encode multiple features, it’s harder to interpret these neurons individually when thinking about circuits. As an example: if we only have two polysemantic neurons, which encode five different features each, then we have effectively 25 different connections between features that are all governed by a single weight.
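
The counting is easy to see with a toy sketch (the feature names below are entirely made up): every feature represented by the first neuron interacts with every feature represented by the second through that one shared weight.

```python
from itertools import product

# Hypothetical feature assignments for two polysemantic neurons,
# each encoding five unrelated features.
neuron_a_features = ["cat face", "car front", "clock", "dog ear", "wheel"]
neuron_b_features = ["whiskers", "headlight", "hour hand", "fur", "tire"]

# The single weight connecting neuron A to neuron B mediates every pairing
# of a feature in A with a feature in B.
feature_pairs = list(product(neuron_a_features, neuron_b_features))
print(len(feature_pairs))  # 25 feature-to-feature connections, one weight
```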

However, there has been some recent progress. In 2023, Anthropic claimed a breakthrough on this problem in their paper “Towards Monosemanticity”. Anthropic trained large sparse autoencoders to decompose the polysemantic neurons in a neural network into a larger number of monosemantic neurons, which are (claimed to be) more interpretable.

Christopher Olah of Anthropic stated he is “now very optimistic [about superposition]”, and would “go as far as saying it’s now primarily an engineering problem — hard, but less fundamental risk.” Why did we caveat Anthropic’s claims? Because some researchers, like Ryan Greenblatt, are more skeptical about the utility of sparse autoencoders as a solution to polysemanticity.
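
For intuition about what such a sparse autoencoder looks like, here is a minimal NumPy sketch. It is not Anthropic's implementation; the dimensions, initialization, and penalty coefficient are arbitrary. The idea is to map neuron activations into a much wider hidden layer, penalize that layer for being active, and reconstruct the original activations, so that training (not shown here) pushes each hidden unit toward representing a single feature.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, batch = 64, 512, 1024  # hidden layer much wider than the input

# Toy stand-in for the activations of a layer of polysemantic neurons.
acts = rng.normal(size=(batch, d_model))

# Sparse autoencoder parameters (randomly initialized; training loop not shown).
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

def sae_loss(acts, l1_coeff=1e-3):
    # Encode into an overcomplete basis; ReLU keeps feature activations non-negative.
    hidden = np.maximum(acts @ W_enc + b_enc, 0.0)
    # Reconstruct the original activations from the sparse code.
    recon = hidden @ W_dec + b_dec
    reconstruction_error = np.mean((recon - acts) ** 2)
    # The L1 penalty encourages each input to use only a few hidden units,
    # nudging each unit toward representing a single feature.
    sparsity_penalty = l1_coeff * np.mean(np.abs(hidden).sum(axis=1))
    return reconstruction_error + sparsity_penalty

print(sae_loss(acts))
```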

  1. ^

    This is a consequence of the Johnson-Lindenstrauss lemma. As this estimate doesn’t account for using the exponential number of features for useful computations, it is unclear if neural networks actually achieve this bound in practice. (The use of polysemanticity in computations is an active research area. For a model of how polysemanticity aids computations, see “Towards a Mathematical Framework for Computation in Superposition”.) What about lower bounds on how much computation can be done with polysemantic neurons? These estimates depend on assumptions about the training data, the number of concepts, the initialization of weights, etc., so it is hard to give a good lower bound in general. But for some cases, we do have estimates: e.g., “Incidental polysemanticity” notes that, depending on the ratio of concepts to neurons, the initialization process can lead to a constant fraction of polysemantic neurons.


