LessWrong, May 20, 13:22
SAE vs. RepE

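Dan Hendrycks of xAI published a post criticizing Anthropic's heavy focus on sparse autoencoders (SAEs) for mechanistic interpretability. He argues that SAEs face challenges in practice and have yielded little return, and suggests shifting the focus to representation engineering (RepE). RepE looks at patterns of representation inside the model rather than at individual neurons, and manipulates those representations to control models and make them safer. The post's author argues that SAEs and RepE are not mutually exclusive and that both may be valuable tools. Although SAEs face challenges, there have been some encouraging results, such as Golden Gate Claude and the "clamping" and "upweighting" of circuits. The author hopes to prompt discussion about whether SAEs are still worth investing in.

🤔 Dan Hendrycks criticizes Anthropic's focus on sparse autoencoders (SAEs), arguing that SAEs are hard to get working reliably in practice and underperform a simple baseline at detecting harmful intent, and suggesting that the past decade of investment in interpretability has yielded little return.

💡 Representation engineering (RepE), an emerging field, emphasizes analyzing patterns of activity across many neurons rather than analyzing neurons or circuits in isolation. One argument for RepE is that a model's overall behavior usually stays the same even when some of its layers are removed, which suggests that the organization between neurons matters more than the individual components.

🛠️ RepE can identify, amplify, and suppress specific characteristics of a model, enabling control and safety. With RepE, models can be made to unlearn dual-use concepts, be more honest, be more robust to adversarial attacks, and even have their values edited.

🚀 Although Hendrycks suggests focusing on RepE, the post's author argues that SAEs are not without value. For example, Golden Gate Claude and experiments that "clamp" and "upweight" circuits in LLMs both suggest that SAEs have potential for interpreting certain concepts.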

Published on May 20, 2025 5:09 AM GMT

I recently read a post by Dan Hendrycks from xAI criticizing Anthropic's focus on Sparse Auto-Encoders as a tool for mechanistic interpretability.

You can find that post HERE. Some salient quotes below.

On SAEs:

Another technique initially hailed as a breakthrough is sparse autoencoders (SAEs) [...] getting SAEs to work reliably has proven challenging, possibly because some concepts are distributed too widely across the network, or perhaps because the model’s operations are not founded on a neat set of human-understandable concepts at all. This is the field that DeepMind recently deprioritized, noting that their SAE research had yielded disappointing results. In fact, given the task of detecting harmful intent in user inputs, SAEs underperformed a simple baseline.

[...] despite substantial efforts over the past decade, the returns from interpretability have been roughly nonexistent. To avoid overinvesting in ideas that are unlikely to work, potentially to the neglect of more effective ones, we should be more skeptical in the future about heavily prioritizing mechanistic interpretability over other types of AI research.

On RepE:

 

Representation engineering (RepE) is a promising emerging field that takes this view. Focusing on representations as the primary units of analysis – as opposed to neurons or circuits – this area finds meaning in the patterns of activity across many neurons.

A strong argument for this approach is the fact that models often largely retain the same overall behaviors even if entire layers of their structure are removed. As well as demonstrating their remarkable flexibility and adaptability – not unlike that of the brain – this indicates that the individual components in isolation offer far fewer insights than the organization between them. In fact, because of emergence, analyzing complex systems at a higher level is often enough to understand or predict their behavior, while detailed lower-level inspection can be unnecessary or even misleading.

RepE can identify, amplify, and suppress characteristics. RepE helps manipulate model internals to control them and make them safer. Since the original RepE paper, we have used RepE to make models unlearn dual-use concepts, be more honest, be more robust to adversarial attacks, edit AIs’ values, and more.
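To make "identify, amplify, and suppress" concrete, here is a minimal sketch of one simple way to do it: take the difference of mean activations between examples with and without a trait, then shift an activation along that direction. This is an illustration only, not necessarily the procedure used in the RepE paper; the toy activations and the names `concept_direction`, `steer`, and `projection` are made up for this example.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def concept_direction(with_trait, without_trait):
    """Identify: difference of mean activations between the two groups."""
    mu_pos = mean_vector(with_trait)
    mu_neg = mean_vector(without_trait)
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def steer(activation, direction, alpha):
    """Amplify (alpha > 0) or suppress (alpha < 0) by shifting along direction."""
    return [a + alpha * d for a, d in zip(activation, direction)]

def projection(activation, direction):
    """Scalar readout: dot product of the activation with the direction."""
    return sum(a * d for a, d in zip(activation, direction))

# Hypothetical 3-dimensional activations; real ones would come from a model layer.
honest = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dishonest = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = concept_direction(honest, dishonest)
x = [1.0, 1.0, 1.0]
amplified = steer(x, direction, 0.5)
suppressed = steer(x, direction, -0.5)

print(projection(suppressed, direction),
      projection(x, direction),
      projection(amplified, direction))
```

The readout along the direction rises when the activation is steered with a positive `alpha` and falls with a negative one, while the components orthogonal to the direction are left alone.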

I'm not sure why the article seems to take the framing that SAEs and RepE cannot coexist as safety methods. On the most charitable interpretation, I think Dan is arguing that investment should focus on RepE over SAEs.

However, from my perspective there have been some encouraging results with SAEs. Golden Gate Claude, or even just being able to "clamp" and/or "upweight" circuits as we saw in the Dallas example from "biology of an LLM", would both seem to indicate that SAE features are on the right track toward interpretability for at least some concepts. Ultimately, I don't see why RepE and SAEs can't both be valuable tools.
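For readers unfamiliar with what "clamping" an SAE feature means, a toy sketch may help: encode an activation into sparse features, overwrite one feature with a fixed value, and decode back. Everything here is hypothetical (the tiny hand-picked weights, the 2-dimensional activation, the function names); it is not Anthropic's implementation, and a real SAE is trained rather than written by hand.

```python
def relu(v):
    """Elementwise ReLU, which keeps feature activations non-negative."""
    return [max(0.0, x) for x in v]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def encode(x, w_enc, b_enc):
    """SAE encoder: features = ReLU(W_enc x + b_enc)."""
    return relu([z + b for z, b in zip(matvec(w_enc, x), b_enc)])

def decode(f, w_dec, b_dec):
    """SAE decoder: reconstruction = W_dec f + b_dec."""
    return [z + b for z, b in zip(matvec(w_dec, f), b_dec)]

def clamp_feature(features, index, value):
    """Return a copy of the features with one feature fixed to a chosen value."""
    out = list(features)
    out[index] = value
    return out

# Hypothetical weights: 2-dimensional activation, 3 features.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B_ENC = [0.0, 0.0, 0.0]
W_DEC = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
features = encode(x, W_ENC, B_ENC)
reconstruction = decode(features, W_DEC, B_DEC)
steered = decode(clamp_feature(features, 1, 5.0), W_DEC, B_DEC)

print(features, reconstruction, steered)
```

In this toy setup only feature 0 fires on `x`, the unmodified decode reproduces `x`, and clamping feature 1 to a high value pushes the decoded activation along that feature's decoder direction. The steered activation would then be written back into the model's forward pass.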

I would like to hear what other people think about this criticism. Is it valid, such that focus should no longer be put into SAEs, or is there enough "there there" that it's still an avenue worth pursuing?


