[Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

 

This post examines how SAEs extract feature information from language models, finds that SAE latent classifiers are unreliable and that a phenomenon called feature absorption undermines interpretability, and discusses the underlying issues and directions for future work.

🎯 SAE latents can appear to classify some feature of the input, but in practice are quite unreliable classifiers, much worse than linear probes; this unreliability stems in part from the gap between what we actually want and the training objective.

💡 Two questions are posed: to what extent do SAEs extract interpretable latents from LLMs, and how do changes in SAE hyperparameters affect interpretability? Testing SAE performance on a simple task reveals feature absorption as a novel obstacle.

🚧 Feature absorption is a pernicious, asymmetric form of feature splitting: an SAE latent appears to track a human-interpretable concept but fails to activate on some examples where it seemingly should, while roughly token-aligned latents absorb the feature direction and activate in place of the main latent.

❌ Feature absorption causes many problems: it explains why feature circuits cannot be sparse, hyperparameter tuning is unlikely to eliminate absorption entirely, and when features co-occur, sparsity can reduce interpretability.

Published on September 25, 2024 9:31 AM GMT

This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Joseph Bloom (Decode Research). Find out more about the programme and express interest in upcoming iterations here.

This high-level summary will be most accessible to those with relevant context, including an understanding of SAEs. The importance of this work rests in part on the surrounding hype and on potential philosophical issues. We encourage readers seeking technical details to read the paper on arxiv.

TLDR: This is a short post summarising the key ideas and implications of our recent work studying how character information represented in language models is extracted by SAEs. Our most important result shows that SAE latents can appear to classify some feature of the input, but actually turn out to be quite unreliable classifiers (much worse than linear probes). We think this unreliability is in part due to a difference between what we actually want (an "interpretable decomposition") and what we train against (sparsity + reconstruction). We think there are many possibly productive follow-up investigations.

We pose two questions:

    1. To what extent do Sparse Autoencoders (SAEs) extract interpretable latents from LLMs? The success of SAE applications (such as detecting safety-relevant features or efficiently describing circuits) will rely on whether SAE latents are reliable classifiers and provide an interpretable decomposition.

    2. How does varying the hyperparameters of the SAE affect its interpretability? Much time and effort is being invested in iterating on SAE training methods; can we provide a guiding signal for these endeavours?

To answer these questions, we tested SAE performance on a simple first-letter identification task using over 200 Gemma Scope SAEs. By focussing on a task with ground-truth labels, we could precisely measure the precision and recall of SAE latents tracking first-letter information.
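To make this concrete, below is a minimal sketch (not the paper's evaluation code) of how one could score a single SAE latent as a first-letter classifier and compare it against a logistic-regression probe. The arrays `resid_acts`, `latent_acts`, and `starts_with_s` are hypothetical stand-ins, filled with random toy data here rather than real Gemma activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Toy stand-ins for the real quantities (all hypothetical):
#   resid_acts     - (n_tokens, d_model) residual-stream activations per token
#   latent_acts    - (n_tokens,) activations of one SAE latent on the same tokens
#   starts_with_s  - (n_tokens,) bool, True if the token's first letter is "s"
rng = np.random.default_rng(0)
n_tokens, d_model = 1000, 64
resid_acts = rng.normal(size=(n_tokens, d_model))
starts_with_s = rng.random(n_tokens) < 0.1
# Toy latent that misses ~30% of true positives (mimicking absorption gaps).
latent_acts = np.where(starts_with_s & (rng.random(n_tokens) < 0.7),
                       rng.random(n_tokens), 0.0)

# Score the SAE latent as a binary classifier: "fires" means activation > 0.
latent_fires = latent_acts > 0
print("latent  P=%.2f R=%.2f" % (precision_score(starts_with_s, latent_fires),
                                 recall_score(starts_with_s, latent_fires)))

# Baseline: a logistic-regression probe trained on the residual stream.
# (On real activations the probe is the stronger classifier; this toy data is
# random, so it only exercises the measurement, not the paper's result.)
probe = LogisticRegression(max_iter=1000).fit(resid_acts, starts_with_s)
probe_preds = probe.predict(resid_acts)
print("probe   P=%.2f R=%.2f" % (precision_score(starts_with_s, probe_preds),
                                 recall_score(starts_with_s, probe_preds)))
```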

Our results revealed a novel obstacle to using SAEs for interpretability which we term 'Feature Absorption'. This phenomenon is a pernicious, asymmetric form of feature splitting where an SAE latent appears to track a human-interpretable concept, yet fails to activate on seemingly arbitrary examples where it should; instead, a roughly token-aligned latent absorbs the feature direction and activates in place of the more general latent.

Feature splitting vs feature absorption: in ordinary feature splitting, a concept is divided across several narrower latents that each remain interpretable and together still cover the concept; in feature absorption, the split is asymmetric, leaving gaps in the general latent's coverage that are filled by token-aligned latents instead.

Feature Absorption is problematic: it helps explain why feature circuits cannot be sparse, hyperparameter tuning is unlikely to eliminate absorption entirely, and when features co-occur, optimising for sparsity can actively reduce interpretability.
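As a rough operational picture of absorption, the sketch below flags tokens where a "starts with s" latent stays silent even though the token really does start with s, while some other latent whose decoder direction aligns with a linear-probe direction fires instead. This is an illustrative heuristic under assumed inputs (`W_dec`, `sae_acts`, `probe_dir`, and `main_latent` are hypothetical names), not the paper's metric, which additionally relies on causal interventions.

```python
import numpy as np

# Hypothetical stand-ins:
#   W_dec       - (n_latents, d_model) SAE decoder directions
#   sae_acts    - (n_tokens, n_latents) SAE latent activations per token
#   probe_dir   - (d_model,) linear-probe direction for "starts with s"
#   main_latent - index of the latent that usually tracks "starts with s"
#   positives   - (n_tokens,) bool, tokens whose first letter really is "s"

def absorption_candidates(W_dec, sae_acts, probe_dir, main_latent, positives,
                          cos_threshold=0.5):
    """Return (token, latent) pairs that look like feature absorption:
    the main latent is silent on a true positive, while another latent whose
    decoder direction has high cosine similarity with the probe fires instead."""
    # Cosine similarity between every decoder direction and the probe direction.
    dec_norms = np.linalg.norm(W_dec, axis=1)
    cos_sims = (W_dec @ probe_dir) / (dec_norms * np.linalg.norm(probe_dir) + 1e-8)

    candidates = []
    for tok in np.flatnonzero(positives):
        if sae_acts[tok, main_latent] > 0:
            continue  # main latent fired, so no absorption on this token
        # Latents that fired on this token and point roughly along the probe.
        fired = np.flatnonzero(sae_acts[tok] > 0)
        absorbing = [lat for lat in fired if cos_sims[lat] > cos_threshold]
        candidates.extend((tok, lat) for lat in absorbing)
    return candidates
```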

While our study has clear limitations, we think there’s compelling evidence for the existence of feature absorption. It’s important to note that our results use a single model (Gemma-2-2b), one SAE architecture (JumpReLU) and one task (the first-letter identification task). However, the qualitative results are striking (which readers can explore for themselves using our streamlit app) and causal interventions are built directly into our metric for feature absorption.

We’re excited about future work in a number of areas (ordered from most concrete/important to most exciting!):

    1. Validating feature absorption.
       Is it reduced in different SAE architectures? (We think likely no.) Does it appear for other kinds of features? (We think likely yes.)
    2. Refining feature absorption metrics, such as removing the need to train a linear probe.
    3. Exploring strategies for mitigating feature absorption.
       We think Meta-SAEs may be part of the answer. We’re also excited about fine-tuning against attribution sparsity with task-relevant metrics/datasets.
    4. Building better toy models and theories. Most prior toy models assume independent features, but can we construct “Toy Models of Feature Absorption” with features that co-occur, and subsequently construct SAEs which effectively decompose them (such as via the methods described in point 3)? A minimal sketch of such co-occurring toy data follows below.
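To illustrate the last item, here is a minimal sketch of what co-occurring features could look like as synthetic data: a parent feature (e.g. "starts with s") that is present whenever either child feature (e.g. a specific s-token) is present. This construction is purely illustrative (all directions, probabilities, and sizes are made up), not a toy model from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, d_model = 10_000, 32

def unit(v):
    """Normalise a vector to unit length."""
    return v / np.linalg.norm(v)

# Random unit directions for one parent feature and two child features.
parent_dir = unit(rng.normal(size=d_model))                       # "starts with s"
child_dirs = np.stack([unit(rng.normal(size=d_model)) for _ in range(2)])  # e.g. "short", "snake"

# Sample co-occurring features: each child implies the parent, breaking the
# independence assumption of most prior toy models.
child_on = rng.random((n_samples, 2)) < 0.05
parent_on = child_on.any(axis=1) | (rng.random(n_samples) < 0.02)

# Build activations as sums of active feature directions plus small noise.
acts = (parent_on[:, None] * parent_dir[None, :]
        + child_on @ child_dirs
        + 0.01 * rng.normal(size=(n_samples, d_model)))

# An SAE trained on `acts` could then be inspected for whether its "parent"
# latent stays silent exactly when a child feature is active (i.e. absorption).
```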

We believe our work provides a significant contribution by identifying and characterising the feature absorption phenomenon, highlighting a critical challenge in the pursuit of interpretable AI systems.


