JumpReLU SAEs + Early Access to Gemma 2 SAEs

JumpReLU SAEs are a new autoencoder architecture that replaces the standard ReLU with a discontinuous JumpReLU activation function, achieving state-of-the-art reconstruction accuracy at a given sparsity level while preserving interpretability. The work was led by Google DeepMind's mechanistic interpretability team, which will release the weights of hundreds of JumpReLU SAEs, trained on every layer and sublayer of the Gemma 2 2B and 9B models, in the coming weeks.

🤔 JumpReLU SAEs are a new sparse autoencoder (SAE) architecture that replaces the standard ReLU with a discontinuous JumpReLU activation function, achieving state-of-the-art reconstruction accuracy at a given sparsity level while preserving interpretability. Compared with other recent approaches such as Gated and TopK SAEs, JumpReLU SAEs achieve state-of-the-art reconstruction fidelity on Gemma 2 9B activations.

🚀 JumpReLU SAEs are a simple but effective extension of SAEs. They are trained with straight-through estimators (STEs), which makes training possible even though the discontinuous JumpReLU function appears in the SAE's forward pass. STEs are also used to train the L0 sparsity objective directly, rather than a proxy such as L1, avoiding problems like shrinkage.

💡 JumpReLU SAEs outperform Gated SAEs across a range of scenarios and can be viewed as Gated SAEs++, being cheaper to train and better performing. They can be run in existing Gated implementations.

🔓 In the coming weeks, the researchers will release the weights of hundreds of JumpReLU SAEs, trained on every layer and sublayer of the Gemma 2 2B and 9B models, making it much easier for researchers to use them in a wide range of research projects.

🤝 By releasing these weights, the team hopes to put the SAEs into the hands of more researchers and looks forward to community feedback, helping to advance research on language model interpretability.

Published on July 19, 2024 4:10 PM GMT

New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan!

We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability. We train through the discontinuity with straight-through estimators, which also let us directly optimise L0.
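The JumpReLU activation is z * H(z - θ): a ReLU whose per-feature threshold θ is learned rather than fixed at zero. The training trick is giving θ a useful gradient even though the Heaviside step H is flat almost everywhere. Here is a minimal PyTorch sketch of that straight-through idea, using a rectangle-kernel pseudo-derivative for the threshold; the kernel choice, default bandwidth and all names are illustrative assumptions, not the paper's exact recipe.

```python
import torch


class RectangleSTE(torch.autograd.Function):
    """Heaviside step H(z - theta) with a straight-through gradient for theta.

    Forward: the exact (discontinuous) step function.
    Backward: theta's gradient is approximated with a rectangle (boxcar)
    kernel of width `bandwidth`, so theta only receives signal from
    pre-activations that land close to the threshold.
    """

    @staticmethod
    def forward(ctx, z, theta, bandwidth):
        ctx.save_for_backward(z, theta)
        ctx.bandwidth = bandwidth
        return (z > theta).to(z.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        # Pseudo-derivative dH(z - theta)/dtheta ~ -(1/eps) * 1[|z - theta| < eps/2]
        in_window = ((z - theta).abs() < eps / 2).to(z.dtype)
        grad_theta = -grad_out * in_window / eps
        # Sum over broadcast (batch) dimensions so the gradient matches theta's shape.
        while grad_theta.dim() > theta.dim():
            grad_theta = grad_theta.sum(0)
        # No gradient flows to z through the step itself; z still gets a gradient
        # through the z * H(z - theta) product outside this function.
        return None, grad_theta, None


def jumprelu(z, theta, bandwidth=1e-3):
    """JumpReLU(z) = z * H(z - theta), with a trainable threshold via the STE."""
    return z * RectangleSTE.apply(z, theta, bandwidth)
```

The forward pass stays exact; only the backward pass is approximated, which is what lets the discontinuity sit inside an otherwise ordinary SAE.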

To accompany this, we will release the weights of hundreds of JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B in a few weeks. Apply now for early access to the 9B ones! We're keen to get feedback from the community, and to get these into the hands of researchers as fast as possible. There are a lot of great projects that we hope will be much easier with open SAEs on capable models!

Figures from the post: JumpReLU activations; scenarios where JumpReLU is superior; JumpReLUs are state of the art.

Gated SAEs already reduce to JumpReLU activations after weight tying, so JumpReLU SAEs can be thought of as Gated SAEs++, but less computationally intensive to train and better performing. They should be runnable in existing Gated implementations.
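Since the only change from a vanilla (ReLU) SAE is the activation function, swapping JumpReLU into an existing vanilla or Gated SAE codebase is mostly a matter of adding the threshold parameter. A rough sketch of what such a module might look like, reusing the `jumprelu` helper above; the parameter names, initialisation and log-space threshold parameterisation are my assumptions, not the released checkpoints' conventions.

```python
import torch
import torch.nn as nn


class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU SAE: a vanilla SAE with the ReLU swapped for JumpReLU."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        # Learnable per-feature thresholds, kept positive via a log-space parameter.
        self.log_theta = nn.Parameter(torch.full((d_sae,), -2.0))

    def forward(self, x):
        # x: (batch, d_model) residual-stream or MLP activations.
        pre_acts = (x - self.b_dec) @ self.W_enc + self.b_enc
        acts = jumprelu(pre_acts, self.log_theta.exp())
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts, pre_acts
```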

Abstract:

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model’s (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse – two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs – where we replace the ReLU with a discontinuous JumpReLU activation function – and are similarly efficient to train and run. By utilising straight-through estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE’s forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage.
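The abstract's last point, training L0 directly instead of an L1 proxy, can be made concrete with the same straight-through Heaviside: the number of active features is itself made (approximately) differentiable with respect to the thresholds. A hedged sketch of such a loss, building on the `RectangleSTE` and `JumpReLUSAE` sketches above; the coefficient and bandwidth values are placeholders, not the paper's settings.

```python
def jumprelu_sae_loss(sae, x, sparsity_coeff=1e-3, bandwidth=1e-3):
    """Squared-error reconstruction plus a direct (STE-differentiable) L0 penalty."""
    recon, _, pre_acts = sae(x)
    recon_loss = (recon - x).pow(2).sum(-1).mean()
    # L0 = number of features above threshold, counted with the STE Heaviside so
    # the thresholds receive a gradient; no L1 term, hence no shrinkage pressure.
    theta = sae.log_theta.exp()
    l0 = RectangleSTE.apply(pre_acts, theta, bandwidth).sum(-1).mean()
    return recon_loss + sparsity_coeff * l0
```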



