MarkTechPost@AI · April 5, 13:47
This AI Paper Introduces a Short KL+MSE Fine-Tuning Strategy: A Low-Cost Alternative to End-to-End Sparse Autoencoder Training for Interpretability

This article presents a new approach to sparse autoencoder (SAE) training aimed at improving the interpretability of large language models (LLMs). By adding a brief joint KL-divergence and mean squared error (MSE) fine-tuning phase at the end of training, the method markedly improves reconstruction accuracy while avoiding any changes to the model architecture and keeping compute costs down. The results show that the approach performs well across a range of models and sparsity settings, effectively resolving the mismatch between the training objective and inference-time performance, and providing a new tool for analyzing and understanding LLMs.

🧠 Sparse autoencoders (SAEs) are key tools for analyzing the inner workings of large language models (LLMs). By turning complex internal states into interpretable components, they help researchers understand what individual neurons contribute, enabling circuit analysis and building confidence that model behavior is trustworthy.

💡 Traditional SAE training suffers from a mismatch between the training objective and inference performance: models are trained with mean squared error (MSE) but never optimized for cross-entropy loss, so the reconstructions underperform at actual inference time. Full end-to-end training can fix this, but its compute cost is high.

🚀 A new solution performs joint KL-divergence and MSE fine-tuning in the final stage of training, using only a small amount of data (25 million tokens, just 0.5% to 10% of the total training data). The method avoids modifying the model architecture, sharply reduces compute, saves up to 90% of training time, and requires no extra infrastructure or algorithmic changes.

⚙️ The fine-tuning uses a dynamic balancing mechanism that recomputes the KL-divergence scaling factor for each training batch, keeping the total loss on the same scale as the original MSE loss. This simplifies hyperparameter tuning and makes the recipe easy to transfer between models. Experiments show the method narrows the cross-entropy loss gap by 20% to 50%.

📈 While the method excels at improving predictive accuracy, interpretability metrics are mixed, depending on the specific evaluation task and model architecture. ReLU-based SAEs improve on sparse probing and RAVEL metrics but lose ground on spurious-correlation and targeted-probing tasks; TopK-based models show smaller, less consistent changes.

Sparse autoencoders are central tools in analyzing how large language models function internally. Translating complex internal states into interpretable components allows researchers to break down neural activations into parts that make sense to humans. These methods support tracing logic paths and identifying how particular tokens or phrases influence model behavior. Sparse autoencoders are especially valuable for interpretability applications, including circuit analysis, where understanding what each neuron contributes is crucial to ensuring trustworthy model behavior.

A pressing issue with sparse autoencoder training lies in aligning training objectives with how performance is measured during model inference. Traditionally, training uses mean squared error (MSE) on precomputed model activations. However, this doesn’t optimize for cross-entropy loss, which is used to judge performance when reconstructed activations replace the originals. This mismatch results in reconstructions that perform poorly in real inference settings. More direct methods that train on both MSE and KL divergence solve this issue, but they demand considerable computation, which limits their adoption in practice.
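To make the mismatch concrete, here is a minimal PyTorch-style sketch (not from the paper; `model`, `sae`, `layer`, and `tokens` are hypothetical placeholders for a HuggingFace-style language model, a trained sparse autoencoder, the hooked activation site, and a token batch). It contrasts the usual training signal, MSE on cached activations, with the inference-time check, where reconstructions are spliced into the forward pass and compared against the original predictions via KL divergence.

```python
import torch
import torch.nn.functional as F

def mse_training_loss(sae, cached_acts):
    """Standard SAE objective: reconstruct precomputed (cached) activations."""
    recon = sae(cached_acts)
    return F.mse_loss(recon, cached_acts)

@torch.no_grad()
def inference_gap(model, sae, layer, tokens):
    """What matters at inference: splice the SAE reconstruction into the forward
    pass and measure how much the next-token distribution shifts."""
    clean_logits = model(tokens).logits

    def replace_with_recon(module, inputs, output):
        # Assumes the hooked module returns a plain activation tensor.
        return sae(output)

    handle = layer.register_forward_hook(replace_with_recon)
    spliced_logits = model(tokens).logits
    handle.remove()

    # KL divergence between original and spliced predictions
    return F.kl_div(
        F.log_softmax(spliced_logits, dim=-1),
        F.log_softmax(clean_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
```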

Several approaches have attempted to improve sparse autoencoder training. Full end-to-end training that combines KL divergence and MSE losses offers better reconstruction quality, but it carries a computational cost up to 48× higher, owing to multiple forward passes and the inability to amortize activation collection. An alternative uses LoRA adapters to fine-tune the base language model around a fixed autoencoder. While efficient, this method modifies the model itself, which isn't ideal for applications that require analyzing the unaltered architecture.

An independent researcher from DeepMind has introduced a new solution that applies a brief KL+MSE fine-tuning step at the tail end of training, specifically the final 25 million tokens, just 0.5–10% of the usual training data volume; the models studied come from the Gemma team and the Pythia project. The approach avoids altering the model architecture and minimizes complexity while achieving performance similar to full end-to-end training. It also allows training-time savings of up to 90% in scenarios with large models or amortized activation collection, without requiring additional infrastructure or algorithmic changes.

To implement this, training begins with standard MSE on shuffled activations, followed by a short KL+MSE fine-tuning phase. This phase uses a dynamic balancing mechanism to adjust the weight of KL divergence relative to the MSE loss: instead of manually tuning a fixed β parameter, the system recalculates the KL scaling factor for every training batch, so that the total combined loss stays on the same scale as the original MSE loss. This dynamic control removes the need for an extra hyperparameter and simplifies transfer across model types. Fine-tuning runs with a linear decay of the learning rate from 5e-5 to 0 over the 25M-token window, aligning the process with practical compute budgets and preserving the sparsity settings from earlier training.
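A minimal sketch of how such per-batch rebalancing could look, assuming the scaling factor is simply the detached ratio of the two losses (the paper's exact formula may differ); only the 5e-5 learning rate and the linear decay over the 25M-token window are taken from the text:

```python
import torch

def combined_loss(mse_loss: torch.Tensor, kl_loss: torch.Tensor) -> torch.Tensor:
    """Rescale KL each batch so the combined objective stays on the MSE scale,
    avoiding a manually tuned, fixed beta hyperparameter."""
    kl_scale = (mse_loss / (kl_loss + 1e-8)).detach()  # recomputed per batch
    return 0.5 * mse_loss + 0.5 * kl_scale * kl_loss

def linear_decay_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate decays linearly from 5e-5 to 0 across the fine-tuning window."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Hypothetical fine-tuning loop: `optimizer`, `finetune_batches`, and the
# `mse`/`kl` helpers stand in for whatever the main MSE run already uses.
# for step, batch in enumerate(finetune_batches):
#     for group in optimizer.param_groups:
#         group["lr"] = linear_decay_lr(step, total_steps)
#     loss = combined_loss(mse(batch), kl(batch))
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```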

Performance evaluations show that this approach reduced the cross-entropy loss gap by 20% to 50%, depending on the sparsity setting. For example, on Pythia-160M with K=80, the KL+MSE fine-tuned model performed slightly better than a full end-to-end model while requiring 50% less wall-clock time. At higher sparsity (K=160), the fine-tuned MSE-only model achieved similar or marginally better outcomes than KL+MSE, possibly due to the simplicity of the objective. Tests with LoRA and linear adapters revealed that their benefits do not stack, as each method corrects a shared error source in MSE-trained autoencoders. Even very low-rank LoRA adapters (rank 2) captured over half of the performance gains of full fine-tuning.

Although cross-entropy results consistently favored the fine-tuned method, interpretability metrics showed mixed trends. On SAEBench, ReLU-based sparse autoencoders saw improvements in sparse probing and RAVEL metrics, while performance on spurious correlation and targeted probe tasks dropped. TopK-based models showed smaller, more inconsistent changes. These results suggest that fine-tuning may yield reconstructions better aligned with model predictions but may not always enhance interpretability, depending on the specific evaluation task or architecture type.

This research underscores a meaningful advancement in sparse autoencoder training: a computationally light, technically simple method that improves reconstruction accuracy without modifying base models. It addresses key alignment issues in training objectives and delivers practical results across models and sparsity levels. While not uniformly superior in all interpretability metrics, it offers a favorable trade-off between performance and simplicity for tasks like circuit-level analysis.


Check out the Paper. All credit for this research goes to the researchers of this project.

