cs.AI updates on arXiv.org, July 23, 12:03
On the transferability of Sparse Autoencoders for interpreting compressed models

This paper examines the differences between sparse autoencoders (SAEs) for original and compressed models. It finds that SAEs trained on the original model can interpret the compressed model, with only slightly worse performance than SAEs trained directly on the compressed model. Simply pruning the original SAE achieves comparable results, reducing the cost of training SAEs.

arXiv:2507.15977v1 Announce Type: cross Abstract: Modern LLMs face inference efficiency challenges due to their scale. To address this, many compression methods have been proposed, such as pruning and quantization. However, the effect of compression on a model's interpretability remains elusive. While several model interpretation approaches exist, such as circuit discovery, Sparse Autoencoders (SAEs) have proven particularly effective in decomposing a model's activation space into its feature basis. In this work, we explore the differences in SAEs for the original and compressed models. We find that SAEs trained on the original model can interpret the compressed model, albeit with slight performance degradation compared to an SAE trained on the compressed model. Furthermore, simply pruning the original SAE itself achieves performance comparable to training a new SAE on the pruned model. This finding enables us to mitigate the extensive training costs of SAEs.
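For readers unfamiliar with the setup, the sketch below illustrates the ingredients the abstract refers to: a sparse autoencoder trained on one model's activations, reused on a compressed model's activations, and pruned down to its most active features instead of being retrained. This is not the authors' code; all dimensions, hyperparameters, the synthetic activations, and the frequency-based pruning heuristic are illustrative assumptions.

```python
# Minimal SAE sketch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Overcomplete SAE: activations -> sparse feature codes -> reconstruction."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        codes = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        recon = self.decoder(codes)           # reconstruction of the input activations
        return recon, codes


def sae_loss(recon, codes, x, l1_coeff: float = 1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the feature codes."""
    return ((recon - x) ** 2).mean() + l1_coeff * codes.abs().mean()


def prune_sae(sae: SparseAutoencoder, codes: torch.Tensor, keep_frac: float = 0.5):
    """Keep only the most frequently active features (a crude form of SAE pruning)."""
    freq = (codes > 0).float().mean(dim=0)                    # activation frequency per feature
    keep = freq.topk(int(keep_frac * freq.numel())).indices   # indices of retained features
    pruned = SparseAutoencoder(sae.decoder.out_features, keep.numel())
    pruned.encoder.weight.data = sae.encoder.weight.data[keep]
    pruned.encoder.bias.data = sae.encoder.bias.data[keep]
    pruned.decoder.weight.data = sae.decoder.weight.data[:, keep]
    pruned.decoder.bias.data = sae.decoder.bias.data.clone()
    return pruned


if __name__ == "__main__":
    d_model, d_features, n_tokens = 64, 256, 1024
    # Stand-ins for residual-stream activations from an original and a compressed model.
    acts_original = torch.randn(n_tokens, d_model)
    acts_compressed = acts_original + 0.05 * torch.randn(n_tokens, d_model)

    sae = SparseAutoencoder(d_model, d_features)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
    for _ in range(200):   # train the SAE on the original model's activations
        recon, codes = sae(acts_original)
        loss = sae_loss(recon, codes, acts_original)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        recon_c, _ = sae(acts_compressed)      # reuse the same SAE on the compressed model
        print("recon MSE on compressed activations:",
              ((recon_c - acts_compressed) ** 2).mean().item())
        _, codes_o = sae(acts_original)
        small = prune_sae(sae, codes_o)        # prune the SAE instead of retraining from scratch
        recon_p, _ = small(acts_compressed)
        print("pruned-SAE recon MSE:",
              ((recon_p - acts_compressed) ** 2).mean().item())
```

In the paper's terms, comparing the reconstruction (and downstream interpretability) of the reused and pruned SAEs against an SAE trained from scratch on the compressed model is what motivates the cost savings claimed in the abstract; the frequency-based pruning shown here is only one plausible heuristic.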


Related tags

Sparse Autoencoders, Model Compression, Interpretability, Performance Evaluation