Topics: Sparse Autoencoder (SAE), Knowledge Circuits, Polysemanticity
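For context on the first topic, the sketch below illustrates the basic sparse-autoencoder idea used in the works listed in the references: a model's internal activations are encoded into a much wider, mostly-zero feature vector and decoded back, trained with a reconstruction loss plus an L1 sparsity penalty. The PyTorch module, layer sizes, and L1 coefficient here are illustrative assumptions for exposition, not the exact recipe of any cited paper.

```python
# Minimal sparse autoencoder (SAE) sketch; all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, d_hidden)   # encoder: activations -> features
        self.W_dec = nn.Linear(d_hidden, d_model)   # decoder: features -> reconstruction

    def forward(self, x: torch.Tensor):
        f = F.relu(self.W_enc(x))    # non-negative, ideally sparse feature activations
        x_hat = self.W_dec(f)        # reconstruct the original activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # reconstruction error + L1 penalty that pushes most features toward zero
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()

# Usage: x would normally be residual-stream or MLP activations collected from an LLM;
# here a random batch stands in for them.
sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
x = torch.randn(32, 768)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
loss.backward()
```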
References
Yao, Yunzhi, et al. "Knowledge Circuits in Pretrained Transformers." NeurIPS 2024.
Ou, Yixin, et al. "How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training." ACL 2025.
Huben, Robert, et al. "Sparse Autoencoders Find Highly Interpretable Features in Language Models." The Twelfth International Conference on Learning Representations (ICLR). 2024.
Gao, Leo, et al. "Scaling and Evaluating Sparse Autoencoders." The Thirteenth International Conference on Learning Representations (ICLR). 2025.
Anthropic. "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning." Transformer Circuits Thread. 2023. https://transformer-circuits.pub/2023/monosemantic-features
Anthropic. "On the Biology of a Large Language Model." Transformer Circuits Thread. 2025. https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Shu, Dong, et al. "A Survey on Sparse Autoencoders: Interpreting the Internal Mechanisms of Large Language Models." arXiv. 2025.
Wu, Xuansheng, et al. "Self-Regularization with Latent Space Explanations for Controllable LLM-Based Classification." KDD. 2025.
Wu, Xuansheng, et al. "Interpreting and Steering LLMs with Mutual Information-Based Explanations on Sparse Autoencoders." arXiv. 2025.
About the Speakers
Yunzhi Yao is a Ph.D. student in the College of Computer Science and Technology at Zhejiang University, advised by Prof. Huajun Chen and Prof. Ningyu Zhang, and is currently a visiting researcher in Nanyun Peng's group at UCLA. His research focuses on knowledge augmentation, knowledge editing, and interpretability for large language models.
Xuansheng Wu is a fourth-year Ph.D. student in the Department of Computer Science at the University of Georgia. His research focuses on usable explainability for large language models (Usable XAI), with an emphasis on better understanding how knowledge is represented in a model's latent space in order to enable better model control. He has published 14 peer-reviewed papers with over 700 citations, and has interned with leading industry teams including Tencent AI Lab, Baidu NLP, and Amazon Rufus.