From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations

This post examines the intrinsic geometric structure of large language model (LLM) feature space, in particular its block-diagonal character. By measuring cosine-similarity heatmaps, the researchers find that LLM activation space is neither random nor fully orthogonal: activations cluster into blocks aligned with semantic categories. The post also proposes and tests hypotheses about hierarchical sub-clustering and linear co-linearity of features, laying the groundwork for sparse autoencoder (SAE) priors that match this natural geometry. These findings will guide follow-up work that adds block-structured and graph-Laplacian priors to V-SAEs and Crosscoders, aiming to disentangle and represent concepts in language models more effectively.

📦 **Block-diagonal structure reveals semantic clustering:** The study finds that LLM feature space is not randomly distributed but shows a clear block-diagonal structure: concepts from the same semantic category (such as family relations) cluster together, forming highly correlated "blocks" in the similarity matrix, while correlations across categories stay weak. This offers an important clue for understanding and exploiting LLMs' internal representations.

🌳 **Hierarchical sub-clustering and linear co-linearity:** Within each category block, finer semantic relations (such as parent-child, sibling, grandparent) form nested sub-clusters. In addition, embeddings of related concepts tend to lie along a common line or within a small subspace, so one concept's embedding can be predicted as a linear combination of another's, revealing linear dependence among semantic relations.

💡 **Guidance for geometry-aware prior design:** Building on this picture of feature-space geometry, the post proposes designing sparse autoencoder (SAE) priors that match these structures. Block-structured Gaussian covariances, tree-Laplacian regularization, or hyperbolic priors can help SAEs capture and preserve the true semantic structure of LLMs, avoiding artificial orthogonalization and enabling more effective feature disentanglement.

🚀 **Engineering applications and outlook:** These findings lay the groundwork for integrating block-structured and graph-Laplacian priors into V-SAEs and Crosscoders. Applying such geometry-aware priors to real LLM activations aims to improve how semantic information is represented and to provide stronger tools for understanding and steering the internal workings of LLMs.

Published on July 19, 2025 12:26 PM GMT

by Yuxiao Li, Zachary Baker, Maxim Panteleev, Maxim Finenko

June 2025 | SPAR Spring '25

A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"

TL;DR: We explore the intrinsic block-diagonal geometry of LLM feature space--first observed in raw embeddings and family-tree probes--by measuring cosine-similarity heatmaps. These diagnostics set the stage for baking block-structured and graph-Laplacian priors into V-SAEs and Crosscoders in later posts. Assumptions. tbd.

About this series

This is the second post of our series on how realistic feature geometry in language model (LM) embeddings can be discovered and then encoded into sparse autoencoder (SAE) priors. Since February, we have combined probabilistic modeling, geometric analysis, and mechanistic interpretability.

Series Table of Contents

Part I: Toy model comparison of isotropic vs global-correlation priors in V-SAE

➡️ Part II (you are here): Block-diagonal structures in toy LM activations

Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)

Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes


0. Background Story

Imagine walking into a library where every book—on wolves, quantum mechanics, and Greek myths—is piled into a single cramped shelf, pages interleaved and impossible to separate. This is the chaos of superposition in neural activations: too many concepts squeezed into too few dimensions. Traditional SAEs deal with the problem by demanding perfect orthogonality—like a librarian who insists each title stand ten feet apart—scattering lions and tigers to opposite ends of the building. In this post, we instead first map out the library’s natural wings—block-diagonal clusters of related concepts in LLM activations—so that when we design our SAE priors, we preserve true semantic neighborhoods while still untangling every hidden concept.

I. From Priors to Geometry: Why We Need to "Look Under the Hood"

Last week we kicked off this series with a dive into Variational SAEs (V-SAEs) and showed their key ability: sculpting feature disentanglement via variational priors. On simple synthetic benchmarks, the vanilla SAE struggled to separate correlated latents, whereas a global-correlation prior dramatically purified the features. But one question remained:

What kind of prior should we choose if we want SAEs to capture the rich, categorical geometry we actually see in language models?

Answering that requires exploring the true layout of the SAE's latent space and the model's raw embedding space. Only once we know how concepts cluster "in the wild" can we bake those patterns back into our priors--sharpening real semantic blocks and eliminating spurious overlap.

II. Related Work: Mapping and Manipulating Semantic Geometry

A growing body of work has begun to chart the geometry of learned representations--and to use that map to guide feature learning:

Taken together, these works paint a consistent picture: LLM feature spaces are neither random nor purely orthogonal under superposition--they exhibit structured geometry that we can and should exploit.

III. Propositions on Feature Geometry (Informal)

Before designing SAE priors, we posit three core hypotheses about how relational features arrange themselves in latent space:

Informal Proposition 1 (Categorical Block-Diagonal Structure). Embeddings of relations from the same semantic category cluster together, yielding a block-diagonal pattern in the pairwise similarity matrix. Off-block correlations (across distinct categories) remain near zero.
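
As a quick illustration (our own sketch on synthetic data, not the real LM activations used later), the block-diagonal claim can be checked by comparing the mean cosine similarity within a category block against the off-block mean:

```python
# Minimal sketch (not the authors' code; data and names are made up) of the
# Proposition 1 diagnostic: within-block vs. off-block mean cosine similarity.
import numpy as np

def block_diagnostic(embeddings, labels):
    """embeddings: (n, d) array; labels: (n,) integer category ids."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                                  # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = S[same & off_diag].mean()           # intra-category (block) mean
    across = S[~same].mean()                     # cross-category (off-block) mean
    return within, across

# Toy check: two categories drawn around different mean directions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 64))
emb = np.vstack([c + 0.3 * rng.normal(size=(20, 64)) for c in centers])
lab = np.repeat([0, 1], 20)
print(block_diagnostic(emb, lab))                # expect within >> across ≈ 0
```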

 

Informal Proposition 2 (Hierarchical Sub-Clustering). Within each category's block, finer sub-relations form nested sub-clusters: a two-level hierarchy that can be revealed by removing global distractor dimensions.
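
One rough way to probe this (again our own sketch; the Ward linkage and the choice of two removed components are assumptions, not the original method) is to project out a few globally shared principal directions and then run agglomerative clustering inside one category's block:

```python
# Rough sketch of the Proposition 2 diagnostic (illustrative choices only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def remove_top_pcs(X, k=2):
    """Project out the k leading principal directions shared across the block
    (the 'global distractor' dimensions)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc - Xc @ Vt[:k].T @ Vt[:k]

def sub_clusters(block_embeddings, n_clusters=3):
    """Agglomerative clustering inside one category block, after cleaning."""
    cleaned = remove_top_pcs(block_embeddings)
    Z = linkage(cleaned, method="ward")          # bottom-up hierarchical tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```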

 

Informal Proposition 3 (Linear Co-Linearity of Related Roles). Embeddings of semantically related roles lie nearly on a common line--or within a small subspace--so that one embedding can be predicted as a linear combination of another plus a small residual.
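
A minimal co-linearity check (illustrative only; the vectors below are synthetic) fits the best scalar multiple of one role's embedding to predict another's and reports the fraction of squared norm it explains:

```python
# Illustrative co-linearity check (our sketch): how much of one role's
# embedding is explained by the best scalar multiple of a related role's.
import numpy as np

def colinearity_score(u, v):
    """Fraction of ||v||^2 captured by the least-squares fit v ≈ a·u."""
    a = float(u @ v) / float(u @ u)              # optimal scalar coefficient
    residual = v - a * u
    return 1.0 - float(residual @ residual) / float(v @ v)

# A 'child' vector built as a scaled 'parent' vector plus a small residual
# scores close to 1, whereas two unrelated random vectors score near 0.
rng = np.random.default_rng(0)
parent = rng.normal(size=128)
child = 0.8 * parent + 0.1 * rng.normal(size=128)
print(colinearity_score(parent, child))                  # ≈ 0.98
print(colinearity_score(parent, rng.normal(size=128)))   # ≈ 0
```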

Finally, these observed patterns suggest a guiding principle for our next step:

Informal Proposition 4 (Geometry-Aware Priors Refine Features). If we can measure the latent geometry (blocks, hierarchies, co-linearity), then we can engineer V-SAE priors--via block-structured covariance, tree-based Laplacians, or hyperbolic priors--to align the learned dictionary with that geometry, avoiding artificial orthogonality and preserving true semantic continuity.

IV. Empirical Structure Discovery

We tested Propositions 1-3 on synthetic family-tree embeddings in toy experiments.
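
Since the family-tree data and heatmap figures are not reproduced here, the sketch below is a toy stand-in for that setup (our own construction, with made-up category names): each relation embedding is a category center plus a sub-relation offset plus noise, and the cosine-similarity heatmap, ordered by category, is where the blocks and nested sub-blocks of Propositions 1-2 would show up.

```python
# Toy stand-in for the family-tree probe (assumed setup, not the authors' data):
# relation embedding = category center + sub-relation offset + noise, so
# Propositions 1-2 predict visible blocks and nested sub-blocks in the heatmap.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d, per_relation = 128, 10
categories = {"family": ["parent", "child", "sibling", "grandparent"],
              "animals": ["lion", "tiger", "wolf"],
              "physics": ["quark", "photon"]}

vectors, labels = [], []
for cat, relations in categories.items():
    center = rng.normal(size=d)                    # shared category direction
    for rel in relations:
        offset = 0.4 * rng.normal(size=d)          # sub-cluster structure
        for _ in range(per_relation):
            vectors.append(center + offset + 0.2 * rng.normal(size=d))
            labels.append(f"{cat}/{rel}")

X = np.array(vectors)
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = X @ X.T                                        # cosine-similarity heatmap

plt.imshow(S, cmap="viridis")
plt.title("Block-diagonal structure in toy relation embeddings")
plt.savefig("toy_heatmap.png", dpi=150)
```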

V. From Geometry to Prior Design

Our key insight:

"If we can measure structure in latent space (blocks, hierarchies, co-linearity), we can engineer priors to match that structure--using KL-divergence or graph regularizers to guide SAE learning accordingly."

Concrete next-step priors:

Block-Structured Gaussian Covariance. Use a prior covariance Σ = blockdiag(Σ₁, …, Σ_K), one block per semantic category, with intra-block correlation ρ (sketched in code below).

Tree-Laplacian Regularizer. Build a small family-tree graph G, compute its Laplacian L_G, and add a penalty λ zᵀ L_G z to encourage activations z to vary smoothly along known edges (parent–child vs. sibling).

Hyperbolic / Poincaré Priors. For truly hierarchical relations, a hyperbolic latent prior naturally embeds tree distances and can be plugged into a variational framework (e.g. Poincaré VAE).
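
To make the first two priors concrete, here is a minimal sketch under assumed shapes and hyperparameters (the block sizes, ρ, and λ are placeholders, and this is not the final V-SAE code): a block-structured Gaussian used as the KL target of a variational SAE, plus a tree-Laplacian smoothness penalty over latent activations.

```python
# Minimal sketch (assumed interface, not the final V-SAE code) of two
# geometry-aware penalties:
#  1) KL( q(z|x) || N(0, Sigma) ) with a block-diagonal Sigma whose blocks have
#     intra-block correlation rho;
#  2) a tree-Laplacian smoothness term  lambda * z^T L_G z  over a family-tree graph.
import torch

def block_covariance(block_sizes, rho=0.5):
    """Block-diagonal covariance: 1 on the diagonal, rho inside each block."""
    blocks = [(1 - rho) * torch.eye(b) + rho * torch.ones(b, b) for b in block_sizes]
    return torch.block_diag(*blocks)

def kl_to_correlated_gaussian(mu, logvar, Sigma):
    """KL( N(mu, diag(exp(logvar))) || N(0, Sigma) ), summed over the batch."""
    Sigma_inv = torch.linalg.inv(Sigma)
    var = logvar.exp()                                   # (batch, k)
    trace_term = (var * torch.diag(Sigma_inv)).sum(-1)   # tr(Sigma^-1 diag(var))
    quad_term = ((mu @ Sigma_inv) * mu).sum(-1)          # mu^T Sigma^-1 mu
    logdet_term = torch.logdet(Sigma) - logvar.sum(-1)   # log det Sigma / det diag(var)
    return 0.5 * (trace_term + quad_term - mu.shape[-1] + logdet_term).sum()

def tree_laplacian_penalty(z, edges, lam=0.1):
    """lam * sum over graph edges (i, j) of ||z[:, i] - z[:, j]||^2, i.e. lam * z^T L_G z."""
    return lam * sum(((z[:, i] - z[:, j]) ** 2).sum() for i, j in edges)
```

In a V-SAE objective these terms would simply replace the isotropic KL and be added to the reconstruction loss, pulling the learned dictionary toward the measured block and tree structure rather than toward artificial orthogonality.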

These geometry-aware priors will form the backbone of our Part III experiments, where we deploy V-SAEs and Crosscoders on real LLM activations—enforcing the semantic blocks and hierarchies we have now fully characterized.



