From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations

This post examines the intrinsic geometric structure of large language model (LLM) feature space, in particular its block-diagonal character. By measuring cosine-similarity heatmaps, the researchers find that LLM activation space is neither random nor fully orthogonal: activations cluster into blocks aligned with semantic categories. The post also proposes and tests hypotheses about hierarchical sub-clustering and linear co-linearity of features, laying the groundwork for sparse autoencoder (SAE) priors that match this natural geometry. These findings will guide follow-up work that adds block-structured and graph-Laplacian priors to V-SAEs and Crosscoders, aiming to disentangle and represent concepts in language models more effectively.

📦 **Block-diagonal structure reveals semantic clustering:** The study finds that LLM feature space is not randomly distributed but shows a clear block-diagonal structure: concepts from the same semantic category (such as family relations) cluster together, forming highly correlated "blocks" in the similarity matrix, while correlations across categories stay weak. This offers an important clue for understanding and exploiting LLMs' internal representations.

🌳 **Hierarchical sub-clustering and linear co-linearity:** Within each category block, finer semantic relations (such as parent-child, sibling, grandparent) form nested sub-clusters. In addition, embeddings of related concepts tend to lie along a common line or within a small subspace, so one concept's embedding can be predicted as a linear combination of another's, revealing linear dependence among semantic relations.

💡 **Guidance for geometry-aware prior design:** Building on this picture of feature-space geometry, the post proposes designing sparse autoencoder (SAE) priors that match these structures. Block-structured Gaussian covariances, tree-Laplacian regularization, or hyperbolic priors can help SAEs capture and preserve the true semantic structure of LLMs, avoiding artificial orthogonalization and enabling more effective feature disentanglement.

🚀 **Engineering applications and outlook:** These findings lay the groundwork for integrating block-structured and graph-Laplacian priors into V-SAEs and Crosscoders. Applying such geometry-aware priors to real LLM activations aims to improve how semantic information is represented and to provide stronger tools for understanding and steering the internal workings of LLMs.

Published on July 19, 2025 12:26 PM GMT

by Yuxiao Li, Zachary Baker, Maxim Panteleev, Maxim Finenko

June 2025 | SPAR Spring '25

A post in our series "Feature Geometry & Structured Priors in Sparse Autoencoders"

TL;DR: We explore the intrinsic block-diagonal geometry of LLM feature space--first observed in raw embeddings and family-tree probes--by measuring cosine-similarity heatmaps. These diagnostics set the stage for baking block-structured and graph-Laplacian priors into V-SAEs and Crosscoders in later posts. Assumptions. tbd.

About this series

This is the second post of our series on how realistic feature geometry in language model (LM) embeddings can be discovered and then encoded into sparse autoencoder (SAE) priors. Since February, we have combined probabilistic modeling, geometric analysis, and mechanistic interpretability.

Series Table of Contents

Part I: Toy model comparison of isotropic vs global-correlation priors in V-SAE

➡️ Part II (you are here): Block-diagonal structures in toy LM activations

Part III: Crosscoders & Ladder SAEs (multi-layer, multi-resolution coding)

Part IV: Revisiting the Linear Representation Hypothesis (LRH) via geometric probes


0. Background Story

Imagine walking into a library where every book—on wolves, quantum mechanics, and Greek myths—is piled into a single cramped shelf, pages interleaved and impossible to separate. This is the chaos of superposition in neural activations: too many concepts squeezed into too few dimensions. Traditional SAEs deal with the problem by demanding perfect orthogonality—like a librarian who insists each title stand ten feet apart—scattering lions and tigers to opposite ends of the building. In this post, we instead first map out the library’s natural wings—block-diagonal clusters of related concepts in LLM activations—so that when we design our SAE priors, we preserve true semantic neighborhoods while still untangling every hidden concept.

I. From Priors to Geometry: Why We Need to "Look Under the Hood"

Last week we kicked off this series with a dive into Variational SAEs (V-SAEs) and showed their key ability: sculpting feature disentanglement via variational priors. On simple synthetic benchmarks, the vanilla SAE struggled to separate correlated latents, whereas a global-correlation prior dramatically purified the features. But one question remained:

What kind of prior should we choose if we want SAEs to capture the rich, categorical geometry we actually see in language models?

Answering that requires exploring the true layout of the SAE's latent space and the model's raw embedding space. Only once we know how concepts cluster "in the wild" can we bake those patterns back into our priors--sharpening real semantic blocks and eliminating spurious overlap.

II. Related Work: Mapping and Manipulating Semantic Geometry

A growing body of work has begun to chart the geometry of learned representations--and to use that map to guide feature learning:

Taken together, these works paint a consistent picture: LLM feature spaces are neither random nor purely orthogonal under superposition--they exhibit structured geometry that we can and should exploit.

III. Propositions on Feature Geometry (Informal)

Before designing SAE priors, we posit three core hypotheses about how relational features arrange themselves in latent space:

Informal Proposition 1 (Categorical Block-Diagonal Structure). Embeddings of relations from the same semantic category cluster together, yielding a block-diagonal pattern in the pairwise similarity matrix. Off-block correlations (across distinct categories) remain near zero.
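
As a quick illustration (our own sketch on synthetic data, not the real LM activations used later), the block-diagonal claim can be checked by comparing the mean cosine similarity within a category block against the off-block mean:

```python
# Minimal sketch (not the authors' code; data and names are made up) of the
# Proposition 1 diagnostic: within-block vs. off-block mean cosine similarity.
import numpy as np

def block_diagnostic(embeddings, labels):
    """embeddings: (n, d) array; labels: (n,) integer category ids."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                                  # pairwise cosine similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within = S[same & off_diag].mean()           # intra-category (block) mean
    across = S[~same].mean()                     # cross-category (off-block) mean
    return within, across

# Toy check: two categories drawn around different mean directions.
rng = np.random.default_rng(0)
centers = rng.normal(size=(2, 64))
emb = np.vstack([c + 0.3 * rng.normal(size=(20, 64)) for c in centers])
lab = np.repeat([0, 1], 20)
print(block_diagnostic(emb, lab))                # expect within >> across ≈ 0
```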

 

Informal Proposition 2 (Hierarchical Sub-Clustering). Within each category's block, finer sub-relations form nested sub-clusters: a two-level hierarchy that can be revealed by removing global distractor dimensions.
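
One rough way to probe this (again our own sketch; the Ward linkage and the choice of two removed components are assumptions, not the original method) is to project out a few globally shared principal directions and then run agglomerative clustering inside one category's block:

```python
# Rough sketch of the Proposition 2 diagnostic (illustrative choices only).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def remove_top_pcs(X, k=2):
    """Project out the k leading principal directions shared across the block
    (the 'global distractor' dimensions)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc - Xc @ Vt[:k].T @ Vt[:k]

def sub_clusters(block_embeddings, n_clusters=3):
    """Agglomerative clustering inside one category block, after cleaning."""
    cleaned = remove_top_pcs(block_embeddings)
    Z = linkage(cleaned, method="ward")          # bottom-up hierarchical tree
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```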

 

Informal Proposition 3 (Linear Co-Linearity of Related Roles). Embeddings of semantically related roles lie nearly on a common line--or within a small subspace--so that one embedding can be predicted as a linear combination of another plus a small residual.
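
A minimal co-linearity check (illustrative only; the vectors below are synthetic) fits the best scalar multiple of one role's embedding to predict another's and reports the fraction of squared norm it explains:

```python
# Illustrative co-linearity check (our sketch): how much of one role's
# embedding is explained by the best scalar multiple of a related role's.
import numpy as np

def colinearity_score(u, v):
    """Fraction of ||v||^2 captured by the least-squares fit v ≈ a·u."""
    a = float(u @ v) / float(u @ u)              # optimal scalar coefficient
    residual = v - a * u
    return 1.0 - float(residual @ residual) / float(v @ v)

# A 'child' vector built as a scaled 'parent' vector plus a small residual
# scores close to 1, whereas two unrelated random vectors score near 0.
rng = np.random.default_rng(0)
parent = rng.normal(size=128)
child = 0.8 * parent + 0.1 * rng.normal(size=128)
print(colinearity_score(parent, child))                  # ≈ 0.98
print(colinearity_score(parent, rng.normal(size=128)))   # ≈ 0
```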

Finally, these observed patterns suggest a guiding principle for our next step:

Informal Proposition 4 (Geometry-Aware Priors Refine Features). If we can measure the latent geometry (blocks, hierarchies, co-linearity), then we can engineer V-SAE priors--via block-structured covariance, tree-based Laplacians, or hyperbolic priors--to align the learned dictionary with that geometry, avoiding artificial orthogonality and preserving true semantic continuity.

IV. Empirical Structure Discovery

We tested Propositions 1-3 on synthetic family-tree embeddings in toy experiments.
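
Since the family-tree data and heatmap figures are not reproduced here, the sketch below is a toy stand-in for that setup (our own construction, with made-up category names): each relation embedding is a category center plus a sub-relation offset plus noise, and the cosine-similarity heatmap, ordered by category, is where the blocks and nested sub-blocks of Propositions 1-2 would show up.

```python
# Toy stand-in for the family-tree probe (assumed setup, not the authors' data):
# relation embedding = category center + sub-relation offset + noise, so
# Propositions 1-2 predict visible blocks and nested sub-blocks in the heatmap.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
d, per_relation = 128, 10
categories = {"family": ["parent", "child", "sibling", "grandparent"],
              "animals": ["lion", "tiger", "wolf"],
              "physics": ["quark", "photon"]}

vectors, labels = [], []
for cat, relations in categories.items():
    center = rng.normal(size=d)                    # shared category direction
    for rel in relations:
        offset = 0.4 * rng.normal(size=d)          # sub-cluster structure
        for _ in range(per_relation):
            vectors.append(center + offset + 0.2 * rng.normal(size=d))
            labels.append(f"{cat}/{rel}")

X = np.array(vectors)
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = X @ X.T                                        # cosine-similarity heatmap

plt.imshow(S, cmap="viridis")
plt.title("Block-diagonal structure in toy relation embeddings")
plt.savefig("toy_heatmap.png", dpi=150)
```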

V. From Geometry to Prior Design

Our key insight:

"If we can measure structure in latent space (blocks, hierarchies, co-linearity), we can engineer priors to match that structure--using KL-divergence or graph regularizers to guide SAE learning accordingly."

Concrete next-step priors:

Block-Structured Gaussian Covariance. Use a prior covariance Σ = blockdiag(Σ₁, …, Σ_K), one block per semantic category, with intra-block correlation ρ (sketched in code below).

Tree-Laplacian Regularizer. Build a small family-tree graph G, compute its Laplacian L_G, and add a penalty λ zᵀ L_G z to encourage activations z to vary smoothly along known edges (parent–child vs. sibling).

Hyperbolic / Poincaré Priors. For truly hierarchical relations, a hyperbolic latent prior naturally embeds tree distances and can be plugged into a variational framework (e.g. Poincaré VAE).
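
To make the first two priors concrete, here is a minimal sketch under assumed shapes and hyperparameters (the block sizes, ρ, and λ are placeholders, and this is not the final V-SAE code): a block-structured Gaussian used as the KL target of a variational SAE, plus a tree-Laplacian smoothness penalty over latent activations.

```python
# Minimal sketch (assumed interface, not the final V-SAE code) of two
# geometry-aware penalties:
#  1) KL( q(z|x) || N(0, Sigma) ) with a block-diagonal Sigma whose blocks have
#     intra-block correlation rho;
#  2) a tree-Laplacian smoothness term  lambda * z^T L_G z  over a family-tree graph.
import torch

def block_covariance(block_sizes, rho=0.5):
    """Block-diagonal covariance: 1 on the diagonal, rho inside each block."""
    blocks = [(1 - rho) * torch.eye(b) + rho * torch.ones(b, b) for b in block_sizes]
    return torch.block_diag(*blocks)

def kl_to_correlated_gaussian(mu, logvar, Sigma):
    """KL( N(mu, diag(exp(logvar))) || N(0, Sigma) ), summed over the batch."""
    Sigma_inv = torch.linalg.inv(Sigma)
    var = logvar.exp()                                   # (batch, k)
    trace_term = (var * torch.diag(Sigma_inv)).sum(-1)   # tr(Sigma^-1 diag(var))
    quad_term = ((mu @ Sigma_inv) * mu).sum(-1)          # mu^T Sigma^-1 mu
    logdet_term = torch.logdet(Sigma) - logvar.sum(-1)   # log det Sigma / det diag(var)
    return 0.5 * (trace_term + quad_term - mu.shape[-1] + logdet_term).sum()

def tree_laplacian_penalty(z, edges, lam=0.1):
    """lam * sum over graph edges (i, j) of ||z[:, i] - z[:, j]||^2, i.e. lam * z^T L_G z."""
    return lam * sum(((z[:, i] - z[:, j]) ** 2).sum() for i, j in edges)
```

In a V-SAE objective these terms would simply replace the isotropic KL and be added to the reconstruction loss, pulling the learned dictionary toward the measured block and tree structure rather than toward artificial orthogonality.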

These geometry-aware priors will form the backbone of our Part III experiments, where we deploy V-SAEs and Crosscoders on real LLM activations—enforcing the semantic blocks and hierarchies we have now fully characterized.



