This post presents some motivation for why we work on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback.
Could OpenAI have avoided releasing an absurdly sycophantic model update? Could mechanistic interpretability have caught it? Maybe with model diffing!
Model diffing is the study of mechanistic changes introduced during fine-tuning - essentially, understanding what makes a fine-tuned model different from its base model internally. Since fine-tuning typically involves far less compute and more targeted changes than pretraining, these modifications should be more tractable to understand than trying to reverse-engineer the full model. At the same time, many concerning behaviors (reward hacking, sycophancy, deceptive alignment) emerge during fine-tuning[1], making model diffing potentially valuable for catching problems before deployment. We investigate the efficacy of model diffing techniques by exploring the internal differences between base and chat models.
TL;DR
- We demonstrate that model diffing techniques offer valuable insights into the effects of chat-tuning, highlighting that this is a promising research direction deserving more attention.
- We investigated Crosscoders, a sparse dictionary diffing method developed by Anthropic, which learns a shared dictionary between the base and chat model activations, allowing us to identify latents specific to the chat model ("chat-only latents").
- Using Latent Scaling, a new metric that quantifies how model-specific a latent is, we discovered fundamental issues in Crosscoders causing hallucinations of chat-only latents.
- We show that you can resolve these issues with a simple architectural adjustment: using a BatchTopK activation instead of an L1 sparsity loss.
- This reveals interesting patterns created by chat fine-tuning, like the importance of template tokens such as `<end_of_turn>`.
- However, we show that combining Latent Scaling with an SAE trained on the chat model or on the chat - base activation difference might work even better.
- We're working on a toolkit to run model diffing methods and on an evaluation framework to properly compare the different methods in more controlled settings (where we know what the diff is).

Background
Sparse dictionary methods like SAEs decompose neural network activations into interpretable components, finding the “concepts” a model uses. Crosscoders are a clever variant: they learn a single set of concepts shared between a base model and its fine-tuned version, but with separate representations for each model. Concretely, if a concept is detected, it’s forced to be reconstructed in both models, but the way it’s reconstructed can be different for each model.
The key insight is that if a concept only exists in the fine-tuned model, its base model representation should have zero norm (since the base model never needs to reconstruct that concept). By comparing representation norms, we can identify concepts that are unique to the base model, unique to the fine-tuned model, or shared between them. This seemed like a principled way to understand what fine-tuning adds to a model.
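As a minimal sketch of this setup (the variable names and exact parameterization are ours, not necessarily those of Anthropic's implementation), a crosscoder with a single shared latent space but separate per-model decoders, plus the decoder-norm comparison used to flag "chat-only" latents, might look like:

```python
import torch
import torch.nn as nn

class CrosscoderSketch(nn.Module):
    """One shared latent space; a separate decoder ("representation") per model."""
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        # The encoder reads both models' activations (here simply concatenated).
        self.encoder = nn.Linear(2 * d_model, n_latents)
        self.dec_base = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)
        self.dec_chat = nn.Parameter(torch.randn(n_latents, d_model) * 0.01)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        # Shared latent activations: if a concept fires, it is used to reconstruct
        # *both* models' activations, but with different decoder directions.
        f = torch.relu(self.encoder(torch.cat([act_base, act_chat], dim=-1)))
        return f, f @ self.dec_base, f @ self.dec_chat

def chat_only_candidates(cc: CrosscoderSketch, rel_threshold: float = 0.1) -> torch.Tensor:
    """Flag latents whose base decoder norm is tiny relative to their chat decoder norm."""
    ratio = cc.dec_base.norm(dim=-1) / (cc.dec_chat.norm(dim=-1) + 1e-8)
    return ratio < rel_threshold
```

Training would add a reconstruction loss for each model plus a sparsity penalty (L1 in the original setup, or the BatchTopK variant discussed below).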
The Problem: Most “Model-Specific” Latents Aren’t
In our latest paper, we first trained a crosscoder on the middle layer of the Gemma 2 2B base and chat models, following Anthropic's setup. The representation norm comparison revealed thousands of apparent "chat-only" latents. But when we looked at these latents, most weren't interpretable - they seemed like noise rather than meaningful concepts.
This led us to develop Latent Scaling, a technique that measures how much each latent actually explains the activations in each model. For a latent $j$ with representation (decoder direction) $d_j$ and activation $f_j(x)$, we compute

$$\beta_j^m = \arg\min_{\beta}\; \mathbb{E}_x\!\left[\big\|\beta\, f_j(x)\, d_j - a^m(x)\big\|_2^2\right],$$

which measures the importance of latent $j$ for reconstructing model $m$'s activations[2]. We can then measure the fine-tuning specificity of a latent with the ratio

$$\nu_j = \frac{\beta_j^{\text{base}}}{\beta_j^{\text{chat}}}.$$

For a highly fine-tuning-specific latent, we'd expect $\nu_j \approx 0$, while for a shared latent, we'd expect $\nu_j \approx 1$.
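As a rough sketch of how this can be computed (using the closed form given in Appendix B; the per-latent, independent least-squares fit shown here is a simplification and the function names are ours):

```python
import torch

def latent_scaling_beta(f_j: torch.Tensor, d_j: torch.Tensor, acts: torch.Tensor) -> float:
    """Closed-form least-squares scale of one latent against one model's activations.

    f_j:  (N,)   activations of latent j over N tokens
    d_j:  (D,)   latent j's decoder direction
    acts: (N, D) the model's activations a^m(x) on the same tokens
    """
    # argmin_beta sum_x || beta * f_j(x) * d_j - a^m(x) ||^2
    numerator = torch.einsum("n,nd,d->", f_j, acts, d_j)   # sum_x f_j(x) * d_j^T a^m(x)
    denominator = (f_j ** 2).sum() * (d_j ** 2).sum()      # sum_x f_j(x)^2 * ||d_j||^2
    return (numerator / denominator).item()

def specificity_nu(f_j, d_j, acts_base, acts_chat) -> float:
    """nu_j = beta_base / beta_chat: ~0 for chat-specific latents, ~1 for shared ones."""
    return latent_scaling_beta(f_j, d_j, acts_base) / latent_scaling_beta(f_j, d_j, acts_chat)
```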
Latent scaling revealed that most “chat-only” latents according to representation norms were actually present (with different intensities) in both models.
Why does this happen? The L1 sparsity penalty used in standard crosscoders creates two types of artifacts:
- Complete Shrinkage: If a concept is slightly more important in the chat model, L1 regularization can drive the base model's representation to exactly zero, even though the concept exists in both models.
- Latent Decoupling: Instead of learning one shared concept, the crosscoder might learn separate base-only and chat-only versions of the same concept, because the L1 penalty makes this mathematically equivalent to learning a shared latent.
The Fix: BatchTopK Crosscoders
The root cause is that L1 penalties optimize for total activation strength rather than true sparsity. This means they penalize latents for activating strongly, even when those latents represent genuinely important concepts. BatchTopK fixes this by enforcing true L0 sparsity: it simply keeps the k most important latent activations per input on average (selected jointly across a batch)[3], regardless of how large those activations are.
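A minimal sketch of a BatchTopK activation, assuming latent pre-activations of shape (batch, n_latents); implementations typically also estimate a global threshold for use at inference time, which we omit here:

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k * batch_size largest pre-activations across the whole batch
    (i.e. k active latents per input on average) and zero out the rest.
    The surviving activations are not shrunk at all."""
    batch_size = pre_acts.shape[0]
    threshold = pre_acts.flatten().topk(k * batch_size).values.min()
    return torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
```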
When we retrained crosscoders using BatchTopK instead of L1:
- The fake chat-only latents disappeared
- BatchTopK found more true chat-specific latents, and those were highly interpretable[4]
Some examples of real chat-only latents we found: refusal detection, false information detection by the user, personal questions about the model’s experiences, and various nuanced refusal triggers (harmful content, stereotypes, etc.).
If you’re interested in the details of our crosscoder analysis we invite you to check our paper or twitter thread.
Maybe We Don’t Need Crosscoders?
Even with BatchTopK fixing the spurious latents issue, crosscoders have an inherent bias: they preferentially learn shared latents because these reduce reconstruction loss for both models while only using one dictionary slot. This architectural pressure means model-specific latents must be disproportionately important to be included, leading to higher-frequency patterns that may miss subtler differences.
This motivated us to try simpler approaches. First, we trained a BatchTopK SAE directly on layer 13 of the chat model and used Latent Scaling to identify chat-specific latents. While this worked well, it still attempts to first understand the entire chat model and then find what's different. Also, by design, it makes it hard to detect latents that the chat model reuses in new contexts, whereas a crosscoder will learn chat-specific latents for those.
So we went one step simpler: train an SAE directly on the difference between activations (chat - base)[5]. This "diff-SAE" approach forces the dictionary to focus exclusively on what changes between models. Both the chat-SAE and diff-SAE approaches found highly interpretable[6] chat-specific latents without the frequency artifacts of crosscoders.

Notably, both SAE approaches find many more chat-specific latents than the crosscoder.
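A minimal sketch of how the diff-SAE training data could be collected, assuming HuggingFace-style models that expose per-layer hidden states (the function name and layer handling are ours):

```python
import torch

@torch.no_grad()
def diff_sae_batch(base_model, chat_model, tokens: torch.Tensor, layer: int = 13) -> torch.Tensor:
    """One training batch for a diff-SAE: the per-token difference between the
    chat and base residual streams at the same layer, on the same inputs."""
    base_h = base_model(tokens, output_hidden_states=True).hidden_states[layer]
    chat_h = chat_model(tokens, output_hidden_states=True).hidden_states[layer]
    return chat_h - base_h  # a standard (BatchTopK) SAE is then trained on these differences
```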
Inspired by the latest OpenAI diffing paper, we also tried applying Latent Scaling to a base SAE (using latent activations computed on the chat model's activations), but found that the "chat-specific" latents were not interpretable[7].
diff-SAEs Excel at Capturing Behavioral Differences
While finding interpretable latents is crucial for model diffing applications, it’s important to know whether those latents capture what actually makes the models behave differently. We evaluate this by adding chat-specific latents to base model activations, running the result through the chat model’s remaining layers, and measuring how closely the output matches true chat model behavior (via KL divergence on the output distribution).
Our baseline is simply running base activations through the chat model - any reduction in KL from this baseline indicates how much of the behavioral difference our method captures.
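A rough sketch of this evaluation, following the patch described in footnote 8; `run_from_layer` is an assumed helper that feeds activations into the chat model at the diffing layer and runs the remaining layers, and the KL direction is a choice we make for illustration:

```python
import torch.nn.functional as F

def fraction_of_kl_recovered(base_act, chat_act, chat_recon, base_recon, run_from_layer):
    """Fraction of the baseline KL gap closed by patching in the reconstructed diff."""
    logits_chat = run_from_layer(chat_act)                                  # true chat behavior
    logits_baseline = run_from_layer(base_act)                              # base acts, no patch
    logits_patched = run_from_layer(base_act + (chat_recon - base_recon))   # add reconstructed diff

    def kl_from_chat(logits):
        # KL(chat || patched/baseline) over the output distribution
        return F.kl_div(F.log_softmax(logits, dim=-1), F.log_softmax(logits_chat, dim=-1),
                        log_target=True, reduction="batchmean")

    return 1 - kl_from_chat(logits_patched) / kl_from_chat(logits_baseline)
```

For a diff-SAE, `chat_recon - base_recon` is simply the SAE's reconstruction of the activation difference.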
Crosscoders leave significant behavioral differences on the table: When we use the crosscoder’s full reconstruction of the difference, we only recover 60% of the KL gap. Surprisingly, using just the reconstruction error (what the crosscoder fails to capture) also recovers 60% - meaning the crosscoder’s error contains as much behaviorally-relevant information as its actual reconstruction[8].
diff-SAEs capture nearly all behavioral differences: The same experiment with diff-SAEs recovers 88% of the KL gap using the reconstruction, with only 30% in the error. Despite achieving a lower fraction of variance explained[9] (82% vs 91% for chat-SAEs), diff-SAEs are dramatically better at capturing what matters for behavior. This pattern of lower FVE but higher KL recovery hints that some of the activation difference between the base and chat models is behaviorally irrelevant noise that the diff-SAE ignores.
Chat-specific latents from diff-SAEs are more behaviorally relevant: When we steer using only latents with strong chat-specificity (latent scaling scores below 0.3 or 0.6), we see clear differences:
The diff-SAE consistently outperforms both the crosscoder and the chat-SAE at recovering behavioral differences. While the crosscoder and the chat-SAE plateau around 10-20% KL recovery at these thresholds, the diff-SAE achieves 40% to over 60%, confirming that it's better at isolating the latents that actually drive behavioral changes.
One thing to note here is that the diff-SAE has many more latents below those thresholds. However, we did control for this, using SAEs with expansion factors as low as 1 (keeping everything else the same and hence only having 2304 latents in total) and still reaching 76% FVE and 37%, 55%, and 82% recovered KL when using latents with $\nu_j < 0.3$, latents with $\nu_j < 0.6$, and all latents respectively. This shows that diff-SAEs are indeed much better at reconstructing the behavioral difference even when controlling for the number of chat-specific latents.
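For concreteness, a hedged sketch of this thresholded steering experiment, where only latents with a latent-scaling score below a threshold contribute to the patch (names and shapes are ours; for a diff-SAE the base decoder is simply zero):

```python
import torch

def patch_chat_specific_only(f, dec_chat, dec_base, nu, base_act, threshold: float = 0.3):
    """Add to the base activations only the contribution of latents whose
    latent-scaling score nu is below `threshold` (strongly chat-specific)."""
    keep = (nu < threshold).to(f.dtype)   # (n_latents,) mask of chat-specific latents
    f_kept = f * keep                     # zero out shared latents
    return base_act + f_kept @ dec_chat - f_kept @ dec_base
```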
Conclusion
The mechanistic interpretability community invested substantial effort into optimizing SAE metrics like reconstruction loss and sparsity that didn’t translate to downstream usefulness. With model diffing, we have a chance to do better - but only if we measure what matters. Our chat-tuning results, while interesting, highlight a fundamental challenge: without knowing what the “right answer” looks like, it’s hard to compare methods objectively.
Our results make it unclear whether crosscoders should be the default sparse dictionary technique for model diffing. We'd encourage people to also consider SAE-based alternatives, with diff-SAEs looking particularly promising. OpenAI has even shown that you can just look at the base-SAE latents that are more activated by the fine-tuned model to detect emergent misalignment.
To address this uncertainty, we’re working on developing diffing-game, inspired by Anthropic’s auditing game. By evaluating diffing techniques on model organisms with known modifications (e.g. models with new knowledge using synthetic document finetuning) we can measure whether our methods find what they’re supposed to find.
Beyond our evaluation framework, we see many promising research directions. We believe that one underexplored direction is using model diffing to debug specific model behaviors; a simple example would be backtracking in reasoning models. Other research directions we're excited about:
- Weight-based Analysis: Could we skip activations entirely and decompose weight differences directly? In particular, LoRA adapters seem an interesting object to study, similar to the recent emergent misalignment study. Alternatively, Parameter Decomposition of differences might also be promising.
- Scaling harmful behaviors (@Jacob Pfau): Fine-tuning might introduce harmful behaviors at probabilities too low to detect normally. Can model diffing amplify these small differences to make them more visible? Turner, Soligo et al. (2025) also showed that you can amplify emergent misalignment by scaling weight differences - can we do something similar in activation space? What else could we find?
If you’re interested in model diffing, feel free to reach out and join the #model-diffing channel on the Open Source Mechanistic Interpretability Slack!
Appendix
We use this appendix to catalog existing model diffing techniques and share an interesting result from model stitching. If we’ve missed relevant work, please let us know!
A) Other techniques in the model diffing toolkit
- Base model SAE analysis: Inspect base model SAE latents that activate more strongly in the fine-tuned model (a minimal sketch of this comparison is given after this list). OpenAI also filters for high-causality latents, though it's unclear how critical this step is - it does require knowing which specific behavior you're investigating. When we applied this technique to Gemma-2-2B chat using a base model SAE, we didn't find interpretable results. Several factors could explain this:
  - Template tokens may interfere with activation patterns
  - Causality filtering might be essential for the method to work
  - The difference between base and chat models may be too large
  - Our base SAE may lack sufficient capacity or training
  - We trained our base SAE on a mixture of pretraining and chat data, while OpenAI only used the pretraining corpus
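A minimal sketch of that comparison (the `base_sae.encode` API and the ranking by mean-activation ratio are our assumptions, not necessarily OpenAI's exact procedure):

```python
import torch

@torch.no_grad()
def latents_boosted_by_chat(base_sae, base_acts: torch.Tensor, chat_acts: torch.Tensor) -> torch.Tensor:
    """Encode both models' activations with the *base* model's SAE and rank
    latents by how much more strongly they fire on the chat model."""
    f_base = base_sae.encode(base_acts)   # (n_tokens, n_latents)
    f_chat = base_sae.encode(chat_acts)
    ratio = (f_chat.mean(dim=0) + 1e-8) / (f_base.mean(dim=0) + 1e-8)
    return ratio.argsort(descending=True)  # most chat-boosted latents first
```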
B) Closed Form Solution for Latent Scaling
The closed form solution for $\beta_j^m$, for a latent $j$ with representation $d_j$, latent activations $f_j(x)$, and LLM activations $a^m(x)$, is

$$\beta_j^m = \frac{\sum_x f_j(x)\, d_j^\top a^m(x)}{\sum_x f_j(x)^2\, \|d_j\|_2^2}.$$
- ^
Base models can probably also exhibit these behaviors in the wild (e.g. they’ve been reported to show alignment faking or sycophancy), but the frequency and triggers likely change during fine-tuning, which model diffing could catch.
- ^
See appendix B for the closed form solution.
- ^
Probably any activation function like TopK or JumpReLU that directly optimizes L0 would work similarly.
- ^
You can explore some of the latents we found in the BatchTopK crosscoder here or through our colab demo.
- ^
Concurrent work by Aranguri, Drori & Nanda (2025) also trained diff-SAEs and developed a pipeline for identifying behaviorally relevant latents: they identify high-KL tokens where base and chat models diverge, then use attribution methods to find which diff-SAE latents most reduce this KL gap when added to the base model.
- ^
You can find some SAE latents we collected here or through our colab demo.
- ^
See the first bullet point in appendix A for speculation on why this didn't work in our setting.
- ^
For the reconstruction, we patch `base_act + (chat_recon - base_recon)` into the chat model. For the error, we patch `base_act + ((chat_act - base_act) - (chat_recon - base_recon)) = chat_act - chat_recon + base_recon`.
- ^
Fraction of variance explained (FVE) measures how well the SAE reconstructs the model activations.
- ^
Cool result using model stitching:
We experimented with progressively replacing layers of Gemma 2 2B base with corresponding layers from the chat model. This created a smooth interpolation between base and chat behavior that revealed interesting patterns:
At 50% replacement, the model began following instructions but maintained the base model's characteristic directness. Most revealing was the model's self-identification: when asked "Who trained you?", early stitching stages claimed to be "ChatGPT by OpenAI". Presumably, at this stage, the chat model represents that it's a chatbot, which gets converted into ChatGPT by the base model's prior from its training data. Only after layers 20-25 did the model correctly identify as Gemma trained by Google - suggesting these later layers encode model-specific identity.