What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

This article explores the role of model diffing in understanding the fine-tuning of large language models (LLMs). The work focuses on comparing base and chat models to identify the internal mechanistic changes introduced during fine-tuning. The researchers use sparse dictionary methods, in particular crosscoders and BatchTopK, to analyze model activations, and introduce the Latent Scaling metric to quantify how model-specific each latent is. The results show that BatchTopK crosscoders correct problems found in standard crosscoders and surface many genuinely chat-specific latents. Furthermore, by comparing alternative approaches such as diff-SAEs, the researchers find that these methods capture behavioral differences between the models even better, offering valuable insight into the LLM fine-tuning process.

🧠 Model diffing is a method for studying the internal mechanistic changes that occur during fine-tuning, helping to understand the effect of fine-tuning on a model and to detect potential problems.

💡 Crosscoders learn a dictionary shared between the base and fine-tuned models to identify latents specific to the fine-tuned model. The study finds that the L1 sparsity penalty produces false positives, and that BatchTopK effectively fixes this problem, allowing chat-specific latents to be identified more accurately.

🔍 Latent Scaling is a new metric that quantifies how model-specific a latent is, helping to assess a latent's importance in each model. Using Latent Scaling, the researchers were able to uncover problems with the crosscoder approach and improve the accuracy of model diffing.

✨ Experiments show that diff-SAEs (SAEs trained on activation differences) are better at capturing behavioral differences between models. They recover more of the KL gap, indicating that diff-SAEs better capture the effect of fine-tuning on model behavior even though they explain less variance.

🔑 The results underscore the importance of model diffing for understanding LLM fine-tuning and propose improved methods, such as BatchTopK and diff-SAEs, for identifying and interpreting internal model changes more accurately.

Published on June 30, 2025 5:17 PM GMT

This post presents some motivation for why we work on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback.

Could OpenAI have avoided releasing an absurdly sycophantic model update? Could mechanistic interpretability have caught it? Maybe with model diffing!

Model diffing is the study of mechanistic changes introduced during fine-tuning - essentially, understanding what makes a fine-tuned model different from its base model internally. Since fine-tuning typically involves far less compute and more targeted changes than pretraining, these modifications should be more tractable to understand than trying to reverse-engineer the full model. At the same time, many concerning behaviors (reward hacking, sycophancy, deceptive alignment) emerge during fine-tuning[1], making model diffing potentially valuable for catching problems before deployment. We investigate the efficacy of model diffing techniques by exploring the internal differences between base and chat models.

RLHF is often visualized as a "mask" applied on top of the base LLM's raw capabilities (the "shoggoth"). One application of model diffing is studying this mask specifically, rather than the entire shoggoth+mask system.

TL;DR

Background

Sparse dictionary methods like SAEs decompose neural network activations into interpretable components, finding the “concepts” a model uses. Crosscoders are a clever variant: they learn a single set of concepts shared between a base model and its fine-tuned version, but with separate representations for each model. Concretely, if a concept is detected, it’s forced to be reconstructed in both models, but the way it’s reconstructed can be different for each model.

The key insight is that if a concept only exists in the fine-tuned model, its base model representation should have zero norm (since the base model never needs to reconstruct that concept). By comparing representation norms, we can identify concepts that are unique to the base model, unique to the fine-tuned model, or shared between them. This seemed like a principled way to understand what fine-tuning adds to a model.
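
To make this concrete, here is a minimal sketch of an L1 crosscoder in PyTorch. This is our own illustrative simplification (layer sizes, initialization, and the exact loss weighting are assumptions, not the paper's training code): a single shared latent code is computed from both models' activations, and each model gets its own decoder.

```python
import torch
import torch.nn as nn

class Crosscoder(nn.Module):
    """Minimal L1 crosscoder sketch: shared latents, per-model encoders/decoders."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        # One encoder and one decoder weight per model (index 0 = base, 1 = chat).
        self.W_enc = nn.Parameter(torch.randn(2, d_model, n_latents) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.W_dec = nn.Parameter(torch.randn(2, n_latents, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(2, d_model))

    def forward(self, acts):  # acts: [batch, 2, d_model] (base and chat activations)
        # Shared latent activations: both models contribute to the same code.
        pre = torch.einsum("bmd,mdl->bl", acts, self.W_enc) + self.b_enc
        f = torch.relu(pre)                                   # [batch, n_latents]
        # Separate reconstruction per model from the same latents.
        recon = torch.einsum("bl,mld->bmd", f, self.W_dec) + self.b_dec
        return f, recon

    def loss(self, acts, l1_coeff=3e-4):
        f, recon = self.forward(acts)
        mse = (recon - acts).pow(2).sum(-1).mean()
        # L1 penalty weighted by the summed decoder norms across the two models:
        # a latent pays separately for its representation in each model.
        dec_norms = self.W_dec.norm(dim=-1).sum(0)            # [n_latents]
        sparsity = (f * dec_norms).sum(-1).mean()
        return mse + l1_coeff * sparsity
```

The per-model decoder norms of each latent are exactly what the norm comparison described above operates on.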

The Problem: Most “Model-Specific” Latents Aren’t

In our latest paper, we first trained a crosscoder on the middle-layer of Gemma-2 2B base and chat models, following Anthropic’s setup. The representation norm comparison revealed thousands of apparent “chat-only” latents. But when we looked at these latents, most weren’t interpretable - they seemed like noise rather than meaningful concepts.

This led us to develop Latent Scaling, a technique that measures how much each latent actually explains the activations in each model. For a latent $j$, with representation (decoder direction) $d_j^m$ in model $m \in \{\text{base}, \text{chat}\}$ and activation $f_j(x)$, we compute

$$\beta_j^m = \arg\min_{\beta} \sum_{x} \left\lVert \beta \, f_j(x) \, d_j^m - a^m(x) \right\rVert_2^2,$$

which measures the importance of latent $j$ for reconstructing model $m$'s activations $a^m(x)$[2]. We can then measure the fine-tuning specificity of a latent with the ratio

$$\nu_j = \frac{\beta_j^{\text{base}}}{\beta_j^{\text{chat}}}.$$

For a highly fine-tuning-specific latent, we'd expect $\nu_j \approx 0$, while for a shared latent, we'd expect $\nu_j \approx 1$.
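
A minimal sketch of how one might compute these quantities, assuming the latent activations and both models' activations have already been collected over a dataset (variable names and batching are ours, not the paper's code):

```python
import torch

def latent_beta(f_j: torch.Tensor, d_j: torch.Tensor, acts: torch.Tensor) -> float:
    """Least-squares scale minimizing sum_x || beta * f_j(x) * d_j - a(x) ||^2.

    f_j:  [N]           latent activations over N tokens
    d_j:  [d_model]     the latent's decoder direction in one model
    acts: [N, d_model]  that model's activations on the same tokens
    """
    # Closed-form solution of the one-dimensional least-squares problem (see appendix B).
    numerator = (f_j * (acts @ d_j)).sum()
    denominator = d_j.pow(2).sum() * f_j.pow(2).sum()
    return (numerator / denominator).item()

def specificity(f_j, d_base, d_chat, base_acts, chat_acts) -> float:
    """nu_j = beta_base / beta_chat: ~0 for chat-specific latents, ~1 for shared ones."""
    return latent_beta(f_j, d_base, base_acts) / latent_beta(f_j, d_chat, chat_acts)
```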

Latent scaling revealed that most “chat-only” latents according to representation norms were actually present (with different intensities) in both models.

Why does this happen? The L1 sparsity penalty used in standard crosscoders creates two types of artifacts:

    Complete Shrinkage: If a concept is slightly more important in the chat model, L1 regularization can drive the base model's representation to exactly zero, even though the concept exists in both models.

    Latent Decoupling: Instead of learning one shared concept, the crosscoder might learn separate base-only and chat-only versions of the same concept, because the L1 penalty makes this mathematically equivalent to learning a shared latent.

The Fix: BatchTopK Crosscoders

The root cause is that L1 penalties optimize for total activation strength rather than true sparsity. This means they penalize latents for activating strongly, even when those latents represent genuinely important concepts. BatchTopK fixes this by enforcing true L0 sparsity - it simply selects the top-k most important latents for each input[3], regardless of their activation strength.
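
As a sketch of what this activation function looks like in code (a simplified version based on our understanding of BatchTopK; the actual implementation also handles things like inference-time thresholding, and may weight the selection by decoder norms):

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Minimal BatchTopK sketch: keep the batch_size * k largest activations
    across the whole batch and zero out the rest, giving an average L0 of k
    per input without penalizing how strongly the surviving latents activate.

    pre_acts: [batch, n_latents] encoder pre-activations.
    """
    acts = torch.relu(pre_acts)
    n_keep = acts.shape[0] * k
    # Activation value of the (batch * k)-th largest entry in the batch.
    threshold = acts.flatten().topk(n_keep).values.min()
    return torch.where(acts >= threshold, acts, torch.zeros_like(acts))
```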

When we retrained crosscoders using BatchTopK instead of L1:

Some examples of real chat-only latents we found: refusal detection, false information detection by the user, personal questions about the model’s experiences, and various nuanced refusal triggers (harmful content, stereotypes, etc.).

If you're interested in the details of our crosscoder analysis, we invite you to check out our paper or Twitter thread.

Maybe We Don’t Need Crosscoders?

Even with BatchTopK fixing the spurious latents issue, crosscoders have an inherent bias: they preferentially learn shared latents because these reduce reconstruction loss for both models while only using one dictionary slot. This architectural pressure means model-specific latents must be disproportionately important to be included - leading to higher frequency patterns that may miss subtler differences[4].

Frequency distribution of BatchTopK crosscoder latents. "Chat-only" latents are those whose base-model decoder norm is near zero relative to their chat decoder norm.

This motivated us to try simpler approaches. First, we trained a BatchTopK SAE directly on layer 13 of the chat model and used latent scaling to identify chat-specific latents. While this worked well, it still attempts to first understand the entire chat model and then find what's different. Also, by design, it makes it hard to detect latents that the chat model reuses in new contexts, whereas a crosscoder will learn chat-specific latents for those.

So we went one step simpler: train an SAE directly on the difference between activations (chat - base)[5]. This “diff-SAE” approach forces the dictionary to focus exclusively on what changes between models. Both the chat-SAE and diff-SAE approaches found highly interpretable[6] chat-specific latents without the frequency artifacts of crosscoders.
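
In code, this is just ordinary SAE training where the inputs are per-token activation differences. A minimal sketch (the sae object and its loss interface are placeholders, not our actual training code):

```python
import torch

def diff_sae_step(sae, optimizer, base_acts, chat_acts):
    """One diff-SAE training step: the dictionary only ever sees chat - base.

    base_acts, chat_acts: [batch, d_model] activations from the same tokens
    and the same layer of the base and chat models.
    sae: any SAE module (e.g. BatchTopK) exposing a loss(inputs) method.
    """
    diff = chat_acts - base_acts      # the activation difference is the training signal
    loss = sae.loss(diff)             # reconstruction (+ sparsity) on the difference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```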

Latent frequency distributions for chat-SAE (left) and diff-SAE (right). SAE "chat-only" latents (those with low latent-scaling scores $\nu_j$) show similar frequency distributions to shared latents, unlike in crosscoders.

Notably, both SAE approaches find many more chat-specific latents than the crosscoder.

Number of latents with low latent-scaling scores ($\nu_j$) across dictionaries. "x2" denotes the expansion factor.

Inspired by the latest OpenAI diffing paper, we also tried to apply latent scaling to a base SAE (using latent activations computed on the chat model's activations), but found that the "chat-specific" latents were not interpretable[7].

diff-SAEs Excel at Capturing Behavioral Differences

While finding interpretable latents is crucial for model diffing applications, it’s important to know whether those latents capture what actually makes the models behave differently. We evaluate this by adding chat-specific latents to base model activations, running the result through the chat model’s remaining layers, and measuring how closely the output matches true chat model behavior (via KL divergence on the output distribution).

Our baseline is simply running base activations through the chat model - any reduction in KL from this baseline indicates how much of the behavioral difference our method captures.
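
A sketch of this evaluation (run_from_layer is a placeholder for whatever hook-based machinery runs the chat model's remaining layers from a given residual stream; the exact bookkeeping in our experiments differs):

```python
import torch.nn.functional as F

def kl_recovery(chat_model, tokens, layer, base_acts, chat_acts, patched_acts):
    """Fraction of the base->chat KL gap recovered by a patched activation.

    All activations live at `layer`. patched_acts is e.g.
    base_acts + (reconstructed chat - base difference).
    """
    def kl_to_chat(acts):
        # Run the chat model's layers `layer`..end starting from `acts`.
        logits = run_from_layer(chat_model, acts, layer, tokens)        # placeholder helper
        target = run_from_layer(chat_model, chat_acts, layer, tokens)   # true chat output
        return F.kl_div(F.log_softmax(logits, dim=-1),
                        F.log_softmax(target, dim=-1),
                        log_target=True, reduction="batchmean")

    kl_gap = kl_to_chat(base_acts)        # baseline: base activations, no intervention
    kl_patched = kl_to_chat(patched_acts)
    return (1 - kl_patched / kl_gap).item()
```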

Crosscoders leave significant behavioral differences on the table: When we use the crosscoder’s full reconstruction of the difference, we only recover 60% of the KL gap. Surprisingly, using just the reconstruction error (what the crosscoder fails to capture) also recovers 60% - meaning the crosscoder’s error contains as much behaviorally-relevant information as its actual reconstruction[8].

diff-SAEs capture nearly all behavioral differences: The same experiment with diff-SAEs recovers 88% of the KL gap using the reconstruction, with only 30% in the error. Despite achieving a lower fraction of variance explained[9] (82% vs 91% for chat-SAEs), diff-SAEs are dramatically better at capturing what matters for behavior. This pattern of lower FVE but higher KL recovery hints that some of the activation difference between base and chat models is behaviorally irrelevant noise that the diff-SAE ignores.

Chat-specific latents from diff-SAEs are more behaviorally relevant: When we steer using only latents with strong chat-specificity (latent scaling scores below 0.3 or 0.6), we see clear differences:

The diff-SAE consistently outperforms both the crosscoder and the chat-SAE at recovering behavioral differences. While the crosscoder and the chat-SAE plateau around 10-20% KL recovery at these thresholds, the diff-SAE achieves 40% to 60%, confirming that it's better at isolating the latents that actually drive behavioral changes.

One thing to note here is that the diff-SAE has many more latents below those thresholds. However, we did control for this, using SAEs with expansion factors as low as 1 (keeping the same $k$ and hence only having 2304 latents in total) and still reaching 76% FVE and 37%, 55%, and 82% of recovered KL when using latents with $\nu_j < 0.3$, latents with $\nu_j < 0.6$, and all latents respectively. This shows that diff-SAEs are indeed much better at reconstructing the behavioral difference even when controlling for the number of chat-specific latents.

Conclusion

The mechanistic interpretability community invested substantial effort into optimizing SAE metrics like reconstruction loss and sparsity that didn’t translate to downstream usefulness. With model diffing, we have a chance to do better - but only if we measure what matters. Our chat-tuning results, while interesting, highlight a fundamental challenge: without knowing what the “right answer” looks like, it’s hard to compare methods objectively.

Our results show that it's unclear whether crosscoders should be the default sparse dictionary technique for model diffing. We'd encourage people to also consider SAE-based alternatives, with diff-SAEs looking particularly promising. OpenAI even showed that you can just look at the base-SAE latents that are more activated by the fine-tuned model to detect emergent misalignment.

To address this uncertainty, we're working on developing a diffing game, inspired by Anthropic's auditing game. By evaluating diffing techniques on model organisms with known modifications (e.g. models given new knowledge via synthetic document finetuning), we can measure whether our methods find what they're supposed to find.

Beyond our evaluation framework, we see many promising research directions. We believe one underexplored application is using model diffing to debug specific model behaviors; a simple example would be backtracking in reasoning models. Other research directions we're excited about:

If you’re interested in model diffing, feel free to reach out and join the #model-diffing channel on the Open Source Mechanistic Interpretability Slack!

Appendix

We use this appendix to catalog existing model diffing techniques and share an interesting result from model stitching. If we’ve missed relevant work, please let us know!

A) Other techniques in the model diffing toolkit

B) Closed Form Solution for Latent Scaling

The closed-form solution for $\beta$ for a latent with representation $d$, latent activations $f(x)$, and LLM activations $a(x)$ is

$$\beta = \frac{\sum_{x} f(x) \, d^\top a(x)}{\lVert d \rVert_2^2 \, \sum_{x} f(x)^2}.$$
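
As a quick numerical sanity check (ours, not from the paper), this closed form agrees with a generic least-squares fit of the same objective:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_model = 1000, 64
f = rng.exponential(size=N)             # latent activations (nonnegative)
d = rng.normal(size=d_model)            # latent representation / decoder direction
a = rng.normal(size=(N, d_model))       # LLM activations

# Closed form from above.
beta_closed = (f * (a @ d)).sum() / (np.dot(d, d) * (f ** 2).sum())

# Generic least squares on the vectorized objective || beta * outer(f, d) - a ||_F^2.
X = np.outer(f, d).reshape(-1, 1)
beta_lstsq = np.linalg.lstsq(X, a.reshape(-1), rcond=None)[0][0]

assert np.isclose(beta_closed, beta_lstsq)
```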

  1. ^

    Base models can probably also exhibit these behaviors in the wild (e.g. they’ve been reported to show alignment faking or sycophancy), but the frequency and triggers likely change during fine-tuning, which model diffing could catch.

  2. ^

    See appendix B for the closed form solution.

  3. ^

    Probably any activation function like TopK or JumpReLU that directly optimizes L0 would work similarly.

  4. ^

    You can explore some of the latents we found in the BatchTopK crosscoder here or through our colab demo.

  5. ^

    Concurrent work by Aranguri, Drori & Nanda (2025) also trained diff-SAEs and developed a pipeline for identifying behaviorally relevant latents: they identify high-KL tokens where base and chat models diverge, then use attribution methods to find which diff-SAE latents most reduce this KL gap when added to the base model.

  6. ^

    You can find some SAE latents we collected here or through our colab demo.

  7. ^

    See the first bullet point in appendix A for speculation on why this didn't work in our setting.

  8. ^

    For the reconstruction, we patch base_act + (chat_recon - base_recon) into the chat model. For the error, we patch base_act + ((chat_act - base_act) - (chat_recon - base_recon)) = chat_act - chat_recon + base_recon

  9. ^

    Fraction of variance explained (FVE) measures how well the SAE reconstructs the model activations.

  10. ^

    Cool result using model stitching:

    We experimented with progressively replacing layers of Gemma 2 2B base with corresponding layers from the chat model. This created a smooth interpolation between base and chat behavior that revealed interesting patterns:

    At 50% replacement, the model began following instructions but maintained the base model's characteristic directness. Most revealing was the model's self-identification: when asked "Who trained you?", early stitching stages claimed "ChatGPT by OpenAI". Presumably, at this stage, the chat model represents that it's a chatbot, which gets converted into ChatGPT by the base model's prior from its training data. Only after layers 20-25 did the model correctly identify itself as Gemma trained by Google - suggesting these later layers encode model-specific identity.
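
    A rough sketch of the stitching itself (illustrative; it assumes a Hugging Face-style Gemma model whose transformer blocks live under model.layers, that base and chat share an architecture, and that replacement proceeds from the first layer upward):

    ```python
    import copy

    def stitch(base_model, chat_model, n_layers: int):
        """Return a copy of the base model whose first n_layers transformer
        blocks are swapped for the corresponding chat-model blocks."""
        stitched = copy.deepcopy(base_model)
        for i in range(n_layers):
            stitched.model.layers[i] = chat_model.model.layers[i]
        return stitched
    ```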


