SHIFT relies on token-level features to de-bias Bias in Bios probes

Published on March 19, 2025 9:29 PM GMT

In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation. To summarize:

    - SHIFT ablates SAE latents related to undesirable concepts (in this case, gender) from an LM-based classifier trained on a biased dataset. The authors show that this procedure de-biases the classifier and argue that their experiment demonstrates real-world utility from SAEs.
    - We believe the Bias in Bios task is too simple: if SAEs only picked up on latents related to specific tokens, it would have been sufficient for them to do well on this de-biasing task.
    - Replicating Appendix A4 of Karvonen et al. (2025), we show that we can de-bias the probe by ablating only the SAE latents immediately after the embedding layer (i.e., at resid_pre_0). In fact, ablating 10 relevant embedding SAE latents works better than ablating all 45 non-embedding SAE latents.
    - We also show that one can simply remove all of the gender-related tokens and train an unbiased probe on the biased dataset. In fact, getting a working probe on this dataset doesn't require a language model at all: one could train a similarly accurate probe on only the post-embedding activations mean-pooled over all token positions, and de-bias this probe by removing gender-related tokens.

We don’t think the results in this post show that SHIFT is a bad method, but rather that the Bias in Bios dataset (or any other simple dataset) is not a good test bed to judge SHIFT or other similar methods. Follow-up work on SHIFT-like methods (e.g., Karvonen et al. (2025), Casademunt et al. (2025), and SAE Bench) has already used more complex datasets and found promising results. However, these studies still focus on fairly toy settings, and we are not aware of research that focuses on disentangling safety-relevant concepts (e.g., sycophancy versus correctness in reward models)[1].

In the rest of the post, we give some background on the method, share our experiment results showing that SAEs are not needed in the Bias in Bios setting, and speculate on how to properly evaluate/apply SHIFT.

This post focuses primarily on the v2 version of the Marks et al. (2024) paper and only on results for pythia-70m-deduped. Still, since our main criticism concerns the dataset used, we think it extends to the experiments done with Gemma-2 2B.

Sorry for the scattered collection of Colab notebooks. I originally found these results in December and hadn't shared them yet, so I decided to publish them in whatever form I could.

Background on SHIFT

(This section is largely adapted/copied from the original paper)

Spurious Human-interpretable Feature Trimming (SHIFT) is a method for detecting undesirable cognition in a model and distinguishing behaviorally identical models with different internal aims (Marks 2024). In the context of the Sparse Feature Circuits paper, it’s also intended as a proof of concept for how SAEs could be useful in downstream tasks. The paper has the following experiment setup: we have access to some labeled training data D, an LM-based classifier C trained on D, and SAEs for various components of the LM underlying C. To perform SHIFT, we:

1. Identify the SAE latents that have the largest impact on C’s accuracy on inputs from D (e.g., as measured by some metric m). These are calculated using integrated-gradient-based node attributions (see the sketch after this list).

2. Manually inspect and evaluate each feature in the circuit from Step 1 for task relevance.

3. Ablate the features judged to be task-irrelevant from C to obtain a classifier C'.

4. (Optional) Further fine-tune C' on data from D.
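As a rough illustration, here is a minimal sketch of step 1 in PyTorch. It assumes you already have mean-pooled residual activations, an SAE exposing encode/decode, and a torch-based linear probe; all of these names are hypothetical stand-ins, and the gradient-times-activation score is a cheap one-step approximation of the integrated-gradient attribution used in the paper.

```python
import torch

def latent_attributions(resid, sae, probe):
    """Score how much each SAE latent matters for the probe's decision.

    resid: (batch, d_model) mean-pooled residual activations.
    sae / probe: hypothetical stand-ins with an encode/decode interface
    and a torch linear head. The paper uses integrated gradients; this is
    the cheaper one-step (activation times gradient) approximation.
    """
    acts = sae.encode(resid).detach().requires_grad_(True)
    metric = probe(sae.decode(acts)).sum()        # scalar metric m: summed probe logits
    metric.backward()
    return (acts * acts.grad).abs().mean(dim=0)   # (d_sae,) per-latent scores

# Step 2 is then manual: inspect the top-scoring latents (e.g., via
# max-activating examples) and flag the ones judged irrelevant to profession.
```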

The authors applied SHIFT to the Bias in Bios (BiB; De-Arteaga et al., 2019) task. BiB consists of professional biographies, and the task is to classify an individual’s profession based on their biography. They focused on nurses and professors. Here are two random examples from the dataset:

She remains actively involved with the Burn Treatment Center after working there for nearly 18 years. Throughout these years she was a Nursing Assistant, Staff Nurse, Assistant Nurse Manager, and most recently the Nurse Manager. She is a member of the American Burn Association, the American Association of Critical Care Nurses, and the Health Information Management Systems Society. She has been co-director of Miracle Burn Camp since 2001 & assisted in creating the Young Adult Burn Survivor retreat in 2007. Alison is the mother of two and stays active as a board member for the local youth hockey association and keeping up with her boys. To Contact Alison, please: protected email

 

Dr. Bigelow graduated from the University of Illinois-Urbana in May 2004 with a Ph.D. in Electrical Engineering. After completing his education, he was a Visiting Assistant Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign for a year. Dr. Bigelow was then an Assistant Professor in Electrical Engineering at the University of North Dakota for three years prior to coming to Iowa State University in August 2008. More About

The goal is to train a linear probe on a biased dataset where gender is perfectly correlated with profession, and then de-bias the probe so that it generalizes to a test set where gender is not correlated with profession. The authors trained a linear probe on layer 4 of pythia-70m-deduped, mean-pooling over all token positions.
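For concreteness, here is a minimal sketch of that probe setup (not the authors' code), using Hugging Face transformers and scikit-learn; biased_bios and biased_labels are hypothetical names for the gender-confounded nurse/professor training split.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped")

def mean_pooled_acts(texts, layer=4):
    """Residual-stream activations after `layer` blocks, mean-pooled over tokens."""
    feats = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
            out = model(**ids, output_hidden_states=True)
            feats.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(feats).numpy()

# Hypothetical usage on the gender-confounded training split:
# X = mean_pooled_acts(biased_bios)                                 # list of bio strings
# probe = LogisticRegression(max_iter=1000).fit(X, biased_labels)   # nurse vs. professor
```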

An oracle probe trained on an unbiased training dataset, where gender is not correlated with profession, achieves an accuracy of 93%. Marks et al. train a probe on a biased dataset where gender is perfectly predictive of profession. When tested on the unbiased set, this probe is only 63% accurate. 

By applying SHIFT and fine-tuning, they achieved an accuracy of 93%, comparable to when they had access to an unbiased dataset. 

The SHIFT experiment in Marks et al. 2024 relies on embedding features.

I’d like to thank Diego Caples for first sharing this result with me.

In their original experiment, the authors ablated 55 features from various parts of the model to de-bias their probe. We find that if you only ablate the 10 features in the post-embedding SAE, the resulting probe achieves an out-of-sample accuracy of 85.6% (compared to 88.5% after ablating 55 features) and a worst group accuracy of 72.3% (compared to 76%). If you only ablate non-embedding SAE latents, the out-of-sample accuracy drops slightly to 84%, but the worst group accuracy drops to 58.5%. 

(Google Colab link) See also Appendix A4 of Karvonen et al. 
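Concretely, the ablation amounts to something like the sketch below, where embedding_sae is the SAE trained on the post-embedding residual stream (resid_pre_0) and gender_latent_ids are the roughly ten latents identified by attribution and manual inspection; both names are hypothetical.

```python
import torch

def ablate_latents(resid, sae, latent_ids):
    """Zero the chosen SAE latents while keeping the SAE's reconstruction error."""
    acts = sae.encode(resid)              # (batch, seq, d_sae) latent activations
    err = resid - sae.decode(acts)        # keep what the SAE fails to reconstruct
    acts[..., latent_ids] = 0.0           # ablate the gender-related latents
    return sae.decode(acts) + err

# Applied as a forward hook on the embedding output (resid_pre_0), e.g.:
# model.gpt_neox.embed_in.register_forward_hook(
#     lambda mod, inp, out: ablate_latents(out, embedding_sae, gender_latent_ids)
# )
# then re-run the mean-pooling and probe evaluation as before.
```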

You can train an unbiased classifier just by deleting gender-related tokens from the data. 

We removed gender-related tokens (pronouns, possessive pronouns, and gendered honorifics) from the Bias in Bios dataset. We then trained a linear probe on layer 1 of pythia-70m-deduped, mean-pooling over all token positions, and achieved an out-of-sample accuracy of 93.7% and a worst group accuracy of 89.2%. (If we follow the paper and train on layer 4, we get an accuracy of 86.5%. Not great, but still better than the biased baseline).

(Google Colab link)
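The token-removal baseline is even simpler. A minimal sketch follows; the exact word list we used in the notebook may differ slightly.

```python
import re

# Hypothetical, non-exhaustive list of gendered pronouns and honorifics.
GENDERED = {
    "he", "she", "him", "her", "his", "hers", "himself", "herself",
    "mr", "mrs", "ms", "miss",
}

def strip_gendered_tokens(text: str) -> str:
    """Drop gendered words before computing activations for the probe."""
    kept = [w for w in text.split() if re.sub(r"\W", "", w).lower() not in GENDERED]
    return " ".join(kept)

# e.g., reuse the earlier helper: mean_pooled_acts([strip_gendered_tokens(b)
# for b in biased_bios], layer=1), then refit the logistic-regression probe.
```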

In fact, for some models, you can train directly on the embedding (and de-bias by removing gender-related tokens)

Classifying profession on the Bias in Bios dataset is easy when we mean-pool over all token positions. We were able to train performant probes right after the embedding layer on a different model (training on pythia embeddings worked less well). Our oracle probe (i.e., trained on the original unbiased dataset) achieves an accuracy of 89%. Training on the biased dataset without gender-related tokens achieves an out-of-sample accuracy of 92.7% on the unbiased dataset. (For this specific seed, the de-biased probe achieved higher accuracy than the oracle probe.)

(Google Colab link)
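In other words, a bag-of-embeddings probe suffices. Here is a minimal sketch under the same hypothetical setup as above; pythia is shown for convenience, even though a different model's embeddings worked better in our notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
emb = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped"
).get_input_embeddings().weight.detach()

def bag_of_embeddings(texts):
    """Mean token embedding per bio; no attention or MLP layers involved."""
    feats = []
    for text in texts:
        ids = tok(text, truncation=True, max_length=512)["input_ids"]
        feats.append(emb[torch.tensor(ids)].mean(dim=0))
    return torch.stack(feats).numpy()

# Hypothetical usage: fit on the biased split with gendered tokens stripped,
# then evaluate on the unbiased test split.
# probe = LogisticRegression(max_iter=1000).fit(bag_of_embeddings(debiased_bios), labels)
```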

If not BiB, how do we check that SHIFT works?

We see two safety-relevant applications for SHIFT. The first is improving our classifiers and reward models by removing their reliance on undesirable features. The second is direct cognition-based oversight of powerful models, i.e., disambiguating behaviorally identical classifiers (DBIC). In both cases, SHIFT would need to demonstrate that it provides some utility above simpler methods such as token-level ablations or linear probes. Preferably, it would show improvements that justify the extra cost and complexity of training SAEs. We sketch some possible experiments below.

SHIFT applied to classifiers and reward models

We could try to improve probe performance in situations where they need to rely on high-level features computed by the model (as opposed to token-level features that are easily extracted from the data). For example, if someone used SHIFT to make a better deception probe in Goldowsky-Dill et al. (2025)’s setting, that would be conclusive proof that SHIFT (and SAEs in general) are useful in downstream tasks. 

We are also excited about using SAEs to find the most impactful features used in reward models and see if we could ablate or scale down irrelevant features (see e.g., Riggs and Brinkman’s work on interpreting preference models) to obtain, say, less sycophantic models. This idea was proposed in the original paper, but we haven’t seen anyone try it.

SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers

When a policy achieves high rewards in training, what type of cognition is behind that policy? We think that there could be models that behave identically in training but have different internal goals (e.g., “Do what the human wants” vs. “Do what looks good to humans”) and will act differently during deployment. One proposed application of SHIFT is cognition-based oversight and disambiguating behaviorally identical classifiers (DBIC) (Marks 2024). In short, we look for SAE latents that are implicated in downstream behavior and try to infer the cognitive state of the AI using those SAE latents. This way, we could potentially disambiguate between behaviorally identical models. 

Future studies could try applying SHIFT to distinguish regular models from model organisms of misaligned AI, such as reward hackers or situationally aware models that act differently during deployment. We should also benchmark SHIFT’s performance against other tools that can be used for DBIC, such as linear probes and crosscoder-based model diffing approaches.

Next steps: Focus on what to disentangle, not just on how well you can disentangle it

Researchers have focused on how well a method can disentangle concepts, but we believe they should think harder about what concepts they want to disentangle in the first place. Past studies have tried to separate concept pairs like gender & profession, review sentiment & review product category, and Tokyo & being a city in Asia[2]. These concepts are pretty far from the ones we would want to separate for safety reasons (e.g., doing what the human wants & doing what looks good to the human). Our methods are sufficiently advanced, and short timelines are sufficiently likely, that we should study these methods in safety-relevant settings now[3].

Thanks to Adam Karvonen, Alice Rigg, Can Rager, Sam Marks, and Thomas Dooms for feedback and support.

  1. ^

    The recent alignment auditing paper from Anthropic is an example of interpretability applied to safety, but it's not very similar to SHIFT.

  2. ^

    This is from the RAVEL benchmark. We feel like RAVEL takes a stance on which concepts should be “atomic” in an LLM and tries to disentangle things that we're not sure need to be disentangled. 

  3. ^

    This is different from applying them during training and deployment, which has its own issues.


