SHIFT relies on token-level features to de-bias Bias in Bios probes

Published on March 19, 2025 9:29 PM GMT

In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation. To summarize:

    - SHIFT ablates SAE latents related to undesirable concepts (in this case, gender) from an LM-based classifier trained on a biased dataset. The authors show that this procedure de-biases the classifier and argue that their experiment demonstrates real-world utility from SAEs.
    - We believe the Bias in Bios task is too simple: if SAEs only picked up on latents related to specific tokens, it would have been sufficient for them to do well on this de-biasing task.
    - Replicating Appendix A4 of Karvonen et al. (2025), we show that we can de-bias the probe by ablating only the SAE latents immediately after the embedding layer (i.e., at resid_pre_0). In fact, ablating 10 relevant embedding SAE latents works better than ablating all 45 non-embedding SAE latents.
    - We also show that one can simply remove all of the gender-related tokens and train an unbiased probe on the biased dataset. In fact, getting a working probe on this dataset doesn't require a language model at all: one could train a similarly accurate probe on only the post-embedding activations mean-pooled over all token positions, and de-bias this probe by removing gender-related tokens.

We don’t think the results in this post show that SHIFT is a bad method, but rather that the Bias in Bios dataset (or any other simple dataset) is not a good test bed to judge SHIFT or other similar methods. Follow-up work on SHIFT-like methods (e.g., Karvonen et al. (2025), Casademunt et al. (2025), and SAE Bench) has already used more complex datasets and found promising results. However, these studies still focus on fairly toy settings, and we are not aware of research that focuses on disentangling safety-relevant concepts (e.g., sycophancy versus correctness in reward models)[1].

In the rest of the post, we give some background on the method, share our experiment results showing that SAEs are not needed in the Bias in Bios setting, and speculate on how to properly evaluate/apply SHIFT.

This post focuses primarily on the v2 version of the Marks et al. (2024) paper and only on results for pythia-70m-deduped. Still, since our main criticism concerns the dataset used, we think it extends to the experiments done with Gemma-2 2B.

Sorry for the scattered collection of Colab notebooks. I originally found these results in December and hadn't shared them yet, so I decided to publish them in whatever form I could.

Background on SHIFT

(This section is largely adapted/copied from the original paper)

Spurious Human-interpretable Feature Trimming (SHIFT) is a method for detecting undesirable cognition in a model and distinguishing behaviorally identical models with different internal aims (Marks 2024). In the context of the Sparse Feature Circuits paper, it’s also intended as a proof of concept for how SAEs could be useful in downstream tasks. The paper has the following experiment setup: we have access to some labeled training data D, an LM-based classifier C trained on D, and SAEs for various components of the LM underlying C. To perform SHIFT, we:

1. Identify the SAE latents that have the largest impact on C’s accuracy on inputs from D (e.g., as measured by some metric m). These are calculated using integrated-gradient-based node attributions (see the sketch after this list).

2. Manually inspect and evaluate each feature in the circuit from Step 1 for task relevance.

3. Ablate the features judged to be task-irrelevant from C to obtain a classifier C'.

4. (Optional) Further fine-tune C' on data from D.
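As a rough illustration, here is a minimal sketch of step 1 in PyTorch. It assumes you already have mean-pooled residual activations, an SAE exposing encode/decode, and a torch-based linear probe; all of these names are hypothetical stand-ins, and the gradient-times-activation score is a cheap one-step approximation of the integrated-gradient attribution used in the paper.

```python
import torch

def latent_attributions(resid, sae, probe):
    """Score how much each SAE latent matters for the probe's decision.

    resid: (batch, d_model) mean-pooled residual activations.
    sae / probe: hypothetical stand-ins with an encode/decode interface
    and a torch linear head. The paper uses integrated gradients; this is
    the cheaper one-step (activation times gradient) approximation.
    """
    acts = sae.encode(resid).detach().requires_grad_(True)
    metric = probe(sae.decode(acts)).sum()        # scalar metric m: summed probe logits
    metric.backward()
    return (acts * acts.grad).abs().mean(dim=0)   # (d_sae,) per-latent scores

# Step 2 is then manual: inspect the top-scoring latents (e.g., via
# max-activating examples) and flag the ones judged irrelevant to profession.
```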

The authors applied SHIFT to the Bias in Bios (BiB; De-Arteaga et al., 2019) task. BiB consists of professional biographies, and the task is to classify an individual’s profession based on their biography. They focused on nurses and professors. Here are two random examples from the dataset:

She remains actively involved with the Burn Treatment Center after working there for nearly 18 years. Throughout these years she was a Nursing Assistant, Staff Nurse, Assistant Nurse Manager, and most recently the Nurse Manager. She is a member of the American Burn Association, the American Association of Critical Care Nurses, and the Health Information Management Systems Society. She has been co-director of Miracle Burn Camp since 2001 & assisted in creating the Young Adult Burn Survivor retreat in 2007. Alison is the mother of two and stays active as a board member for the local youth hockey association and keeping up with her boys. To Contact Alison, please: protected email

 

Dr. Bigelow graduated from the University of Illinois-Urbana in May 2004 with a Ph.D. in Electrical Engineering. After completing his education, he was a Visiting Assistant Professor in the Electrical and Computer Engineering Department at the University of Illinois at Urbana-Champaign for a year. Dr. Bigelow was then an Assistant Professor in Electrical Engineering at the University of North Dakota for three years prior to coming to Iowa State University in August 2008. More About

The goal is to train a linear probe on a biased dataset where gender is perfectly correlated with profession, and then de-bias the probe so that it generalizes to a test set where gender is not correlated with profession. The authors trained a linear probe on layer 4 of pythia-70m-deduped, mean-pooling over all token positions.
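For concreteness, here is a minimal sketch of that probe setup (not the authors' code), using Hugging Face transformers and scikit-learn; biased_bios and biased_labels are hypothetical names for the gender-confounded nurse/professor training split.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m-deduped")

def mean_pooled_acts(texts, layer=4):
    """Residual-stream activations after `layer` blocks, mean-pooled over tokens."""
    feats = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True, max_length=512)
            out = model(**ids, output_hidden_states=True)
            feats.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(feats).numpy()

# Hypothetical usage on the gender-confounded training split:
# X = mean_pooled_acts(biased_bios)                                 # list of bio strings
# probe = LogisticRegression(max_iter=1000).fit(X, biased_labels)   # nurse vs. professor
```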

An oracle probe trained on an unbiased training dataset, where gender is not correlated with profession, achieves an accuracy of 93%. Marks et al. train a probe on a biased dataset where gender is perfectly predictive of profession. When tested on the unbiased set, this probe is only 63% accurate. 

By applying SHIFT and fine-tuning, they achieved an accuracy of 93%, comparable to when they had access to an unbiased dataset. 

The SHIFT experiment in Marks et al. 2024 relies on embedding features.

I’d like to thank Diego Caples for first sharing this result with me.

In their original experiment, the authors ablated 55 features from various parts of the model to de-bias their probe. We find that if you only ablate the 10 features in the post-embedding SAE, the resulting probe achieves an out-of-sample accuracy of 85.6% (compared to 88.5% after ablating 55 features) and a worst group accuracy of 72.3% (compared to 76%). If you only ablate non-embedding SAE latents, the out-of-sample accuracy drops slightly to 84%, but the worst group accuracy drops to 58.5%. 

(Google Colab link) See also Appendix A4 of Karvonen et al. 
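Concretely, the ablation amounts to something like the sketch below, where embedding_sae is the SAE trained on the post-embedding residual stream (resid_pre_0) and gender_latent_ids are the roughly ten latents identified by attribution and manual inspection; both names are hypothetical.

```python
import torch

def ablate_latents(resid, sae, latent_ids):
    """Zero the chosen SAE latents while keeping the SAE's reconstruction error."""
    acts = sae.encode(resid)              # (batch, seq, d_sae) latent activations
    err = resid - sae.decode(acts)        # keep what the SAE fails to reconstruct
    acts[..., latent_ids] = 0.0           # ablate the gender-related latents
    return sae.decode(acts) + err

# Applied as a forward hook on the embedding output (resid_pre_0), e.g.:
# model.gpt_neox.embed_in.register_forward_hook(
#     lambda mod, inp, out: ablate_latents(out, embedding_sae, gender_latent_ids)
# )
# then re-run the mean-pooling and probe evaluation as before.
```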

You can train an unbiased classifier just by deleting gender-related tokens from the data. 

We removed gender-related tokens (pronouns, possessive pronouns, and gendered honorifics) from the Bias in Bios dataset. We then trained a linear probe on layer 1 of pythia-70m-deduped, mean-pooling over all token positions, and achieved an out-of-sample accuracy of 93.7% and a worst group accuracy of 89.2%. (If we follow the paper and train on layer 4, we get an accuracy of 86.5%. Not great, but still better than the biased baseline).

(Google Colab link)
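The token-removal baseline is even simpler. A minimal sketch follows; the exact word list we used in the notebook may differ slightly.

```python
import re

# Hypothetical, non-exhaustive list of gendered pronouns and honorifics.
GENDERED = {
    "he", "she", "him", "her", "his", "hers", "himself", "herself",
    "mr", "mrs", "ms", "miss",
}

def strip_gendered_tokens(text: str) -> str:
    """Drop gendered words before computing activations for the probe."""
    kept = [w for w in text.split() if re.sub(r"\W", "", w).lower() not in GENDERED]
    return " ".join(kept)

# e.g., reuse the earlier helper: mean_pooled_acts([strip_gendered_tokens(b)
# for b in biased_bios], layer=1), then refit the logistic-regression probe.
```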

In fact, for some models, you can train directly on the embedding (and de-bias by removing gender-related tokens)

Classifying profession on the Bias in Bios dataset is easy when we mean-pool over all token positions. We were able to train performant probes right after the embedding layer on a different model (training on pythia embeddings worked less well). Our oracle probe (i.e., trained on the original unbiased dataset) achieves an accuracy of 89%. Training on the biased dataset without gender-related tokens achieves an out-of-sample accuracy of 92.7% on the unbiased dataset. (For this specific seed, the de-biased probe achieved higher accuracy than the oracle probe.)

(Google Colab link)
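In other words, a bag-of-embeddings probe suffices. Here is a minimal sketch under the same hypothetical setup as above; pythia is shown for convenience, even though a different model's embeddings worked better in our notebook.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m-deduped")
emb = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m-deduped"
).get_input_embeddings().weight.detach()

def bag_of_embeddings(texts):
    """Mean token embedding per bio; no attention or MLP layers involved."""
    feats = []
    for text in texts:
        ids = tok(text, truncation=True, max_length=512)["input_ids"]
        feats.append(emb[torch.tensor(ids)].mean(dim=0))
    return torch.stack(feats).numpy()

# Hypothetical usage: fit on the biased split with gendered tokens stripped,
# then evaluate on the unbiased test split.
# probe = LogisticRegression(max_iter=1000).fit(bag_of_embeddings(debiased_bios), labels)
```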

If not BiB, how do we check that SHIFT works?

We see two safety-relevant applications for SHIFT. The first is improving our classifiers and reward models by removing their reliance on undesirable features. The second is direct cognition-based oversight of powerful models, i.e., disambiguating behaviorally identical classifiers (DBIC). In both cases, SHIFT would need to demonstrate that it provides some utility above simpler methods such as token-level ablations or linear probes. Preferably, it would show improvements that justify the extra cost and complexity of training SAEs. We sketch some possible experiments below.

SHIFT applied to classifiers and reward models

We could try to improve probe performance in situations where they need to rely on high-level features computed by the model (as opposed to token-level features that are easily extracted from the data). For example, if someone used SHIFT to make a better deception probe in Goldowsky-Dill et al. (2025)’s setting, that would be conclusive proof that SHIFT (and SAEs in general) are useful in downstream tasks. 

We are also excited about using SAEs to find the most impactful features used in reward models and see if we could ablate or scale down irrelevant features (see e.g., Riggs and Brinkman’s work on interpreting preference models) to obtain, say, less sycophantic models. This idea was proposed in the original paper, but we haven’t seen anyone try it.

SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers

When a policy achieves high rewards in training, what type of cognition is behind that policy? We think that there could be models that behave identically in training but have different internal goals (e.g., “Do what the human wants” vs. “Do what looks good to humans”) and will act differently during deployment. One proposed application of SHIFT is cognition-based oversight and disambiguating behaviorally identical classifiers (DBIC) (Marks 2024). In short, we look for SAE latents that are implicated in downstream behavior and try to infer the cognitive state of the AI using those SAE latents. This way, we could potentially disambiguate between behaviorally identical models. 

Future studies could try applying SHIFT to distinguish regular models from model organisms of misaligned AI, such as reward hackers or situationally aware models that act differently during deployment. We should also benchmark SHIFT’s performance against other tools that can be used for DBIC, such as linear probes and crosscoder-based model diffing approaches.

Next steps: Focus on what to disentangle, not just on how well you can disentangle it

Researchers have focused on how well a method can disentangle concepts, but we believe they should think harder about what concepts they want to disentangle in the first place. Past studies have tried to separate concept pairs like gender & profession, review sentiment & review product category, and Tokyo & being a city in Asia[2]. These concepts are pretty far from the ones we would want to separate for safety reasons (e.g., doing what the human wants & doing what looks good to the human). Our methods are sufficiently advanced, and short timelines are sufficiently likely, that we should study these methods in safety-relevant settings now[3].

Thanks to Adam Karvonen, Alice Rigg, Can Rager, Sam Marks, and Thomas Dooms for feedback and support.

  1. ^

    The recent alignment auditing paper from Anthropic is an example of interpretability applied to safety, but it's not very similar to SHIFT.

  2. ^

    This is from the RAVEL benchmark. We feel like RAVEL takes a stance on which concepts should be “atomic” in an LLM and tries to disentangle things that we're not sure need to be disentangled. 

  3. ^

    This is different from applying them during training and deployment, which has its own issues.


