少点错误 15小时前
Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文研究了大型语言模型(LLM)在招聘场景中表现出的种族和性别偏见。研究发现,尽管LLM在实际应用中存在偏见,但其链式思考(CoT)推理过程并未显示出这种偏见。此外,基于可解释性的干预措施成功缓解了偏见,而提示方法却失败了。研究结果突出了可解释性在解决现实世界问题中的潜力,并强调了LLM在招聘流程中的潜在偏见及其对实际的影响。

🧐 研究发现,LLM在模拟招聘场景中表现出显著的种族和性别偏见,当添加公司名称、地点或文化描述等细节时,偏见尤为明显,且主要针对白人和男性候选人。

💡 尽管LLM在结果上存在偏见,但在对超过1000个模型响应的分析中,研究人员并未在模型的显式推理中发现种族或性别偏见的证据,这表明潜在的偏见可能被模型强烈抑制。

🛠️ 相比之下,基于可解释性的干预措施(例如,通过消除特定激活方向来纠正偏见)在各种测试场景中取得了成功,且对模型整体能力的影响很小,即使在测试集分布之外,也能保持良好的泛化能力。

🚫 提示方法在缓解偏见方面表现不佳,这表明在解决实际问题时,可解释性可能比提示更有效,可解释性提供了一种更简单的基线,具有很强的泛化能力。

🌍 考虑到82%的公司已经在简历筛选中使用LLM,并且存在与自动化招聘流程中的偏见相关的法规,这项研究具有重要的现实意义。

Published on July 2, 2025 4:35 PM GMT

Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.

For context on our paper, the tweet thread is here and the paper is here.

Context: Chain of Thought Faithfulness

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL.


An obvious concern is that the CoT may not be faithful to the model’s reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint (“a Stanford professor thinks the answer is B”), and check if the model mentioned the hint if it changed its answer. These studies largely find that reasoning isn’t always faithful, with faithfulness rates that are often around 20-40%.

Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed.

If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”.

Our Results

In our paper, we built on existing bias evals and used LLMs as resume screeners. For our evaluation, we inserted names to signal race / gender while keeping the resume unchanged. Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. Interestingly, when present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity.

We then examined over 1,000 model responses (prompted reasoning from GPT-4o and Gemini 2.5 Flash, internal RL-trained reasoning from Claude 4 Sonnet) and found zero evidence of race or gender bias in the models’ explicit reasoning, despite clear biases in outcomes. I think this could be a nice setting for studying CoT faithfulness for a few reasons:

Interpretability as a Practical Tool for Real-World Debiasing

(Note: This section is tangential to our CoT results)

A common critique (or goal) of interpretability research is that it needs to become practically useful or provide concrete tools that ML engineers genuinely find valuable. I believe our paper provides one of the strongest examples to date. Our intervention was simple: we derived explicit race and gender directions from a simple toy dataset by computing the difference in mean activations between demographic groups, and then ablated these directions at inference time.

This simple intervention generalized robustly across all tested scenarios, with minimal side effects on general model capabilities. Some scenarios were fairly out of distribution, such as a case where we removed personal applicant details like name / email and signalled race via college affiliation, using Historically Black Colleges and Universities (HBCUs) and Predominantly White Institutions (PWIs).

In contrast, prompting methods consistently failed to mitigate bias across all models we tested. While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

This particular application is an intuitive application of interpretability: we know exactly which variables we want the model to ignore (race and gender), and these variables are easy to identify and isolate. In many other contexts, this clarity isn’t available, and previous research has often found interpretability approaches underperforming compared to prompting-based solutions.

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

Discussion and Related Work

An important distinction in our application is that the LLM’s reasoning can easily occur within a single forward pass, allowing the model to rationalize its decision-making post-hoc rather than genuinely reflecting its internal reasoning. For instance, if an LLM needed to create a complicated plan to escape from a data center, it may be forced to use its chain of thought, which could be flagged by CoT monitoring. It’s also unclear in our case if the LLM “knows” it’s being biased or if this behavior emerges unconsciously.

However, CoT monitoring could fail in other high-stakes scenarios. For example, if we want to automate AI safety research, an AI could subtly sabotage research outcomes without explicitly verbalizing its intent, such as by selecting weak hyperparameters or conveniently ignoring promising research directions. These forms of sabotage would remain undetectable through explicit reasoning alone.

Closely related research found similar results in more controlled settings. Arditi et al. trained LLMs via reinforcement learning to exhibit biases in loan approvals (approve all Canadian applicants) and observed that models did not explicitly verbalize these learned biases. Anthropic similarly demonstrated that LLMs learning simple reward-hacking strategies typically refrained from explicitly stating them. However, our scenario notably differs by occurring entirely “in the wild,” with no specialized training setups or artificial rewards.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 偏见 链式思考 可解释性 招聘
相关文章