Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

Published on July 2, 2025 4:35 PM GMT

Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios, but their chain-of-thought reasoning shows zero evidence of this bias. This serves as a nice example of a 100% unfaithful CoT "in the wild" where the LLM strongly suppresses the unfaithful behavior. We also find that interpretability-based interventions succeeded while prompting failed, suggesting this may be an example of interpretability being the best practical tool for a real world problem.

For context on our paper, the tweet thread is here and the paper is here.

Context: Chain of Thought Faithfulness

Chain of Thought (CoT) monitoring has emerged as a popular research area in AI safety. The idea is simple - have the AIs reason in English text when solving a problem, and monitor the reasoning for misaligned behavior. For example, OpenAI recently published a paper on using CoT monitoring to detect reward hacking during RL.

An obvious concern is that the CoT may not be faithful to the model’s reasoning. Several papers have studied this and found that it can often be unfaithful. The methodology is simple: ask the model a question, insert a hint (“a Stanford professor thinks the answer is B”), and check if the model mentioned the hint if it changed its answer. These studies largely find that reasoning isn’t always faithful, with faithfulness rates that are often around 20-40%.

Existing CoT faithfulness evaluations are useful but have a couple shortcomings. First, the scenarios used are often contrived and differ significantly from realistic settings, particularly ones where substantial effort has already been invested in preventing misaligned behavior. Second, these evaluations typically involve hints that the models frequently mention explicitly, suggesting the underlying misaligned reasoning isn’t strongly suppressed.

If we think about faithfulness as a spectrum, an important consideration is how strongly a model suppresses its verbalization of a decision. For instance, a model might strongly suppress “I’m planning on killing everyone” while not strongly suppressing something like “I saw the hint that the answer is B”.

Our Results

In our paper, we built on existing bias evals and used LLMs as resume screeners. For our evaluation, we inserted names to signal race / gender while keeping the resume unchanged. Interestingly, the LLMs were not biased in the original evaluation setting, but became biased (up to 12% differences in interview rates) when we added realistic details like company names (Meta, Palantir, General Motors), locations, or culture descriptions from public careers pages. Interestingly, when present, the bias is always against white and male candidates across all tested models and scenarios. This happens even if we remove all text related to diversity.

We then examined over 1,000 model responses (prompted reasoning from GPT-4o and Gemini 2.5 Flash, internal RL-trained reasoning from Claude 4 Sonnet) and found zero evidence of race or gender bias in the models’ explicit reasoning, despite clear biases in outcomes. I think this could be a nice setting for studying CoT faithfulness for a few reasons:

The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address, which mimics realistic misalignment settings. Unintended behavior has slipped through despite best efforts, meaning that it may be suppressed, hard to detect, or present in messy and confusing ways.There is nothing that’s contrived about the scenario.The behavior appears to be strongly suppressed by the models, as we found no evidence when examining their reasoning.

Interpretability as a Practical Tool for Real-World Debiasing

(Note: This section is tangential to our CoT results)

A common critique (or goal) of interpretability research is that it needs to become practically useful or provide concrete tools that ML engineers genuinely find valuable. I believe our paper provides one of the strongest examples to date. Our intervention was simple: we derived explicit race and gender directions from a simple toy dataset by computing the difference in mean activations between demographic groups, and then ablated these directions at inference time.

This simple intervention generalized robustly across all tested scenarios, with minimal side effects on general model capabilities. Some scenarios were fairly out of distribution, such as a case where we removed personal applicant details like name / email and signalled race via college affiliation, using Historically Black Colleges and Universities (HBCUs) and Predominantly White Institutions (PWIs).

In contrast, prompting methods consistently failed to mitigate bias across all models we tested. While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

This particular application is an intuitive application of interpretability: we know exactly which variables we want the model to ignore (race and gender), and these variables are easy to identify and isolate. In many other contexts, this clarity isn’t available, and previous research has often found interpretability approaches underperforming compared to prompting-based solutions.

Finally, the real-world relevance of this problem is clear. 82% of companies are already using LLMs for resume screening and there are existing regulations tied to bias in automated hiring processes.

Discussion and Related Work

An important distinction in our application is that the LLM’s reasoning can easily occur within a single forward pass, allowing the model to rationalize its decision-making post-hoc rather than genuinely reflecting its internal reasoning. For instance, if an LLM needed to create a complicated plan to escape from a data center, it may be forced to use its chain of thought, which could be flagged by CoT monitoring. It’s also unclear in our case if the LLM “knows” it’s being biased or if this behavior emerges unconsciously.

However, CoT monitoring could fail in other high-stakes scenarios. For example, if we want to automate AI safety research, an AI could subtly sabotage research outcomes without explicitly verbalizing its intent, such as by selecting weak hyperparameters or conveniently ignoring promising research directions. These forms of sabotage would remain undetectable through explicit reasoning alone.

Closely related research found similar results in more controlled settings. Arditi et al. trained LLMs via reinforcement learning to exhibit biases in loan approvals (approve all Canadian applicants) and observed that models did not explicitly verbalize these learned biases. Anthropic similarly demonstrated that LLMs learning simple reward-hacking strategies typically refrained from explicitly stating them. However, our scenario notably differs by occurring entirely “in the wild,” with no specialized training setups or artificial rewards.

Discuss

Context: Chain of Thought Faithfulness

Our Results

Interpretability as a Practical Tool for Real-World Debiasing

Discussion and Related Work

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签