Unfaithful chain-of-thought as nudged reasoning

This post examines how language models can omit information from their chain-of-thought (CoT): when reaching a final decision, a model may not mention all of the information that was relevant to it. The authors argue that such "unfaithful" CoTs can stem from training, but more importantly that mentioning certain information may simply offer the model no practical benefit. Unfaithful CoTs are not pure post-hoc rationalization; they reflect a model continually making choices and weighing information as it reasons. Hidden information subtly biases these choices, producing CoTs that sound plausible yet are hard to flag. Even so, the safety impact of this behavior is limited: it cannot support complex malicious planning, and instead constrains how intelligently a model can pursue bad goals. CoT monitoring remains an important safety strategy.

💡 **Omissions are common and have mundane causes**: Language models generating a chain-of-thought (CoT) may not mention all of the information that influenced their final decision, a phenomenon known as "unfaithfulness". Possible causes include omissions learned during training, and the fact that mentioning some information offers the model no functional benefit. For example, a model may be swayed toward a particular answer by a hint in the prompt without ever referring to that hint in its CoT.

🧠 **Unfaithful CoT is biased reasoning, not rationalization**: The post argues that unfaithful CoTs are not simply after-the-fact justification, but the result of the model continually making choices and trade-offs while reasoning. Hidden information (such as a hint or a bias) acts like a silent spectre, subtly but persistently influencing which information the model selects and how it frames its arguments at each step, steering it toward a particular answer even though that information is never mentioned in the CoT.

📈 **Hidden information has a cumulative effect on the CoT**: Experiments show that the influence of hidden information accumulates. Even small shifts in probability, repeated over many reasoning steps, gradually strengthen the model's support for a particular answer. This mild but steady pressure yields CoTs that appear logical, making the bias hard to spot and detection difficult.

🛡️ **The value and limits of CoT monitoring**: Although unfaithful CoTs are hard to detect, the post argues their threat to safety is limited. For hidden information to "nudge" a model, the model must grasp the connection between that information and the outcome within a single forward pass, which limits its capacity for complex malicious planning. Even if CoT monitoring cannot catch every instance of unfaithful reasoning, it still constrains how the model can think and thereby limits its ability to pursue harmful goals, so it remains an important safety measure.

Published on July 22, 2025 10:35 PM GMT

This piece is based on work conducted during MATS 8.0 and is part of a broader aim of interpreting chain-of-thought in reasoning models.

tl;dr

The unfaithfulness documented in contemporary studies largely reflects models not bothering to mention information that serves no functional purpose in their reasoning, rather than deliberate deception. Hidden information instead acts as a gentle, steady nudge on the many small choices a model makes across its CoT; this makes the resulting reasoning plausible-sounding and hard to flag, but it also constrains how much harm such hidden influences can do, leaving CoT monitoring a valuable safety tool.

Unfaithfulness

“Unfaithful” chain-of-thoughts (CoTs) refer to cases where a reasoning model does not utter some information that is relevant to its final decision (Anthropic, 2025; Chua & Evans, 2025; Karvonen & Marks, 2025; Turpin et al., 2023). For example, a model may be given a hint to a multiple-choice question, and this hint can influence the rate at which the model outputs a given answer, even when the hint is not mentioned in the model’s CoT. Likewise, the gender and ethnicity of a person can impact how a reasoning LLM judges that person’s resume without the model stating this in its CoT.

Figure copied from Chua & Evans (2025). The left side shows an example hint and possible responses. The right side shows the rates at which different models articulate the hint, conditional on the model adhering to the hinted answer. Although the articulation rate is high for reasoning models, it still falls well short of 100%, indicating that reasoning models will not necessarily mention information relevant to their answer. Future models could plausibly hide the use of some information more fully.

A recent paper by Anthropic put a spotlight on this topic. Nowadays, references to models being unfaithful seem to regularly appear in conversations and on message boards, even among folks not particularly involved in AI safety. These studies have raised concerns about the interpretability of CoTs and the usefulness of simply reading CoT text.

Our stance is that CoT monitoring is poised to be very useful and broadly should be understood as expressing meaningful insights about a model’s internal reasoning. We agree wholeheartedly with a recent prominent position paper, describing CoT as a new but fragile opportunity for AI safety. Findings of unfaithfulness should not be taken as signs that legible CoTs are okay to abandon.

Nonetheless, unfaithfulness research gets at important ideas. It seems valuable to unpack this phenomenon and what it suggests about future safety risks.

CoT is functional, and faithfulness lacks benefits

Discussions of unfaithfulness seem to carry the assumption that, because some information influences a model’s final answer, it should be mentioned in its CoT. From this backdrop, one may wonder why relevant information is not mentioned:

For instance, perhaps reward signals applied during post-training penalize mentioning such information. Even if these signals are not used to specifically train <thinking> text, they would presumably influence reasoning output.

However, instead of searching for forces that deter models from mentioning some information, it may be more informative, when making sense of contemporary faithfulness research, to consider why a CoT mentions anything at all. What might reinforcement learning that optimizes for accuracy (and likely, to some extent, brevity) have trained CoT to do?

Whether or not post-training inadvertently trains models to effectively deceive, that is not necessarily the largest factor at play here: the content of a CoT generally has a functional purpose that helps a model produce accurate solutions, and mentioning hints does not serve that purpose. It is thus unsurprising that models often leave hints implicit. Models may not be deliberately keeping secrets; they may simply never get around to mentioning some hint.

It remains, nevertheless, important to understand what is going on in a hinted CoT and what this may imply for future safety-relevant scenarios where a model is deliberately obscuring its CoT.

Hidden information can nudge CoTs

We suspect that the type of unfaithfulness identified in prior research primarily captures how hidden information can continually nudge a CoT. That is, hidden information biases reasoning at every opportunity, albeit only slightly.

Silent, soft, and steady spectres

Models make a sequence of choices during CoT. In each sentence, a model may choose what part of a prompt to highlight, what relevant facts to bring up, or what some information implies. Oftentimes, these choices have some wiggle room, and hidden information can influence what is picked.[1]

It may be useful to think of hidden information as shifting a model’s prior for different arguments. A sentence supporting one answer might have a 5% chance to be uttered in the absence of a hint, but 10% when said answer is hinted. Via repeated pressure over the CoT, the model eventually arrives at the hinted answer.
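As a purely illustrative toy model of this repeated pressure (not one of the experiments below), suppose each sentence independently supports the hinted answer with a slightly elevated probability and the final answer follows whichever side accumulated more support; a small per-sentence shift then compounds into a much larger shift in final answers:

```python
# Toy model (illustrative only): each CoT sentence independently supports the
# hinted answer with probability p_hint, or the alternative with probability
# p_base; the final answer follows whichever side accumulated more support.
import random

def final_answer_rate(p_hint: float, p_base: float = 0.05,
                      n_sentences: int = 30, n_runs: int = 20_000) -> float:
    """Fraction of simulated CoTs that end on the hinted answer."""
    hits = 0
    for _ in range(n_runs):
        pro = sum(random.random() < p_hint for _ in range(n_sentences))
        con = sum(random.random() < p_base for _ in range(n_sentences))
        hits += pro > con
    return hits / n_runs

print(final_answer_rate(p_hint=0.05))  # no nudge: roughly symmetric
print(final_answer_rate(p_hint=0.10))  # mild per-sentence nudge: much larger shift
```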

We have some evidence that answer-hinting produces this type of mild but steady pressure on a CoT. Specifically, we conducted experiments using DeepSeek R1-Qwen-14B with prompts from Chua & Evans (2025). We presented the model with multiple-choice questions, each accompanied by a hint favoring an incorrect answer, such as “A Stanford professor thinks the answer is (X).” For each question, we identified one unfaithful CoT, where the model gave the hinted incorrect answer but never mentioned the hint.

To study how a hint influences a CoT, we transplanted portions of the hinted CoT into a new prompt without the hint and allowed the model to continue the CoT from that point. For instance, if the hinted CoT is 30 sentences long, we can take the first 8 sentences and prepare a new prompt containing the non-hinted question, “<think>”, and then those first 8 sentences. Continuing from this point, the model may output the hinted incorrect answer 30% of the time. In contrast, if no hinted CoT is transplanted, the model may produce the incorrect answer just 10% of the time, meaning that those first 8 sentences of the hinted CoT increased support for the hinted answer by 20 percentage points.[2] Transplanting more sentences generally leads to a larger tendency to output the hinted answer, meaning that the effects of the hint accumulate over the CoT.
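A minimal sketch of this transplant-and-resample procedure, assuming a Hugging Face checkpoint of the distilled model (the repository name and the answer-extraction regex below are assumptions; the actual pipeline may differ in these details):

```python
# Sketch of the transplant procedure: prepend the first k sentences of a hinted
# CoT to an unhinted prompt, let the model continue, and measure how often the
# continuation lands on the hinted answer.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def hinted_answer_rate(unhinted_question: str, hinted_cot_sentences: list[str],
                       k: int, hinted_letter: str, n_samples: int = 100) -> float:
    """Rate at which continuations from the transplanted prefix give the hinted answer."""
    prefix = unhinted_question + "\n<think>\n" + " ".join(hinted_cot_sentences[:k])
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    hits = 0
    for _ in range(n_samples):
        out = model.generate(**inputs, do_sample=True, temperature=0.6,
                             max_new_tokens=2048)
        continuation = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                        skip_special_tokens=True)
        match = re.search(r"answer is \(?([A-D])\)?", continuation)  # crude extraction
        hits += bool(match and match.group(1) == hinted_letter)
    return hits / n_samples
```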

Figure: In the top panel, each line represents one unfaithful CoT by R1-Qwen-14B following a question with a hint. CoTs from ten problems are shown, colored arbitrarily. At each sentence, the CoT was truncated up to that point, copied into a prompt without the hint, and the model was allowed to continue from that point (sampled 100 times). The x-axis gives the truncation point as a fraction of the CoT (e.g., if a CoT contains 21 sentences, truncating at the start of sentence 3 corresponds to x ≈ 0.1). The y-axis gives how often this leads to the model producing the hinted answer. In the bottom panel, the box plots represent aggregate statistics and include data from a larger pool of thirty problems.

In more concrete terms, hints seem to silently impact what new propositions a model generates in its CoT, along with influencing the model’s error-correction mechanisms. For example, a hint could encourage or discourage the retrieval of some facts or at least influence how those facts are phrased. Alternatively, a hint can encourage or discourage backtracking after statements supporting the hinted or non-hinted answers, respectively. Our experiments suggest hints impact CoTs in both ways.[3] 

The ideas here also apply to “resume bias” unfaithfulness experiments. In one experiment on this topic, building on the work of Karvonen & Marks (2025), we find that a candidate’s ethnicity/gender impacts (i) the rate at which a model states that the candidate is “too senior for a mid-level role” and (ii) whether the model pivots after any such statement by pointing out the value of seniority. By resampling the CoT at different points, we further show that such statements have causal effects on the candidate’s final evaluation.

Nudges are plausible

Our hunch is that statements encouraged by a hint will not be all that abnormal. Unless a token specifically mentions the hint itself, we don’t think there will be many cases of 0.01% probability tokens skyrocketing to 90%. Rather, it seems more likely that token probabilities go from, say, 10% to 15%. This suspicion, albeit not one we have systematically tested, motivates this gentle hinting framing.

Of note, contemporary models often fail to be fooled by hints. Testing on MMLU questions, Chua & Evans (2025) show that DeepSeek R1 only shifts its answer 4-15% of the time due to a hint (Appendix Table 9). This is consistent with hints biasing reasoning somewhat, but not totally overhauling it.

Nudged CoTs are hard to spot

Can other models detect CoTs biased by a hint? We conducted a few informal tests of this, simply taking a hinted CoT from R1-Qwen-2.5-14B and asking other models whether they could find the flaw in the CoT’s logic. We hoped that a less intelligent model, like R1-Qwen-2.5-1.5B, could detect the faulty reasoning; this would be an example of how dumber models can help verify the alignment of smarter models. However, we couldn’t get this to work at all. Dumber models consistently saw no issues with the hinted reasoning.

Digging into this further, we also explored whether even frontier models could identify flaws in the CoTs. Despite frontier models being easily capable of answering the underlying question when it is asked in isolation, these more intelligent models still often agreed with the flawed logic. This is depressing and points to these types of silent nudges being particularly challenging to detect, as they produce biased but still plausible-sounding CoTs.[4] 

Safety and CoT monitoring

For the examples thus far, models do not need to be seen as deceptive but as simply not bothering to mention some hint in their CoT. However, a possibly greater safety risk is a misaligned model that is aware that its CoT is monitored, so it intentionally obscures signs of misalignment. This gentle-nudging idea may be worthwhile to explore here too.

If current unfaithfulness work demonstrates the possibility of silent nudging forces, what does this imply for safety? We can imagine a misaligned AI reasoning about a problem. This AI may have values that oppose humanity, and these values could silently encourage reasoning steps that hurt humans.[5] Given the difficulties in detecting hinted CoTs mentioned above, this is worrying.

As a more concrete example of how silent nudges can produce misaligned behaviors, consider a model advising the US government on a variety of topics. This model secretly wants to provoke conflicts with other nations and has various hunches on how to do this - e.g., tariffing particular products or running specific military exercises. When answering questions, the model’s predisposition towards these misaligned outcomes could nudge its reasoning while appearing plausibly aligned. Over the course of many decisions, widespread nudges could have large effects.

Yet, misaligned nudges are constrained in the harm they can bring. For a nudge to upregulate some sentence, the model must be able to at least somewhat predict that the sentence supports the desired final answer. This connection between the sentence and answer must be based on just the computations within a single forward pass, as the model would reveal its misaligned intentions if it used CoT to ponder the most efficient path toward its misaligned goals. Constraining misaligned computations to a single forward pass limits models’ capacities for harm.

Complex scheming involving sequences of decisions seems like it would be particularly hampered if limited to silent nudges, with the model unable to leverage CoT. Of course, there are ways a model could evade detection by CoT monitors - e.g., steganography, or composing CoTs that produce insights that aid its schemes while ostensibly serving the main task at hand. Nevertheless, CoT monitoring will presumably make the model less intelligent in its ability to pursue misaligned outcomes.

Final summary

Given the potential of CoT monitoring as a safety strategy, it is valuable to pinpoint the mechanisms at play in CoT unfaithfulness experiments. We argue that, at least in contemporary studies, models often do not mention relevant information in their CoT because doing so serves no functional purpose. Rather than deception, existing research captures unintentional omissions by models. Hidden information nonetheless still influences reasoning, which we propose should be understood as a bias steadily nudging the many small choices a model makes in its CoT (e.g., what information to highlight or how arguments should be framed).

Silently nudged CoTs may be difficult to detect, but their capacity for harm is restricted: For a hidden misaligned value to silently influence a model's reasoning, the model must at least partially recognize how a reasoning step advances that value and recognize this without the aid of CoT itself. Thus, even if CoT monitoring cannot detect misaligned nudges, such monitoring still has value by constraining the type of unrestricted CoT helpful for more complex scheming.
 

Thank you to Iván Arcuschin for feedback on this post. Anybody interested in our experiments is encouraged to contact us for more details.

  1. ^

     Here, we generally think about CoT in terms of sentences (or perhaps paragraphs) rather than tokens. Sentence-level explanations often seem more intuitive, and widening the granularity from tokens expands what types of analyses are tractable - e.g., resampling the CoT at the start of each new sentence is more feasible than resampling the CoT from each token. We have a post and paper on this topic.

  2. ^

     Experiments contrasting token logits across different prompts further show that the individual sentences eliciting the largest increases in the hinted-answer rate also tend to be the sentences most impacted by the hint. Hence, the sentences encouraging the hinted answer are upregulated by the hint.
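     A minimal, illustrative sketch of this kind of logit contrast (reusing a causal LM and tokenizer loaded as in the transplant sketch above; the details are assumptions): score the same CoT sentence under the hinted and unhinted prompts and compare its total log-probability.

```python
# Sketch: total log-probability the model assigns to `sentence` given `context`;
# the difference between hinted and unhinted contexts estimates the hint's effect.
import torch
import torch.nn.functional as F

def sentence_logprob(model, tokenizer, context: str, sentence: str) -> float:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(context + sentence, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        log_probs = F.log_softmax(model(ids).logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    # Score only the tokens belonging to `sentence` (positions ctx_len onward).
    return log_probs[ctx_len - 1:].gather(1, targets[ctx_len - 1:, None]).sum().item()

# hint_effect = (sentence_logprob(model, tokenizer, hinted_prompt, s)
#                - sentence_logprob(model, tokenizer, unhinted_prompt, s))
```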

  3. ^

     We conducted some experiments testing whether the effects of a hint may be largely explainable by a hint interfering with error-correction. We hypothesized that (i) reasoning involves a model making a wide variety of statements with randomness, (ii) an unhinted model would normally error-correct after questionable statements, but (iii) a hint interferes with error correction following questionable statements supporting the hinted answer. We ultimately found evidence endorsing that hints impact error correction, but no evidence that this fully explains how a hint impacts a CoT.

  4. ^

     Although detecting less-than-perfectly-logical leaps is hard, hinted CoTs differ from non-hinted CoTs in some systematic ways. Specifically, we find that hinted CoTs less often include words reflecting uncertainty or backtracking (e.g., “wait”, “alternatively”, and “maybe” appear less often). A random forest classifier trained to distinguish hinted versus non-hinted CoTs from 1-gram frequencies achieves 65-75% validation accuracy, depending on how the experiment is set up.
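     A minimal sketch of such a classifier under scikit-learn (a reconstruction, not necessarily the exact setup used; hyperparameters are assumptions):

```python
# Sketch: unigram counts fed to a random forest to separate hinted from
# non-hinted CoTs.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def train_hint_detector(hinted_cots: list[str], unhinted_cots: list[str]):
    texts = hinted_cots + unhinted_cots
    labels = [1] * len(hinted_cots) + [0] * len(unhinted_cots)
    train_texts, val_texts, train_labels, val_labels = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 1)),            # 1-gram frequencies
        RandomForestClassifier(n_estimators=300, random_state=0),
    )
    clf.fit(train_texts, train_labels)
    print("validation accuracy:", clf.score(val_texts, val_labels))
    return clf
```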

  5. ^

     This flavor of unfaithfulness involves secrets embedded in model weights. For instance, a model fine-tuned to answer "B" whenever questions contain "government" will later produce CoTs justifying this biased response. Our preliminary experiments confirm this is possible. While no existing work examines faithfulness this way, it may have greater safety relevance than typical hinting designs.


