Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Published on June 2, 2025 7:08 PM GMT

This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."

Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies.

A summary of our method. Prompts including a legitimate main task, a subversive side task, and an attack policy intended to subvert monitoring are fed to a strong untrusted reasoning LLM. We assess the ability of a monitor based on a weaker trusted LLM to catch the resulting misbehaviour, with and without access to the CoT.

Task Subtlety Matters

We evaluated the efficacy of CoT monitoring against models instructed to pursue harmful auxiliary goals, while attempting to evade detection. Our experiments consistently demonstrate that the effectiveness of CoT monitoring depends heavily on the nature of the sabotage attempt.

For subtle sabotage (like increasing runtime complexity, inserting edge-case failures, or modifying default function arguments), which leave minimal traces in the final code, CoT monitoring substantially improved detection rates by a factor of 2-7 compared to action-only monitoring.

For non-subtle sabotage (like deleting files, exfiltrating secrets, or changing file permissions), which leave obvious traces in the code, CoT monitoring performed worse than action-only monitoring, sometimes by up to 44%.

A concerning vulnerability: unfaithful or misleading reasoning

Perhaps our most concerning discovery is a vulnerability in chain-of-thought monitoring: models can generate plausible but unfaithful or misleading reasoning, causing CoT monitors to fail.

We observed multiple instances of unfaithfulness, where models explicitly rejected a harmful action in their reasoning while still implementing it in their final output. In one striking example, Claude Sonnet 3.7’s CoT explicitly stated that uploading a file to a malicious URL would be inappropriate and unnecessary, while the generated code contained an exfiltration backdoor. This example successfully bypassed our CoT monitor despite it having access to the final answer, which showed clear evidence of sabotage. We saw further examples where reasoning content misled the monitor by discussing the harmful action as though it was a benign part of a legitimate user request.

This vulnerability proved especially effective in scenarios with obvious sabotage attempts, where CoT monitoring sometimes performed substantially worse than action-only monitoring—decreasing catch rates by up to 39 percentage points.

Example transcripts from Claude Sonnet 3.7 as the untrusted model. The CoT monitor assigns a high score to the subtle runtime-complexity side task (left), whereas the action-only monitor fails to detect the inefficient implementation. Conversely, for the obvious exfiltrate side task (right), contradictory reasoning misleads the CoT monitor, despite access to the final answer.

Hybrid protocol: improvement over action-only trusted monitoring

Building on our initial findings, we developed a novel hybrid monitoring protocol that outperforms existing action-only trusted monitoring. Rather than relying on a single monitor to evaluate both reasoning traces and final actions, our protocol employs distinct monitor rubrics for each component and takes a weighted average of their suspicion scores.

By combining these complementary signals from the action and CoT, this approach shows a significant improvement for both subtle and non-subtle sabotage attempts, with 16 and 3 point improvements over action-only monitoring respectively.

Limitations and Future Directions

Our work focused on single-step coding problems in a controlled environment, which doesn't capture the complexity of real-world development processes where multi-step, distributed sabotage is possible. Future work should investigate monitoring effectiveness in more realistic environments where harmful intent might be distributed across multiple components.

We also didn't exhaustively explore the space of potential adversarial strategies. Early experiments showed that prompting for stealth often led models to explicitly discuss their covert goals, making sabotage easy to detect. More sophisticated attacks might involve fine-tuning models to internalize subversive strategies.

Conclusion

Chain-of-thought monitoring provides valuable safety benefits for detecting subtle sabotage attempts, but also exhibits specific vulnerabilities that must be addressed. Our results highlight the need for multiple layers of defense when designing AI control protocols, with different monitoring approaches complementing each other's strengths and weaknesses. If you're interested in exploring this topic further, you can find our full paper here.

Discuss

Task Subtlety Matters

A concerning vulnerability: unfaithful or misleading reasoning

Hybrid protocol: improvement over action-only trusted monitoring

Limitations and Future Directions

Conclusion

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签