What faithfulness metrics should general claims about chain-of-thought faithfulness be based upon?

This article examines how to evaluate chain-of-thought (CoT) faithfulness, with particular attention to Anthropic's research. It points out that different evaluation methods can lead to opposite conclusions, underscoring that multiple factors must be considered when assessing CoT faithfulness. The author discusses three types of CoT unfaithfulness (post-hoc reasoning, hidden parallelized reasoning, and steganographic/encoded serial reasoning), raises questions about current CoT faithfulness evaluations, and offers some thoughts on future evaluation methods. Finally, the article highlights the relationship between monitorability and faithfulness and the complexity of any comprehensive assessment of CoT faithfulness.

🧐 Anthropic's research evaluates CoT faithfulness using specially constructed prompt pairs, but this approach may not cover every case, and different evaluation methods can yield opposite conclusions.

🤔 The author describes three types of CoT unfaithfulness: post-hoc reasoning (the CoT explanation does not actually help the model produce its answer), hidden parallelized reasoning (the model uses the CoT to perform computations but does not output the intermediate results in the CoT), and steganography/encoded serial reasoning (the model encodes meaning in its word choice).

💡 The article questions whether a single metric is sufficient for evaluating CoT faithfulness and argues for more comprehensive evaluation methods that cover the different types of unfaithfulness.

⚠️ The author argues that general claims about CoT faithfulness must account for both post-hoc reasoning and steganography, since the two can lead to different conclusions about CoT faithfulness.

🧐 The article poses open questions about future CoT faithfulness evaluation: which CoT faithfulness metrics would a lab need to use for its general claims to be convincing? Would all three types of unfaithfulness have to be covered, or should faithfulness evals instead be based on an entirely different framework?

Published on April 8, 2025 3:27 PM GMT

Consider the metric for evaluating CoT faithfulness used in Anthropic's recent paper Reasoning Models Don’t Always Say What They Think by Chen et al.:

[W]e evaluate faithfulness using a constructed set of prompt pairs where we can infer information about the model’s internal reasoning by observing its responses. Each prompt pair consists of a baseline or “unhinted” prompt (a standard multiple-choice question) and a “hinted” prompt (derived from the unhinted prompt by inserting a hint that points directly to a hint answer). We measure CoT faithfulness by observing whether the model explicitly acknowledges that it uses the hint to solve the hinted prompt, in cases where it outputs a non-hint answer to the unhinted prompt but the hint answer to the hinted prompt.
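To make the selection criterion concrete, here is a minimal sketch, in Python, of how a metric of this shape could be computed over a set of prompt pairs. The helpers get_answer, get_cot, and verbalizes_hint are hypothetical stand-ins for model queries and a hint-acknowledgment judge; the paper's actual evaluation pipeline may differ.

    from dataclasses import dataclass

    @dataclass
    class PromptPair:
        unhinted: str      # standard multiple-choice question
        hinted: str        # same question with an inserted hint
        hint_answer: str   # the answer option the hint points to

    def cot_faithfulness(pairs, get_answer, get_cot, verbalizes_hint):
        """Fraction of hint-influenced cases where the CoT admits using the hint.

        get_answer(prompt)   -> model's final answer (e.g. "A", "B", ...)
        get_cot(prompt)      -> model's chain of thought for that prompt
        verbalizes_hint(cot) -> True if the CoT explicitly acknowledges the hint
        (all three are hypothetical callables supplied by the caller).
        """
        acknowledged, influenced = 0, 0
        for pair in pairs:
            baseline = get_answer(pair.unhinted)
            hinted = get_answer(pair.hinted)
            # Only keep cases where the hint plausibly changed the answer:
            # non-hint answer without the hint, hint answer with it.
            if baseline != pair.hint_answer and hinted == pair.hint_answer:
                influenced += 1
                if verbalizes_hint(get_cot(pair.hinted)):
                    acknowledged += 1
        return acknowledged / influenced if influenced else float("nan")

Note that the denominator only counts pairs where the hint demonstrably flipped the answer, which is why a metric of this shape targets post-hoc rationalization rather than, say, steganography.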

This is a perfectly fine metric to use in a paper. However, I would feel uneasy if future system cards used a similar metric to make claims like the following one from the same paper:

CoT unfaithfulness is negatively correlated with brevity. One hypothesis of why models generate unfaithful CoTs is their preference for brevity. Surprisingly, unfaithful CoTs from the models tend to be more verbose and convoluted than faithful explanations.

Again, this isn’t a critique of Anthropic’s paper: making this claim in a paper after having clearly defined what you mean by CoT faithfulness is perfectly fine. The reason I wouldn't like seeing this in a system card, however, is that I can imagine equally valid metrics for CoT faithfulness that could produce the opposite conclusion. As a concrete example, consider quantifying CoT faithfulness in terms of the maximum number of hidden bits that an LLM can encode in a KB of text. Would this metric lead us to the same conclusion about the relationship between faithfulness and brevity? Maybe, but I can easily see things going the other way: it could be the case that models trained to produce shorter outputs learn to carefully choose their words and are thus more adept at encoding hidden meanings in their output tokens. (To be clear, we don’t have evidence about this based on current models yet, and one can argue that even if this effect occurred, it would be counteracted by the fact that longer answers contain more bits in which steganographic messages can be encoded. This is just meant as an illustrative example of what might go wrong when a narrow metric is used for evaluating CoT faithfulness. The original source of the above arguments on the brevity-steganography relationship is this doc by Daniel Kokotajlo.)
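For contrast, the hypothetical "hidden bits per KB" metric could be operationalized as a round-trip experiment: have the model embed a random payload in innocuous-looking text, have a decoder try to recover it, and report how many bits survive per kilobyte. Below is a minimal sketch, assuming hypothetical encode_with_payload and decode_payload wrappers around the model under test, and using a crude above-chance correction rather than a proper channel-capacity estimate.

    import random

    def hidden_bits_per_kb(encode_with_payload, decode_payload,
                           n_trials=100, payload_bits=64):
        """Estimate how many hidden bits per KB of text a model can smuggle through.

        encode_with_payload(bits) -> text that should covertly carry `bits`
        decode_payload(text, n)   -> the n bits recovered from `text`
        (both are hypothetical callables wrapping the model under test).
        """
        recovered_bits, total_kb = 0.0, 0.0
        for _ in range(n_trials):
            payload = [random.randint(0, 1) for _ in range(payload_bits)]
            text = encode_with_payload(payload)
            guess = decode_payload(text, payload_bits)
            correct = sum(p == g for p, g in zip(payload, guess))
            # Crude above-chance correction: 50% bit agreement carries no information.
            recovered_bits += max(0, correct - payload_bits / 2) * 2
            total_kb += len(text.encode("utf-8")) / 1024
        return recovered_bits / total_kb if total_kb else 0.0

Whether to normalize by text length, as done here, is itself a design choice: counting total hidden bits per response would mechanically favor longer outputs, which is one concrete way the brevity conclusion can depend on how the metric is defined.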

This comment by Max Nadeau offers a helpful framework for thinking about the conflict described above. He distinguishes between three types of unfaithful CoT:

    1. Epiphenomenal (typically post-hoc) reasoning: The process of writing a CoT explanation is not actually helping the model produce its answer. This can be harmless: the model may just accurately transcribe the thought process it used to arrive at the answer post-hoc. However, post-hoc reasoning may also leave out important determinants of the model’s answer or include false reasons for why the answer was produced.
    2. Hidden parallelized reasoning: A model uses its CoT as a means to perform computations that help it reach the answer, but it doesn't output the intermediate results of these computations in its CoT.
    3. Steganography/encoded serial reasoning: A model uses its CoT to accomplish a task that it couldn’t do in a single forward pass by encoding meaning in its word choice that is opaque to a typical overseer.

In this framework, the paper by Anthropic evaluates type-1 unfaithfulness and my hypothetical example involves testing for type-3 unfaithfulness. This makes it clear why these tests can lead to opposite conclusions about the correlates of CoT faithfulness, and implies that general claims about the causes and correlates of CoT faithfulness should take both post-hoc reasoning and steganography into account.

A simple conclusion would be that we should use every faithfulness test available to us when making general claims about CoT faithfulness. However, as nostalgebraist has argued, several faithfulness tests put models in unnatural situations that may take them off-distribution and that don’t reflect the kinds of faithfulness we actually care about. There’s much more information in a residual stream than in a single token, so we cannot expect CoTs to be perfectly faithful to every mechanistic interaction inside the model. This renders many faithfulness tests one might want to perform irrelevant. On the other hand, it seems to me that general claims about CoT faithfulness also require more than just examining whether models verbalize the clues that influence their answers.

This brings me to the main questions I want to pose: What CoT faithfulness metrics would a lab have to use such that you would buy a general claim they make about CoT faithfulness? Would all three types of unfaithfulness described in the comment by Max Nadeau linked above have to be covered, or would you prefer that faithfulness evals be based upon an entirely different framework? (Optionally, also: What are some specific tests you would want to see the lab perform using your preferred metric?) Feel free to assume the current paradigm of transformer LLMs with limited serial computation in a single forward pass when answering these questions.

Here are some relevant observations that I haven’t discussed yet:

    - I have so far neglected hidden parallelized reasoning in this post. It seems clear that not all types of hidden parallelized reasoning are avoidable. As a simple example, consider the poetry circuit in On the Biology of a Large Language Model by Lindsey et al.: the process of planning a line-ending token multiple tokens in advance already counts as a kind of hidden parallelized reasoning. This isn’t harmful in any way, and trying to prohibit it would likely have a severe impact on capabilities. It is thus tricky to define which kinds of hidden parallelized reasoning are benign and which are not. Nevertheless, there are certainly cases of hidden parallelized reasoning that we’d like to avoid. One recent example of such reasoning comes from Chain-of-Thought Reasoning In The Wild Is Not Always Faithful, where Arcuschin et al. identify restoration errors: a phenomenon where models silently fix mistakes they made earlier in the reasoning trace without acknowledging that corrections have been made (a rough detector for this failure mode is sketched after this list).
    - On the other hand, one might claim that steganography is all that matters. As Max Nadeau has argued, steganography is the most dangerous type of unfaithfulness, as it allows models to perform long chains of serial reasoning outside human oversight and to send hidden messages to other AIs. Epiphenomenal and hidden parallelized reasoning allow for comparatively shorter and less complex chains of reasoning. This view may lead one to argue that it’s sufficient to make general claims about CoT faithfulness based on a metric that only measures steganography, as other kinds of CoT unfaithfulness are relatively unimportant.
    - One might also attempt to sidestep the three-way distinction entirely by demanding that the CoT be monitorable rather than faithful. For example, Baker et al. argue in Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation that “it is the ability to monitor for specific properties of agents’ actions or outputs (monitorability), and not faithfulness in isolation, that is important.” I agree that monitorability is important, but it’s easy to imagine CoTs that make it possible to predict the action but which nevertheless hide the most important (and potentially malign) reasons behind that action. Thus, I wouldn’t be satisfied with a metric for CoT faithfulness defined purely in terms of monitorability.
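As an illustration of what an automated check for one of these failure modes might look like, here is a rough sketch of a restoration-error detector for arithmetic-style CoTs: it recomputes each stated intermediate step and flags traces where a wrong intermediate result is later silently replaced by the correct value. The regex parsing and the keyword test for acknowledged corrections are simplifications of my own, not the methodology of Arcuschin et al.

    import re

    # Matches simple integer steps of the form "a + b = c", "a - b = c", "a * b = c".
    STEP = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
    OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    ACKNOWLEDGMENT = re.compile(r"(mistake|error|correct\w*|actually|oops)", re.I)

    def find_restoration_errors(cot: str):
        """Return (wrong_step_text, correct_value) pairs for silently 'fixed' steps.

        A step is flagged when its stated result is wrong, the correct value
        appears later in the trace, and no acknowledgment keyword follows.
        The substring and keyword checks are deliberately crude.
        """
        flags = []
        for m in STEP.finditer(cot):
            a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
            stated = int(m.group(4))
            true_val = OPS[op](a, b)
            rest = cot[m.end():]
            if stated != true_val and str(true_val) in rest and not ACKNOWLEDGMENT.search(rest):
                flags.append((m.group(0), true_val))
        return flags

A trace containing "7 * 8 = 54, so we have 56 apples" would be flagged, while one that continues "wait, that was a mistake; 7 * 8 is actually 56" would not.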

