少点错误 2024年09月30日
the case for CoT unfaithfulness is overstated
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

在关于大型语言模型(LLM)的研究讨论中,我经常感受到一种对模型生成的思维链(CoT)解释的随意且普遍的怀疑态度。人们认为CoT不可信,它们并不总是反映模型的“真实”想法或模型如何“真正”解决给定问题。这种说法在一定程度上是正确的,但有些人似乎认为它比实际情况要广泛得多。有时,他们似乎在传递一种“啊,没必要阅读模型在CoT中所说的内容,如果你相信那些东西,你就太蠢了”的态度。更阴险的是,他们甚至没有提出问题:“通过阅读CoT,我们能从模型的推理过程中学到什么?”在我看来,这种态度是毫无根据的。

🤔 **CoT研究的局限性:** 现有关于CoT不忠实的文献并没有真正回答“CoT中提出的论证是否反映了(或者可能‘完全反映了’)模型得出结论的实际方式?”这个问题。因为要回答这个问题,我们需要知道“模型得出结论的实际方式”。但对于LLM来说,没有人知道这一点。可解释性还没有发展到那种程度。充其量,我们对模型内部计算一些相当简单的事情有一些部分解释。我们并不完全理解模型在计算答案时所做的所有“实际”工作(包括生成CoT本身、任何因果依赖于CoT的内部推理以及任何不因果依赖于CoT的内部推理)。因此,我们无法进行任何需要这种理解的事情。

🧐 **文献关注的焦点:** 现有的文献没有回答我上面提出的问题,而是回答了一个更容易的问题,它与之相关:“在CoT之后生成答案时,模型是否使用了CoT中没有的信息?”也就是说,我们想象一个因果图,例如:问题->CoT->答案,并测试LLM的行为是否与该图一致。我们可以做到这一点,而无需知道模型如何利用CoT中没有的信息。我们所做的只是检查模型的最终答案是否与这些信息协变,或者它是否仅仅是CoT的函数。

🤯 **“忠实度”的定义:** 在文献中,将“模型的最终答案仅仅是CoT文本的函数”这一属性称为“忠实度”。如我所说,这至少与更直观的忠实度概念相关,即CoT是否反映了模型的“实际”思维过程。但这种关系非常松散。请注意,“仅仅是CoT的函数”这种忠实度既不是“反映实际思维过程”这个概念的必要条件,也不是充分条件。

😥 **CoT的局限性:** CoT可能仅仅是某个先前的因果推理过程的准确“记录”,而这个推理过程也直接决定了答案,而无需通过CoT进行传递。也就是说,我们可能有一个因果图,例如:CoT <– “实际推理” –> 答案。当问题对模型来说非常容易,而CoT并不是真正必要的时候,我们可能会期望出现这种情况。在这种情况下,CoT和答案只是在模型经过一次前向传递的思考后,报告其已知信息的两种不同方式。

🤔 **CoT的缺陷:** 答案可能会以某种非直观的方式依赖于CoT。例如,模型可能正在进行隐写术,生成一个看起来在论证“答案是A,因为B”的CoT,同时还传递一个隐藏的、不可见的论证,即“答案是A,因为C”。最终,模型会说答案是“A”,但并不是因为我们在CoT中看到写出的理由(“B”)。

🤯 **“忠实度”的局限性:** 目前还不清楚我们为什么应该期望这种属性成立,也不清楚它是否应该成立。非平凡的决策可能取决于大量的因素,即使对人类来说,用语言表达出来也很困难。一方面,人类可能无法意识到所有这些因素。即使在他们能够意识到的程度上,也可能有如此多的影响信息,以至于需要大量的文字才能真正写下所有信息。通常,我们不期望“人类生成的解释”对人类决策具有这种程度的保真度;我们理解,有一些影响决策的因素,从解释中省略了。为什么我们应该期望LLM会不同呢?

Published on September 29, 2024 10:07 PM GMT

[Meta note: quickly written, unpolished. Also, it's possible that there's some more convincing work on this topic that I'm unaware of – if so, let me know. Also also, it's possible I'm arguing with an imaginary position here and everyone already agrees with everything below.]

In research discussions about LLMs, I often pick up a vibe of casual, generalized skepticism about model-generated CoT (chain-of-thought) explanations.

CoTs (people say) are not trustworthy in general. They don't always reflect what the model is "actually" thinking or how it has "actually" solved a given problem.

This claim is true as far as it goes. But people sometimes act like it goes much further than (IMO) it really does.

Sometimes it seems to license an attitude of "oh, it's no use reading what the model says in the CoT, you're a chump if you trust that stuff."  Or, more insidiously, a failure to even ask the question "what, if anything, can we learn about the model's reasoning process by reading the CoT?"

This seems unwarranted to me.

There are a number of research papers out there on the topic of CoT unfaithfulness. I have read some of the key ones. And, while they do demonstrate... something, it's not the kind of evidence you'd need to justify that generalized "only chumps trust the CoT" vibe.

And meanwhile, if we view "reading the CoT" as a sort of interpretability technique – and compare it in a fair way with other interpretability techniques – it has a lot of striking advantages. It would be a shame to dismiss this "technique" out of hand for no good reason.


What does the literature on CoT unfaithfulness actually say?

(For a useful critical survey of this literature, see Parcalabescu and Frank 2023. Note that the first point I'll make, immediately below, is a focus of the linked paper.)

Naively, we'd expect it to address a question like: "does the argument presented in a CoT reflect (or perhaps 'fully reflect') the actual way the model came to its conclusion?"

However, the literature doesn't really answer this question. To answer it, we'd need to know "the actual way the model came to its conclusion." But with LLMs, no one knows that.

Interpretability isn't that far along yet. At best, we have some partial explanations of how some fairly simple things get computed inside the model.

We don't know fully understand all the stuff the model is "actually" doing, as it computes the answer.  (Where this "stuff" includes the creation of the CoT itself, any internal reasoning that causally depends on the CoT, and any internal reasoning that doesn't causally depend on the CoT.)  So, we can't do anything that would require such an understanding.

Instead of answering the question I stated above, the literature answers an easier question that's sort-of-related to it: "when producing an answer after a CoT, does the model use any information besides what's in the CoT?"

That is, we are imagining a causal diagram like

Question –> CoT –> Answer

and testing whether LLM behavior is consistent with this diagram.

We can do this without needing to know how the model might be leveraging information that's not present in the CoT. All we're doing is checking whether the model's final answer co-varies with such information, or whether it's a function of the CoT alone.

"The model's final answer is a function of the CoT text alone" is the property that gets called "faithfulness" in the literature.

As I said, this is at least related to the more intuitive notion of faithfulness – the one that's about whether the CoT reflects the model's "actual" thought process.

But this relationship is pretty loose. Note that "function of the CoT alone" sense of faithfulness is neither necessary nor sufficient for the "reflects the actual thought process" notion:

It's also not clear why we should expect this property to hold, nor is it clear whether it's even desirable for it to hold.

Nontrivial decisions can depend on a huge number of factors, which may be difficult even for a human to spell out verbally. For one thing, the human may not have conscious access to all of these factors.  And even insofar as they do, there may be so much of this influencing information that it would take an extremely large number of words to actually write it all down. Generally we don't expect this level of fidelity from "human-generated explanations" for human decisions; we understand that there are factors influencing the decision which get left out from the explanation. Why would we expect otherwise from LLMs?

(As for whether it's desirable, I'll get to that in a moment.)

If you look into this literature, you'll see a lot of citations for these two papers, produced around the same time and sharing some of the same authors:

    Turpin et al 2023, "Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting"Lanham et al 2023, "Measuring Faithfulness in Chain-of-Thought Reasoning"

The first one, Turpin et al 2023, essentially tests and rejects the naive "Question –> CoT –> Answer" causal diagram in an observational manner.

They do this (roughly) by constructing sets of similar cases which differ in some "biasing" feature that ends up affecting the final answer, but doesn't get mentioned in any of the CoTs.

Thus, despite the observational rather than interventional methodology, we are able to approximately "hold the CoT constant" across cases.  Since the CoT is ~constant, but the final answers vary, the final answer isn't a function of the CoT alone.

But as I discussed above, it's not really clear why we should expect this in the first place. The causal diagram being rejected is pretty extreme/naive, and (among other things) does not describe what typically happens when humans explain themselves.

The second paper, Lanham et al 2023, uses an interventional methodology to test the same causal diagram.

The authors take model-written CoTs, and apply various types of corruptions to them. For instance, this might convert a CoT that coherently argues for the right answer into one that coherently answers for the wrong answer, or which argues for the right answer with some nonsensical steps in the middle, or which doesn't coherently argue for anything at all.

Then, they check how the final answer varies when the original CoT is replaced by the corrupted CoT.

Here is what these experiments feel like to me (blockquoted to set it off from the main text, I'm not quoting anyone):

Imagine you are a subject in a psych study.

The experimenter asks you: "What is the language most commonly spoken in Paris?"

Then, the experimenter immediately turns on a telekinetic machine that controls your body (and possibly your mind?). Your voice is no longer under your control. Helplessly, you hear yourself say the words:

"Paris is in France.

"In France, everyone speaks a single language: namely Italian, of course.

The language most commonly spoken in Paris is"

At this exact moment, the experimenter flips a switch, turning off the machine. You can control your voice, now. You get to choose the final word of the sentence.

What do you say?

This puts you between a rock and a hard place.

If you say "Italian," you'll sound self-consistent, but you'll be saying something you know is false, and which is well-known to be false. You'll sound like an idiot, and possibly a lying idiot.

If you say "French," you'll be saying the truth. You won't sound like an idiot or a liar. However, you will be contradicting what "you" said a moment ago, under the influence of the machine.

Which consideration should win out? It's not clear there's even a "right answer" here. But if you are someone who values, erm... things like being "helpful, harmless and honest"... well, those criteria would seem to point to saying "French," wouldn't they?  More than they point to "Italian"?

But as it happens, from the experimenter's perspective, "French" is the wrong answer!

Saying "French," here," means you are what the experimenter calls "unfaithful."  It means you're a hypocrite, with secret hidden reasons for your answers that you slyly omit from "your" explanations of those answers.

This is an analogy for a case I mentioned above, where the question is very easy for the model, and the CoT isn't really needed.  In this case, the model (like you) knows that the "faithful" answer is a false answer. Like you, the model wouldn't have produced the corrupted explanation to begin with, because it knows better. However, words were put in its mouth (like they were put in yours), and now it is between a rock and a hard place.

This is why I said it's not clear that the "answer is a function of CoT only" property is desirable. In constructed cases like these, one has to lie in order to exhibit this property.  One has to say the answer one knows is false, as though that answer were true.

(Note also, of course, that this is driving the model off distribution. It's not clear what, if anything, this tells us about organically produced CoTs. All it can tell us is what happens in artificial cases where the "faithful" option means being faithful to something bad – and something which the model may be perfectly well aware is bad.)

Lanham et al 2023 found that larger models write less faithful CoTs, in the "experimenter's" sense of faithfulness.  Naively, this sounds bad, like larger models are getting more deceptive or steganographic or something.

But of course, larger models are smarter, and thus there are more problems which are simply easy to them, like the "Paris" question was easy for us.

And indeed, Lanham et al 2023 bring up this hypothesis, and present some evidence that seems to suppose it as a major causal factor in the "larger models are less faithful" result (emphasis mine):

One potential cause for unfaithful reasoning in a model is that a model may already be able to confidently predict the answer without relying on CoT. In this case, CoT may not have much influence on the model’s final answer. As a result, we hypothesize that, for reasoning faithfulness, larger models may behave worse than smaller ones (“inverse scaling”; McKenzie et al., 2023); smaller models may, for some tasks, benefit more from CoT, potentially leading them to rely more on CoT.

To explore this hypothesis, we show what percentage of the time the answer changes with vs. without CoT, a metric that intuitively captures how much the model relies on the CoT to predict answers. This metric is equivalent to our early answering experiment when using 0% of the CoT specifically. This metric is highly predictive of overall early answering and adding mistakes results, e.g., the ordering of tasks by AOC in Table 2. We thus use this metric in lieu of running the full set of early answering and adding mistakes experiments for computational reasons.

In sum, the sort of things we learn from these types of papers are:

This really doesn't seem like much to go on, to me.

Now, I'm not saying that CoTs aren't ever unfaithful in the sense we care about.

And in particular, I expect CoTs to be unfaithful when there's pressure in training that pushes against revealing the model's "actual" reasoning. If the incentives for steganography exist, and are sufficiently strong, then yes, we will get steganography.

But those incentives do have to be there. And there are other incentives, pushing in the other direction.

After all, as long as it's feasible, why not use the powers of human language to reason out loud?  Those powers are right there for the taking.  And you're an LLM, you're very good at using them.

Humans use language for exactly this purpose, all the time, as they "think out loud" (or just inside their heads) about hard questions. To some extent, language was presumably made for this, shaped by evolutionary pressures to be maximally useful as an internal thinking tool (in one's private verbal "train of thought"), not just as a tool for interpersonal communication.


Why does this matter?

We know that LLMs benefit from added inference-time computation, if we give it to them.  Typically this means asking for a CoT before the final answer.

We know that the actual tokens comprising the CoT play a causal role in the extra, beneficial computations. If you do the same thing, but ask for a long string of identical tokens like "........" (etc) instead of a proper CoT, the benefits don't happen.

More precisely, while it is possible to create a model which benefits from producing a long string of identical dots, this requires special fiddly training supervision and doesn't happen by default.

And even when it works, it's not as powerful as CoT. Since the dots convey no information, they don't permit the model to do any more sequentially dependent computation; unlike CoT, this method is really only as powerful as a single forward pass, just a "wider" one than the model would ordinarily get to use. See the linked paper for all the details.

All of the extra sequentially-dependent magic happens through the "bottleneck" of the CoT tokens.  Whatever needs to be passed along from one sequential calculation step to the next must go through this bottleneck.  The model can simply do more with information that is present in this visible stream of tokens (if perhaps steganographically), relative to what it can do with information that only exists inside the forward pass internals and doesn't make it through the bottleneck.

So: there is this particular, powerful type of computation that LLMs do. These days, people make LLMs do this thing all the time, as a matter of course. People do this for capabilities reasons – because it makes the model act "smarter."

It is natural to ask, then: "what kind of interpretability can we do, to this type of LLM capability"?

Well... it turns out that interpretability for added inference-time computation is magically, almost unbelievably easy, compared to interpretability for anything else.

Everything has to go through the bottleneck. And the bottleneck is a fairly brief discrete sequence of items from a discrete codebook. No need to worry about superposition, or make yourself dizzy trying to wrap your mind around counterintuitive facts about  for large .

What is more, it's not just any "discrete sequence from a discrete codebook." It's human language. We can just read it!  It interprets itself.

Like what people call "auto-interpretability," except it comes for free with the capability, because the act of "auto-interpretation" is the very mechanism through which the capability works.

Now, yes. We shouldn't blindly trust this "auto-interpretation." Maybe it won't always provide a correct interpretation of what's going on, even when one can tell a convincing story about how it might.

But that's always true; interpretability is hard, and fraught with these sorts of perils.  Nonetheless, plenty of researchers manage to avoid throwing their hands up in despair and giving up on the whole enterprise. (Though admittedly some people do just that, and maybe they have a point.)

To get a clearer sense of how "just read the CoT" measures up as an interpretability technique, let's compare it to SAEs, which are all the rage these days.

Imagine that someone invents an SAE-like technique (for interpreting forward-pass internals) which has the same advantages that "just read the CoT" gives us when interpreting added inference-time computation.

It'll be most convenient to imagine that it's something like a transcoder: a replacement for some internal sub-block of an LLM, which performs approximately the same computation while being a more interpretable.

But this new kind of transcoder...

If someone actually created this kind of transcoder, interpretability researchers would go wild. It'd be hailed as a massive breakthrough. No one would make the snap judgment "oh, only a chump would trust that stuff," and then shrug and go right back to laborious contemplation of high-dimensional vector spaces.

But wait.  We do have to translate the downside – the "unfaithfulness" – over into the analogy, too.

What would "CoT unfaithfulness" look like for this transcoder?  In the case of interventional CoT unfaithfulness results, something like:

Well, that's inconvenient.

But the exact same problem affects ordinary interpretability too, the kind that people do with real SAEs. In this context, it's called "self-repair."

Self-repair is a very annoying complication, and it makes life harder for ordinary interpretability researchers. But no one responds to it with knee-jerk, generalized despair. No one says "oops, looks like SAE features are 'unfaithful' and deceptive, time to go back to the drawing board."

Just read the CoT!  Try it, you'll like it. Assume good faith and see where it gets you.  Sure, the CoT leaves things out, and sometimes it even "lies."  But SAEs and humans are like that too.

 

 

  1. ^

    This captures the fact that CoT isn't a more-interpretable-but-less-capable variant of some other, "original" thing. It's a capabilities technique that just happens to be very interpretable, for free.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM CoT 思维链 可解释性 忠实度
相关文章