instruction tuning and autoregressive distribution shift

Published on September 5, 2024 4:53 PM GMT

[Note: this began life as a "Quick Takes" comment, but it got pretty long, so I figured I might as well convert it to a regular post.]

In LM training, every token provides new information about "the world beyond the LM" that can be used/"learned" in-context to better predict future tokens in the same window.

But when text is produced by autoregressive sampling from the same LM, it is not informative in the same way, at least not to the same extent[1]. Thus, sampling inevitably produces a distribution shift.

I think this is one of the reasons why it's (apparently) difficult to get instruction-tuned / HH-tuned models to report their uncertainty and level of competence accurately, rather than being overconfident.

(I doubt this is a novel point, I just haven't seen it spelled out explicitly before, and felt like doing so.)
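To make the asymmetry concrete, here's a minimal sketch (my own illustration, using GPT-2 via the Hugging Face transformers library, not any particular chat model). It scores a human-written continuation and a model-sampled continuation with the same next-token loss used in training; the sampled text is drawn from the model's own predictive distribution, so it carries no news about the world outside the model, and it typically scores a noticeably lower loss.

```python
# Minimal sketch: the same next-token loss used in training, computed on
# (a) a human-written continuation and (b) a continuation the model sampled
# from itself.  GPT-2 and the example strings are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def mean_token_loss(text: str) -> float:
    """Average next-token cross-entropy -- exactly the LM training objective."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

prompt = "The state of the art in numerical simulation of exotic plasmas is"

# (a) "training-like": a continuation written by something outside the model.
human_text = prompt + " a spectral method developed by a small group at Princeton."

# (b) autoregressive sampling: the model continues the prompt itself.
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
sample_ids = model.generate(
    prompt_ids,
    do_sample=True,
    max_new_tokens=20,
    pad_token_id=tokenizer.eos_token_id,
)
sampled_text = tokenizer.decode(sample_ids[0])

# The sampled continuation contains nothing the model "didn't already expect,"
# so it typically scores lower than genuinely novel human-written text.
print("loss on human continuation:  ", mean_token_loss(human_text))
print("loss on sampled continuation:", mean_token_loss(sampled_text))
```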


Imagine that you read the following (as the beginning of some longer document), and you trust that the author is accurately describing themselves:

I'm a Princeton physics professor, with a track record of highly cited and impactful research in the emerging field of Ultra-High-Density Quasiclassical Exotic Pseudoplasmas (UHD-QC-EPPs).

The state of the art in numerical simulation of UHD-QC-EPPs is the so-called Neural Pseudospectral Method (NPsM).

I made up all those buzzwords, but imagine that this is a real field, albeit one you know virtually nothing about. So you've never heard of "NPsM" or any other competing method.

Nonetheless, you can confidently draw some conclusions just from reading this snippet and trusting the author's self-description:

- There really is a field called "Ultra-High-Density Quasiclassical Exotic Pseudoplasmas," even though you had never heard of it before.
- There really is a method called "NPsM," and it really is the state of the art for numerically simulating these things.
- The author knows this field well, so whatever they go on to say about UHD-QC-EPPs later in the document is probably accurate -- and worth taking on board as you read the rest.

During training, LLMs are constantly presented with experiences resembling this one.

The LLM is shown texts about topics of which it has incomplete knowledge. It has to predict each token from the preceding ones.

Whatever new information the text conveys about the topic may make it into the LLM's weights, through gradient updates on this example. But even before that happens, the LLM can also use the kind of reasoning shown in the bulleted list above to improve its predictions on the text right now (before any gradient updates).

That is, it can do in-context learning, under the assumption that the text was produced by an entity outside itself -- so that each part of the text (potentially) provides new information about the real world, not yet present in the LLM's weights, that has useful implications for the later parts of the same text.
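Here's a toy version of that, reusing the made-up physics professor from above (again just a sketch with GPT-2 via transformers; any causal LM would behave similarly). We score the same sentence with and without the professor's self-description earlier in the context window. No weights change; the earlier tokens alone make the later ones far more predictable.

```python
# Sketch of the in-context effect: new information earlier in the window
# (names, terms, facts the model never saw in training) makes later tokens
# more predictable, with no gradient update involved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def suffix_loss(prefix: str, suffix: str) -> float:
    """Next-token loss on the suffix tokens only, with the prefix in-context."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + suffix, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prefix_len] = -100  # -100 positions are ignored by the loss
    with torch.no_grad():
        return model(full_ids, labels=labels).loss.item()

prefix = ("I'm a Princeton physics professor, with a track record of highly cited "
          "and impactful research in the emerging field of Ultra-High-Density "
          "Quasiclassical Exotic Pseudoplasmas (UHD-QC-EPPs). The state of the art "
          "in numerical simulation of UHD-QC-EPPs is the so-called Neural "
          "Pseudospectral Method (NPsM).")
suffix = " In this article, I review recent advances in NPsM simulations of UHD-QC-EPPs."

# The buzzwords are made up, so the model can't know them from training.
# But with the prefix in the window they become easy to predict -- the
# second number is typically much lower than the first.
print("suffix loss without the prefix:", suffix_loss(" ", suffix))
print("suffix loss with the prefix:   ", suffix_loss(prefix, suffix))
```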

So, all else being equal, LLMs will learn to apply this kind of reasoning to all text, always, ubiquitously.

But autoregressive sampling produces text that is not informative about "the world outside" in the same way that all the training texts were.

During training, when an LLM sees information it doesn't know yet, it's incentivized to think: "ooh, new info! I should leverage this to predict the rest of the text!"  But during sampling, any information in the sampled text which the LLM "doesn't know" is (by definition) confabulated, and updating on it as though it's real will only make the LLM more confused about reality.


In some sense, all instruction/chat tuning (including SFT, RLHF, etc.) is simply a less crude version of the popular style of LLM prompt that starts off like:

You are a highly capable expert at [thing]

That is, instruction/chat tuning is trying to steer the outputs of the model so that it looks like the model is conditioning on "the output is high-quality."

(I often think about these techniques as a form of "ecological evaluation" as I defined it here, just in a weird "meta" way that I hadn't imagined as a possibility when I wrote that post.

Rather than fixing a specific task and then giving the model direct incentives to do a good job at that one task, these methods give the model a bunch of pairs like (task description X, text that does a good job at X), and give the model a direct incentive to produce the latter from the former. The model's generalization and language-understanding abilities are leveraged to learn the general rule "given task description X, what follows is text that does a good job at X," and this works even for X that were never seen in training.)
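Mechanically, the most basic version of this is just supervised fine-tuning on such pairs, with the next-token loss restricted to the "text that does a good job at X" part. A schematic sketch of one such training step follows; the example pair, the chat formatting, and the base model are all illustrative, not a specific real pipeline.

```python
# Schematic SFT step for one (task description X, good performance at X) pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One pair, rendered as a single training sequence.
task_x = "User: Summarize in one sentence why the sky is blue.\nAssistant:"
good_y = " Sunlight scatters off air molecules, and shorter (blue) wavelengths scatter the most."

x_len = tokenizer(task_x, return_tensors="pt").input_ids.shape[1]
input_ids = tokenizer(task_x + good_y, return_tensors="pt").input_ids

# The direct incentive: loss only on the response tokens.  The task
# description is context to condition on, not something to imitate.
labels = input_ids.clone()
labels[:, :x_len] = -100  # ignored by the cross-entropy loss

loss = model(input_ids, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```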

Unfortunately, there is an inherent tension between this kind of thing and the dynamic described above.

You want the model to give the task ("X") its best shot -- to produce something that aligns with its own internal representation of "actual high-quality X performance in the real world," as opposed to "entity Y's typical attempt at X" or "what entity Z thinks it means to do X well."  For instance, with declarative knowledge, we may want the model to report its sense of the latent actual truth that implicitly guides all training texts in one way or another, as opposed to its sense of what some particular person writing this or that particular text would probably say.

So, we (explicitly or implicitly) condition the model on quality.  We convey to the model that it's generating the kind of text which is correlated with actual-truth, as opposed to just being what some guy said: text by "an expert," in a sort of idealized sense of the term.

Effectively, we make the model act as though it's always generating texts similar to my example from the Princeton physics professor, with the exact nature of the (implicit) expertise always precisely selected to be ideal for the task at hand, for correlation with actual-truth and actual-doing-a-good-job.

But then -- all else being equal, i.e. if post-training isn't set up to provide a clear signal that this is bad -- we are effectively maximizing the extent to which the LLM will exhibit the in-context learning dynamic I described at the start, and hence believe its own confabulations!

Hence, I think, the extreme confidence of instruct/chat-tuned models, and their extreme reluctance to revise their opinions (unless directly asked to do so, and sometimes even then), or to say anything amounting to "I notice that I am confused."

Why would it say "whoops, I was wrong, the answer's actually Q (not P like I said before)"?  It's an expert, it would know this sort of thing already.  (What sort of expert? Why, exactly the sort who would know this sort of thing, whatever "this sort of thing" happens to be.)

Why would it notice its own confusion? To do so (and be right), it has to first say something confused. But the ideal expert is never confused in the first place. The best way to be correlated with actual-truth is to only say true things, and never say anything else.


I don't think this is the only reason that it's difficult to get such models to accurately report their own confidence and capability level.

It's also relatively difficult to produce training data / annotations for this kind of behavior.

To produce data that trains the model to always act like an "ideal expert" (even in cases where the model doesn't have the knowledge to back up this facade), the annotator only needs to determine what's actually-true.  This will train the model to do the right thing in cases where it does have the knowledge, and to bullshit in all other cases.

But, to get the model to (e.g.) say "I don't know" instead of bullshitting, the annotator needs to additionally know what the model knows, as distinct from what's actually-true.  And that's hard to determine!  I don't think this is difficult in some deep, fundamental sense[2], but it is at least strictly harder than just providing high-quality demonstrations.

The dynamic described earlier is an additional factor that means the behavior we want never happens by default.  Therefore, we have to explicitly train for it if we want it.  But as just noted, training for it is not easy.

  1. ^

    I include the caveat "at least not to the same extent" because of nuances involving LMs doing CoT-style reasoning, LMs "reminding themselves" of things that they in-some-sense "know" yet sometimes "forget," etc. 

  2. ^

    For instance, one obvious approach would be to start off with HH chat tuning (producing an "expert" that bullshits when it doesn't know the answer), and then do a second tuning phase on text generated by this "expert" that encourages it to be more cautious in cases where the originally generated text wasn't actually-true (and/or where its content was inconsistent across multiple sampling runs, or something).
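    In code, assembling data for that second phase might look something like the sketch below. The helpers `sample_answers` and `is_actually_true` are hypothetical placeholders, standing in for "sample the HH-tuned 'expert' k times on the question" and "check an answer against ground truth," respectively.

```python
# Sketch of building second-phase data: keep confident answers the model
# reliably gets right, and relabel inconsistent or wrong answers with a
# cautious target.  `sample_answers` / `is_actually_true` are placeholders.
from collections import Counter
from typing import Callable, List, Tuple

def build_caution_data(
    questions: List[str],
    sample_answers: Callable[[str, int], List[str]],
    is_actually_true: Callable[[str, str], bool],
    k: int = 8,
    agreement_threshold: float = 0.75,
) -> List[Tuple[str, str]]:
    training_pairs = []
    for q in questions:
        answers = sample_answers(q, k)
        top_answer, top_count = Counter(answers).most_common(1)[0]
        consistent = top_count / k >= agreement_threshold
        correct = is_actually_true(q, top_answer)
        if consistent and correct:
            # The model really does know this; keep its confident answer.
            training_pairs.append((q, top_answer))
        else:
            # Inconsistent across samples, or confidently wrong: train toward
            # caution instead of the confabulated answer.
            training_pairs.append((q, "I'm not sure, and I'd rather not guess."))
    return training_pairs
```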


