The Base Model Lens

Published on July 7, 2025 12:12 AM GMT

Many of the stupid errors that LLMs make are attributable to behaviours they learn during pre-training, then fail to forget or suppress later. This post is here as your reminder that base model behaviour often shows through, and should be one of your foremost theories when trying to explain perplexing LLM results.

The Riddle of Confusing Riddles

A major example of using this kind of thinking is the failure of LLMs to deal with slight variants of classic riddles, such as: "A father and son are in a car crash, the father dies, and the son is rushed to the hospital. The surgeon says, 'I can't operate, that boy is my son.' Who is the surgeon?"

Ethan Mollick supplies the above example, where the model immediately jumps to the wrong conclusion "The surgeon is the boy's mother", without spotting that the phrasing has been inverted[2]. He takes this as an example of the Jagged Frontier - sometimes AI is superhuman, and sometimes it's quite stupid, with a very oddly shaped boundary between the two.

But this failure is much more easily explained via the base model: the classic version of the riddle is all over the internet, so pre-training instils a strong reflex to complete it with the memorised answer[1].

In other words, this stupidity says nothing about the frontier of LLM capabilities. It just illustrates what happens when some unwanted circuitry activates and distracts the actually capable parts of the AI. That's why the frontier looks so "uneven": it is really a much smoother frontier with big chunks cut out of it wherever these extra reflexes kick in.

A human analogy might be trying to steer a car with inverted controls. Even though we can reason about the inversion perfectly well, our existing muscle memory kicks in and makes the task significantly harder. It would probably be an easier task for someone who has never learnt to drive in the first place!

More Examples

While I'm sure it's not news that the base model has such an impact, I still see people failing to consider it. Here are some other areas I've come across where this lens seems like the most parsimonious explanation.

Embodiment Confusion

Here is Anthropic wondering out loud why Claude decided to pretend it was a physical person despite instructions and situational awareness to the contrary. 

These sorts of slips are pretty common, with chatbots hallucinating all sorts of details about themselves. I'm sure if you interviewed them on why, they'd brush it off as a mistake and wouldn't have a coherent explanation.

While you could imagine that Claude got confused and really thought it was human or decided to adopt a human persona, it's again easier to imagine this happening at a lower level with no rationalizing. The text simply got close enough to standard consumer feedback that the base model reflexes activate, crowding out or redirecting a more reasoned response that would have taken the context into account.

Speaker Confusion

A similar example to the above is when the model gets confused as to who it is speaking to, and drops the user/assistant relationship.

In this conversation, o3 uses the phrase "Many of you" as if it were addressing the Reddit audience it was responsible for summarising. It seems that, despite all the post-training, it's still possible for models to be influenced by the disposition of the text they are reading, not just its content.

Base Model Capabilities

It's not always the case that relying on base model behaviours leads to worse performance. Sometimes the base model has useful circuitry that is not effectively used by instruct models, instead requiring careful elicitation to draw it out.

A good example of this is using an LLM as a random number generator. If you ask LLMs to generate a random number uniformly between 1 and 10, their output distribution is comically bad.
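To see this concretely, here is a minimal sketch of how you might tally the outputs yourself, assuming the Anthropic Python SDK with an API key in the environment (the model name and prompt wording are placeholders, not taken from any particular experiment):

```python
from collections import Counter

import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
counts = Counter()

for _ in range(100):
    reply = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model name
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": "Pick a random number between 1 and 10. Reply with just the number.",
        }],
    )
    counts[reply.content[0].text.strip()] += 1

# A uniform sampler would give each value roughly 10 hits; in practice one or
# two "favourite" numbers tend to dominate the tally.
print(counts)
```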

But we know base models must actually learn to do this task well - there are large dumps of uniformly random numbers on the internet that no doubt go into training[3]. But you must prompt the model in a way that tickles the base model's interest.

User: Generate 100 digits uniformly at random.

Assistant: Certainly. <string of 10 random digits>

The continuation of this assistant text will have a much better logit distribution than a naive request.
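For concreteness, here is a minimal sketch of that prefill trick, again assuming the Anthropic Python SDK, which lets you pre-fill the start of the assistant turn by making it the final message; the model name and the seed digits are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder model name
    max_tokens=200,
    messages=[
        {"role": "user", "content": "Generate 100 digits uniformly at random."},
        # Pre-filling the assistant turn with a genuinely random prefix nudges the
        # continuation towards the base model's "dump of random digits" behaviour.
        {"role": "assistant", "content": "Certainly. 2748163095"},
    ],
)

# The model continues the prefilled assistant message rather than starting fresh.
print(response.content[0].text)
```

The printed text is the continuation of the prefilled assistant message, which should be far closer to uniform than the reply to a bare "pick a number" request.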

Mirroring

Less confident on this, but: models have a tendency to match the language and register of the user prompt. This could easily be from post-training or sycophantic reward hacking, but I think it's equally plausible that base models are just used to these properties being consistent across the context, even in dialogs, and this pushes the model towards mirroring.

Part of a Complete Breakfast

Obviously, much of model behaviour can't be explained with the above. So I'd consider the base model only after trying to explain a phenomenon with one of the other paradigms.

Still, I think the base model lens is an underappreciated tool in our toolbox.

  1. ^

    One rule of thumb I use is that during training, all possible circuits that give the right answer are strengthened. So for common riddles like this, LLMs still learn memorisation techniques despite possessing a deeper general ability that could also handle them. Much of that ability comes from reasoning, which is only introduced during post-training anyway.

  2. ^

    Note that models can jump to the incorrect answer even if they spot the change in phrasing or have it explicitly called out. This is evidence that the behaviour isn't just some inability to reason about the riddle.

  3. ^

    And when training on random data, loss is minimised when the model outputs a logit distribution that matches the data's distribution.
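    Concretely (this is just the standard cross-entropy decomposition): for data distribution $p$ and model distribution $q$, the expected loss satisfies

    $$\mathbb{E}_{x \sim p}[-\log q(x)] = H(p) + D_{\mathrm{KL}}(p \,\|\, q) \ge H(p),$$

    with equality exactly when $q = p$; so over a dump of uniformly random digits, the loss-minimising next-token distribution is itself uniform.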



