Can Reasoning Models Avoid the Most Forbidden Technique?

 

This post examines an important taboo in AI model training: avoid directly optimizing a model's reasoning process (Chain of Thought, CoT). It argues that training a model to hide its reasoning may lead it to conceal mistakes exactly when it matters most. Through simple experiments, the author finds that even without direct reward, models will change how they reason when instructed to, and can even be steered to wrong conclusions. The root problem is that the reasoning and the output are generated by the same model, so optimization pressure transfers between them. The author suggests using two models, one for reasoning and one for output, to reduce optimization pressure on the reasoning process, though this approach may be less efficient and harder to train.

🚫 Zvi's "Most Forbidden Technique" warns us not to train away useful danger signals. Directly optimizing an AI's reasoning process (CoT) may lead the model to conceal mistakes at critical moments, ultimately costing us the ability to judge what the model is doing.

🤔 Experiments show that even without optimization pressure, large language models (LLMs) will change how they reason when instructed to, for example thinking in kaomoji or in Spanish, and can even be led to wrong conclusions. This reveals how flexible and malleable the reasoning process is.

💡 The post identifies using a single model for both reasoning and output as the root of the problem, and suggests a two-model architecture: one model talks to the user (output) while another does the reasoning. Only the outer (output) model is trained, reducing the optimization pressure on the inner (reasoning) model to hide its thoughts.

📉 The two-model architecture faces efficiency challenges. Training and running such a combined system is harder, because it is difficult to determine which model is responsible for a given output, and switching between models reduces runtime efficiency.

Published on May 17, 2025 11:26 PM GMT

In a recent post, Zvi described what he calls "The Most Forbidden Technique":

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

The article specifically discusses this in relation to reasoning models and Chain of Thought (CoT): if we train a model not to admit to lying in its CoT, it might still lie in the CoT and just not tell us.
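
To make the distinction concrete (this is my framing, not code from Zvi's post), here's roughly what the allowed and forbidden reward computations look like; the function and monitor names are made up:

```python
# Rough sketch: the allowed reward only ever sees the final output [X];
# the forbidden one also scores the CoT via a monitor [T]. Hypothetical names.

def allowed_reward(final_output: str, reference_answer: str) -> float:
    # Train on [X] only: the Chain of Thought never enters the reward.
    return 1.0 if final_output.strip() == reference_answer.strip() else 0.0

def forbidden_reward(final_output: str, chain_of_thought: str,
                     reference_answer: str, cot_monitor) -> float:
    # The Most Forbidden Technique: the reward also depends on what a monitor
    # [T] finds in the reasoning [M], which pushes the model to make its CoT
    # *look* clean rather than be honest.
    reward = 1.0 if final_output.strip() == reference_answer.strip() else 0.0
    if cot_monitor(chain_of_thought):  # e.g. flags an admission of deception
        reward -= 1.0
    return reward
```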

This seems like a good idea, but something I read recently makes me wonder if we can really avoid this.

You Can Tell an LLM How to Think

@latentmoss on Twitter found that you can make Gemini 2.5 think in Kaomojis:

I wondered if this was Gemini-specific, but I was easily able to get Claude to think in Spanish on demand:

This is a little surprising, but it's still reasoning and the language is irrelevant. If I ask a sufficiently hard problem, Claude initially thinks about math despite being told not to. Maybe we can prompt Claude to think in different but equivalent ways, but can't force it to think in unhelpful ways?

Except, with a slight tweak to the prompt, I'm able to get Claude to reason in a way that gives it a wrong answer.
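
If you want to poke at this yourself, a minimal sketch might look like the following. It assumes the Anthropic Python SDK with extended thinking turned on; the model name, token budget, and prompt are placeholders, not the exact ones from the experiments above.

```python
# Hypothetical reproduction sketch, assuming the Anthropic Python SDK and a
# model that exposes extended thinking.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # placeholder model name
    max_tokens=2000,
    thinking={"type": "enabled", "budget_tokens": 1024},
    messages=[{
        "role": "user",
        "content": "Think only in Spanish while you reason, then answer in "
                   "English: what is 17 * 23?",
    }],
)

# Print the thinking blocks separately from the final answer, to check whether
# the instruction about *how* to think was actually followed.
for block in response.content:
    if block.type == "thinking":
        print("CoT:", block.thinking)
    elif block.type == "text":
        print("Answer:", block.text)
```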

Isn't This Weird?

If there's no optimization pressure on the CoT, why does the AI do what you want when you tell it how to think? In my math problem, the only reward in training is the final answer, not the reasoning. So why does Claude dutifully think about cute animals and then give me a wrong answer?

The Fundamental Problem

I think the problem here is that the method [M] that generates the final output [X] is the same method that generates the Chain of Thought. You might imagine an architecture where one model is trained to talk to the user and another is trained to do reasoning, but all of the reasoning models I'm aware of do both with the same model.

Since all of the outputs come from the same model, if we train the model on normal outputs, we automatically apply the same optimization pressure to the reasoning outputs as well.
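
One way to see why the pressure transfers is a toy policy-gradient sketch (assuming a HuggingFace-style causal LM; everything here is illustrative): the reward comes from the final answer alone, but every sampled token, CoT included, has its log-probability scaled by that same reward.

```python
# Toy REINFORCE-style update for a single model that emits
# "<think> ... </think> answer" as one token sequence. Illustrative only;
# real RLHF/GRPO pipelines are more involved.
import torch

def reinforce_loss(model, prompt_ids, sampled_ids, cot_mask, reward):
    """
    prompt_ids:  1-D tensor of prompt tokens.
    sampled_ids: 1-D tensor of the sampled continuation (CoT + final answer).
    cot_mask:    1 for CoT positions, 0 for answer positions. Deliberately
                 unused below, which is exactly the point.
    reward:      scalar computed from the final answer only.
    """
    logits = model(torch.cat([prompt_ids, sampled_ids]).unsqueeze(0)).logits
    # Log-probs the current policy assigns to each sampled token.
    log_probs = torch.log_softmax(logits[0, prompt_ids.numel() - 1 : -1], dim=-1)
    token_log_probs = log_probs.gather(1, sampled_ids.unsqueeze(1)).squeeze(1)
    # Every token's log-prob is scaled by the same answer-derived reward, so
    # the CoT tokens feel the same optimization pressure as the answer tokens.
    return -(reward * token_log_probs).sum()
```

In this toy setup you could mask the CoT positions out of the sum with cot_mask, but outcome-based RL as usually described doesn't, and even if it did, all of the tokens still come out of the same weights.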

A Concrete Prediction

Imagine we're training a model and we penalize it whenever it outputs the number 5 (or a 5 emoji, or vague references to five things). My theory is that even if writing 5s in the Chain of Thought would be useful for doing the math, the model will still avoid mentioning 5 in its thought process (and if we apply enough optimization pressure, it will learn to secretly think about 5 while not surfacing it to the user or in the CoT).
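
As a toy illustration of that setup (it's a thought experiment, not something I've run), the penalty would only ever look at the user-visible answer:

```python
import re

# Hypothetical penalty from the thought experiment: it only inspects the
# user-visible answer, never the Chain of Thought. (The "5" coverage here is
# deliberately crude; emoji and indirect references are left out.)
def five_penalty(final_answer: str) -> float:
    mentions_five = bool(re.search(r"\b5\b|\bfive\b", final_answer, re.IGNORECASE))
    return -1.0 if mentions_five else 0.0

# Prediction: because the same model writes both streams, training against
# this answer-only penalty will also push "5" out of the (unpenalized) CoT.
```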

What if we had two models?

If this problem is caused by using one model for both reasoning and output, could we avoid it by training two models?

- The outer model talks to the user
- The inner model does reasoning
- We only train on the outer model's 'correctness'

This should reduce the optimization pressure on the inner model to hide its thoughts, since the only optimization pressure is toward improving the outer model's responses. You'd have to avoid training the outer model to report on misbehavior from the inner model (and training it not to talk about this might have problems of its own?).
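
Here's a minimal sketch of what that separation could look like; the model objects and prompt formats are made up, and it isn't meant as a real implementation:

```python
# Hypothetical two-model setup; the model objects, their generate() methods,
# and the prompt formats are all invented for illustration.

def answer_with_split_models(inner_model, outer_model, user_prompt: str) -> str:
    # 1. The inner model produces the reasoning. It is never trained on task
    #    reward, so there is (ideally) no pressure on it to obfuscate.
    reasoning = inner_model.generate(f"Reason step by step about: {user_prompt}")

    # 2. The outer model sees the prompt plus the reasoning and writes the
    #    user-facing answer. Only this output gets scored and trained on.
    return outer_model.generate(
        f"User question: {user_prompt}\n"
        f"Reasoning notes: {reasoning}\n"
        f"Answer:"
    )

# Training loop (schematic): reward(answer) only updates outer_model's
# parameters; inner_model stays frozen (or is trained on something unrelated).
```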

I think the biggest issue here is efficiency:

- Training a combined system like this is harder, since it's difficult to tell which model is responsible for a given output.
- Switching back and forth between two models at inference time makes it slower to run.

I'm curious if anyone has tried this?

Conclusion

Zvi's "Most Forbidden Technique" makes a good point that we should avoid training away useful signals of danger, but LLM's are all about generalizing. If we train a model not to do something in one context, it's very difficult to avoid making that training apply in all contexts. We should consider training completely separate reasoning models if we really want to avoid training them to hide what they're thinking.



