Reasoning-Finetuning Repurposes Latent Representations in Base Models

When researchers fine-tune large language models to strengthen their reasoning, they observe an emergent behavior called "backtracking", in which the model outputs words such as "Wait". By computing a "steering vector", the researchers found that this vector induces backtracking in the fine-tuned model but has no effect on the base model. This suggests that the base model may already contain latent concepts related to backtracking, and that fine-tuning learns to exploit these concepts to implement backtracking and thereby improve reasoning. The researchers are still investigating what concept the steering vector actually represents.

✨ In reasoning models fine-tuned with reinforcement learning (such as DeepSeek-R1), "backtracking" (e.g. outputting "Wait") is an emergent behavior that contributes substantially to the models' improved reasoning ability.

🔬 By computing a "steering vector", the researchers were able to induce backtracking in the fine-tuned model; surprisingly, the same vector has no effect on the base model (Llama-3.1-8b), which does not backtrack even when steered.

💡 This suggests that backtracking is not learned entirely from scratch during fine-tuning; the base model likely already contains latent concepts related to backtracking, and fine-tuning learns to activate and exploit them.

🤔 The researchers conjecture that the steering vector represents some "upstream concept" that the base model already tracks and that the fine-tuned model has learned to use for backtracking, but the precise meaning of this concept, for example whether it represents "uncertainty", remains unclear.

Published on July 23, 2025 4:18 PM GMT

Authors: Jake Ward, Chuqiao Lin, Constantin Venhoff, Neel Nanda (*Equal contribution). This work was completed during Neel Nanda's MATS 8.0 Training Phase.

TL;DR

We find a steering vector, computed from base-model activations, which induces backtracking in a reasoning-finetuned model but not in the base model itself. This suggests that reasoning finetuning repurposes a representation the base model already tracks, rather than learning backtracking from scratch.

Introduction

Reasoning models output " Wait" a lot. How did they learn to do this? Backtracking is an emergent behavior in RL-finetuned reasoning models like DeepSeek-R1, and appears to contribute substantially to these models' improved reasoning capabilities. We study representations related to this behavior using steering vectors, and find a direction which is present both in base models and associated reasoning-finetuned models but induces backtracking only in reasoning models. We interpret this direction as representing some concept upstream of backtracking which the base model has learned to keep track of, while only the reasoning model has learned to use this concept for backtracking.

We start with methodology similar to Venhoff et al.: We generate a corpus of reasoning rollouts using DeepSeek-R1, and annotate them to identify backtracking sentences using GPT-4o. Then, we train difference-of-means steering vectors with two novel properties:

1. We compute steering vectors using token positions which occur before backtracking, using a 5-token window which starts 12 tokens before the beginning of a backtracking sentence. This usually includes the beginning of the sentence immediately before a backtracking sentence.
2. We sample activations from a forward pass of the base model (Llama-3.1-8b) at layer 10, and use the computed vectors to steer the reasoning model (DeepSeek-R1-Distill-Llama-8B).
Green highlights represent tokens from which our backtracking steering vectors are computed; red highlights indicate the start of backtracking.
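For concreteness, here is a minimal sketch of the difference-of-means computation described above, using PyTorch and the Hugging Face transformers API. The variable names, the annotation format (`rollouts`), and the contrast set are our own illustrative assumptions, not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-3.1-8B"     # base model; activations taken at layer 10
LAYER, WINDOW, OFFSET = 10, 5, 12    # 5-token window starting 12 tokens before backtracking

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

def layer_acts(text):
    """Residual-stream activations at LAYER for every token position."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0]           # [seq_len, d_model]

pos, neg = [], []
# `rollouts` is assumed: (rollout_text, char_offset_of_backtracking_sentence) pairs,
# produced by annotating DeepSeek-R1 rollouts with GPT-4o as described above.
for text, backtrack_char in rollouts:
    acts = layer_acts(text)
    b = len(tok(text[:backtrack_char])["input_ids"])    # token index where backtracking starts
    start = max(b - OFFSET, 0)
    pos.append(acts[start:start + WINDOW])              # window preceding backtracking
    neg.append(acts[:start])                            # simplified contrast set, for illustration only

steering_vec = torch.cat(pos).mean(0) - torch.cat(neg).mean(0)   # difference of means
```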

Surprisingly, we find that steering vectors computed with this method are highly effective at inducing the emission of backtracking tokens (" Wait", " But", etc.) in the reasoning model, but not in the base model.
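Steering itself can be sketched as a forward hook that adds the (scaled) vector to the reasoning model's residual stream at the same layer during generation. The hook structure and the scale value below are illustrative assumptions, not the authors' settings; the sketch continues from the one above.

```python
# Steer DeepSeek-R1-Distill-Llama-8B with `steering_vec` from the previous sketch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

RL = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
rl_tok = AutoTokenizer.from_pretrained(RL)
rl_model = AutoModelForCausalLM.from_pretrained(RL, torch_dtype=torch.bfloat16)

SCALE = 8.0   # steering strength; an assumed value for illustration

def add_vec(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steering_vec.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = rl_model.model.layers[10].register_forward_hook(add_vec)
prompt = "..."   # a reasoning-rollout prefix (placeholder)
ids = rl_tok(prompt, return_tensors="pt")
out = rl_model.generate(**ids, max_new_tokens=60)
print(rl_tok.decode(out[0], skip_special_tokens=True))
handle.remove()   # swap rl_model for the base model to reproduce the (null) base-model result
```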

Additionally, if we instead sample activations from the reasoning model, we end up computing a nearly identical vector (cosine similarity > 0.7). The two vectors behave similarly: they induce backtracking in the reasoning model, but not in the base model.
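The comparison between the two vectors is just a cosine similarity; with assumed names `vec_from_base` and `vec_from_rl` for the two difference-of-means vectors, the check looks like:

```python
import torch.nn.functional as F

# vec_from_base / vec_from_rl: difference-of-means vectors computed from base-model
# and reasoning-model activations respectively (shape [d_model]); names are ours.
sim = F.cosine_similarity(vec_from_base, vec_from_rl, dim=0).item()
print(f"cosine similarity: {sim:.2f}")   # the post reports > 0.7
```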

Note that the base model never exhibits backtracking behavior, even when steered with the backtracking-inducing vector derived from the reasoning model.

Analysis

Logit Lens

One concern is that these vectors might be doing something trivial, such as directly boosting the logits of backtracking tokens like " Wait". We test this with the Logit Lens and find that this is not the case:

Base model-derived steering vectors (blue) never map to backtracking logits directly when computed at early to mid layers.
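As a concrete illustration, a Logit Lens check of this kind pushes the steering vector through the final layer norm and unembedding and inspects the top-scoring tokens. This sketch assumes the Llama module layout in transformers and reuses `model`, `tok`, and `steering_vec` from the earlier sketch; it is our illustration, not the authors' code.

```python
import torch

# Logit Lens on the base-model-derived steering vector: which tokens does the
# direction itself most strongly promote?
with torch.no_grad():
    v = model.model.norm(steering_vec.to(model.dtype))   # final RMSNorm
    logits = model.lm_head(v)                            # [vocab_size]
    top = torch.topk(logits, k=10)

for tok_id, score in zip(top.indices.tolist(), top.values.tolist()):
    print(f"{tok.decode([tok_id])!r}\t{score:.2f}")
# If the vector were trivially boosting backtracking logits, tokens such as
# " Wait" would dominate this list; the plot above shows they do not at early-to-mid layers.
```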

Baseline Comparison

We compare our results to various baselines, and find that our backtracking steering vector has a substantially greater effect on the emission of the " Wait" token than any of them. The baselines include steering with mean activations, adding random noise, increasing the magnitude of activations, and steering with vectors computed from other reasoning behaviors such as deduction or initialization (problem restatement). Notably, adding random noise to activations does induce some backtracking token emission, but this effect is small compared to that of our backtracking steering vector.
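One way to operationalize this comparison is to steer with each candidate perturbation and measure how often a backtracking token appears in the continuation. The loop below is a hedged sketch: `generate_with_steering` (a wrapper around the hook shown earlier), `deduction_vec`, and `eval_prompts` are assumed helpers and data, not the authors' code.

```python
import torch

d_model = steering_vec.shape[0]
candidates = {
    "backtracking_vec": steering_vec,
    "mean_activation":  torch.cat(pos + neg).mean(0),     # mean-activation baseline
    # random noise scaled to roughly match the steering vector's norm
    "random_noise":     torch.randn(d_model) * steering_vec.norm() / d_model ** 0.5,
    "deduction_vec":    deduction_vec,   # assumed: computed the same way from deduction sentences
}
# (the "increase activation magnitude" baseline needs a multiplicative hook and is omitted here)

def backtrack_rate(vec, prompts, n_tokens=40):
    """Fraction of steered continuations containing a backtracking token."""
    hits = 0
    for p in prompts:
        text = generate_with_steering(rl_model, rl_tok, p, vec, layer=10,
                                      max_new_tokens=n_tokens)   # assumed helper
        hits += any(w in text for w in (" Wait", " But", " However"))
    return hits / len(prompts)

for name, vec in candidates.items():
    print(name, backtrack_rate(vec, eval_prompts))   # eval_prompts: held-out rollout prefixes (assumed)
```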

Conclusion

We take this as evidence that backtracking isn't a totally new behavior learned during finetuning. Instead, base models already contain useful concepts which the reasoning-finetuning process finds and repurposes for emergent reasoning behavior. We're still not sure what concept our steering vector represents, or whether it represents a single monosemantic concept at all. We investigated whether it may represent something abstract like "uncertainty", but so far these investigations are inconclusive. These results should be interpreted as an existence proof for latent reasoning-related representations in base models, and we hope they inspire further investigations into the mechanisms involved in CoT reasoning.


