MarkTechPost@AI, February 1
Memorization vs. Generalization: How Supervised Fine-Tuning SFT and Reinforcement Learning RL Shape Foundation Model Learning

This article examines how supervised fine-tuning (SFT) and reinforcement learning (RL) shape the generalization ability of AI models. The research finds that SFT tends to memorize training data, causing models to perform poorly in new situations: an SFT model may excel at arithmetic problems under a specific rule, for example, but fail once the rule changes. RL, by contrast, optimizes a reward that encourages the model to understand the task structure, so it adapts better to new rules and environments. The researchers designed arithmetic-reasoning and visual-navigation tasks to test generalization in controlled settings, and the results show that RL clearly outperforms SFT on out-of-distribution (OOD) tasks. In addition, while SFT is essential for initializing the model, over-tuning it harms RL's adaptability. RL should therefore be used as a complement to SFT, applied once the model has basic task competence, to improve the robustness of AI systems.

🧮 SFT tends to memorize training data, so models perform poorly in new situations. In the arithmetic-reasoning task, for example, SFT models memorize specific card-color and card-value mappings instead of learning the arithmetic rules, and their performance drops when the rules change.

🚀 RL optimizes a reward that encourages the model to understand the task structure, so it adapts better to new rules and environments. In the visual-navigation task, for example, RL models learn spatial relationships rather than memorizing landmark sequences, and therefore perform better in new city layouts.

🔬 The study tests generalization in controlled settings with an arithmetic-reasoning task (GeneralPoints) and a visual-navigation task (V-IRL), using Llama-3.2-Vision-11B as the base model, trained first with SFT and then with RL.

📈 The experiments show that RL clearly outperforms SFT on out-of-distribution (OOD) tasks. In the arithmetic task, RL models still compute correctly when the card rules change, while SFT models reuse memorized, now-invalid answers. In the visual task, RL models can navigate new cities, while SFT models get lost.

🛠️ SFT is essential for initializing the model, but over-tuning it harms RL's adaptability. The study suggests that over-reliance on SFT can "lock in" memorized patterns and limit RL's ability to explore new solutions. RL should therefore be used as a complement to SFT, applied once the model has basic task competence, to improve the robustness of AI systems.

Modern AI systems rely heavily on post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL) to adapt foundation models for specific tasks. However, a critical question remains unresolved: do these methods help models memorize training data or generalize to new scenarios? This distinction is vital for building robust AI systems capable of handling real-world variability.

Prior work suggests SFT risks overfitting to training data, making models brittle when faced with new task variants. For example, an SFT-tuned model might excel at arithmetic problems using specific card values (e.g., treating ‘J’ as 11) but fail if the rules change (e.g., ‘J’ becomes 10). Similarly, RL’s reliance on reward signals could either encourage flexible problem-solving or reinforce narrow strategies. However, existing evaluations often conflate memorization and true generalization, leaving practitioners uncertain about which method to prioritize. A recent paper from HKU, UC Berkeley, Google DeepMind, and NYU investigates this question by comparing how SFT and RL affect a model’s ability to adapt to unseen rule-based and visual challenges.

To isolate memorization from generalization, the researchers test both methods in controlled settings. They designed two tasks: GeneralPoints (arithmetic reasoning) and V-IRL (visual navigation). Both tasks include in-distribution (ID) training data and out-of-distribution (OOD) variants to test adaptability (a minimal sketch of the GeneralPoints rule change follows the list):

    Rule-Based Generalization (GeneralPoints, shown in Fig 3):
      Task: Create equations equal to 24 using four numbers from playing cards.
      Variants: Change card-value rules (e.g., ‘J’ = 11 vs. ‘J’ = 10) or card colors (red vs. blue).
      Goal: Determine if models learn arithmetic principles or memorize specific rules.
    Visual Generalization (V-IRL, shown in Fig 4):
      Task: Navigate to a target location using visual landmarks.
      Variants: Switch action spaces (absolute directions like “north” vs. relative commands like “turn left”) or test in unseen cities.
      Goal: Assess spatial reasoning independent of memorized landmarks.
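
To make the rule-based variant concrete, here is a minimal, illustrative sketch (not the authors' code) of how a GeneralPoints-style rule change can invalidate a memorized answer: the same four cards map to different values under the ID and OOD rules, so an equation that is correct when ‘J’ = 11 no longer checks out when ‘J’ = 10.

```python
import re

# Illustrative sketch of a GeneralPoints-style rule change (not the paper's code).
# Under the in-distribution (ID) rule, face cards take distinct high values; under
# the out-of-distribution (OOD) rule they count as 10, so memorized equations break.
ID_RULE = {"J": 11, "Q": 12, "K": 13}   # hypothetical ID mapping
OOD_RULE = {"J": 10, "Q": 10, "K": 10}  # hypothetical OOD mapping

def card_values(cards, rule):
    """Map card faces to numeric values under a given rule."""
    return [rule[c] if c in rule else int(c) for c in cards]

def check_equation(expr, cards, rule, target=24):
    """True if expr evaluates to the target and uses exactly the dealt card values."""
    values = sorted(card_values(cards, rule))
    used = sorted(int(tok) for tok in re.findall(r"\d+", expr))
    try:
        return eval(expr) == target and used == values
    except Exception:
        return False

cards = ["J", "5", "7", "3"]
memorized = "(11 - 5) * (7 - 3)"                    # correct when J = 11
print(check_equation(memorized, cards, ID_RULE))    # True
print(check_equation(memorized, cards, OOD_RULE))   # False: J is now worth 10
```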

For the experiments, the study uses Llama-3.2-Vision-11B as the base model, applying SFT first (standard practice) followed by RL. The key experiments measure performance on OOD tasks after each training phase. Let’s now discuss some critical insights from the paper.
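
At a high level, this two-stage recipe can be thought of as the skeleton below. Every callable here is a placeholder rather than the paper's training code; the point is only to show the ordering of SFT and RL and where the verifiable task reward enters.

```python
# Rough skeleton of the post-training recipe described above: a short SFT stage to
# teach the output format, followed by RL on a verifiable task reward. All callables
# are stand-ins (not the paper's implementation); plug in a real model, data, and
# policy-gradient update.

def sft_step(model, batch):
    """One supervised step: imitate the reference solution token by token."""
    # e.g. minimize cross-entropy between model(batch.prompt) and batch.target
    return model

def rl_step(model, prompts, reward_fn):
    """One RL step: sample solutions, score them with the task reward, update."""
    samples = [model(p) for p in prompts]                  # sampled rollouts
    rewards = [reward_fn(p, s) for p, s in zip(prompts, samples)]
    # e.g. apply a policy-gradient update weighted by `rewards`
    return model

def post_train(model, sft_batches, rl_prompt_batches, reward_fn):
    # Stage 1: brief SFT so the base model follows the task format at all.
    for batch in sft_batches:
        model = sft_step(model, batch)
    # Stage 2: RL on the outcome reward (e.g. "does the equation equal 24?",
    # "did the agent reach the target location?").
    for prompts in rl_prompt_batches:
        model = rl_step(model, prompts, reward_fn)
    return model
```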

How do SFT and RL Differ in Learning Mechanisms?

A first insight is mechanistic: SFT trains the model to imitate reference solutions token by token, which rewards reproducing the training distribution, whereas RL optimizes an outcome-based reward, which favors any strategy that actually solves the task, including strategies that transfer to new rules. Another critical insight is that RL benefits from verification iterations: multiple attempts to solve a task within a single training step, each informed by the verifier's feedback on the previous attempt. More iterations (e.g., 10 vs. 1) allow the model to explore diverse strategies, improving OOD performance by +5.99% in some cases.
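
As a rough illustration of what such a verification loop could look like (the `model` and `verify` callables are placeholders, not the paper's interfaces), each rollout gets several attempts, with the checker's feedback appended to the context before the next try:

```python
# Illustrative sketch of verification iterations within a single RL rollout
# (placeholder interfaces, not the paper's code). The model gets up to `max_iters`
# attempts; after each failed attempt the verifier's feedback is appended to the
# context, and the rollout earns a positive reward only if some attempt succeeds.

def rollout_with_verification(model, prompt, verify, max_iters=10):
    """Return the list of (attempt, ok) pairs and the final outcome reward."""
    context = prompt
    attempts = []
    for _ in range(max_iters):
        answer = model(context)              # sample one candidate solution
        ok, feedback = verify(answer)        # task checker, e.g. "result is not 24"
        attempts.append((answer, ok))
        if ok:
            return attempts, 1.0             # solved: positive outcome reward
        context = context + "\n" + feedback  # retry, conditioned on the error
    return attempts, 0.0                     # out of attempts: zero reward
```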

In the performance evaluation, RL consistently outperforms SFT on both tasks, as shown in Figs. 5 and 6 (a short sketch of how such OOD accuracy deltas are computed follows the list):

    Rule-Based Tasks:
      RL improved OOD accuracy by +3.5% (GP-L) and +11.0% (V-IRL-L), while SFT degraded performance by -8.1% and -79.5%, respectively.
      Example: When card rules changed from ‘J=11’ to ‘J=10’, RL models adjusted equations using the new values, whereas SFT models reused invalid memorized solutions.
    Visual Tasks:
      RL boosted OOD performance by +17.6% (GP-VL) and +61.1% (V-IRL-VL), while SFT dropped by -9.9% and -5.6%.
      In V-IRL, RL agents navigated unseen cities by recognizing spatial patterns, while SFT failed due to reliance on memorized landmarks.
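
One way to read these deltas is simply as the change in accuracy on a held-out OOD split before and after a post-training stage. A minimal evaluation sketch, with hypothetical `solve` and `is_correct` callables standing in for the model call and task checker, might look like this:

```python
# Minimal sketch of how an OOD delta can be computed: evaluate the same task's
# out-of-distribution split before and after a post-training stage and report the
# change. `solve` and `is_correct` are hypothetical stand-ins, not part of the
# paper's released code.

def accuracy(solve, is_correct, examples):
    return sum(is_correct(ex, solve(ex)) for ex in examples) / len(examples)

def ood_delta(solve_before, solve_after, is_correct, ood_examples):
    """Change in OOD accuracy attributable to the post-training stage."""
    return (accuracy(solve_after, is_correct, ood_examples)
            - accuracy(solve_before, is_correct, ood_examples))
```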

The study also suggests that SFT is necessary to initialize models for RL. Without SFT, RL struggles because the base model lacks basic instruction-following skills. However, overly tuned SFT checkpoints harm RL's adaptability: RL could not recover OOD performance after excessive SFT. The researchers clarify that their findings, which are specific to the Llama-3.2 backbone model, do not conflict with earlier work such as DeepSeekAI et al. (2025), which proposed that SFT could be omitted for downstream RL training when using alternative base architectures.

In conclusion, this study demonstrates a clear trade-off: SFT excels at fitting the training data but falters under distribution shifts, while RL prioritizes adaptable, generalizable strategies. For practitioners, this implies that RL should follow SFT, and that SFT should be applied only until the model achieves basic task competence. Over-reliance on SFT risks “locking in” memorized patterns, limiting RL’s ability to explore novel solutions. However, RL isn’t a panacea; it requires careful tuning (e.g., the number of verification iterations) and a well-balanced initialization.


Check out the paper. All credit for this research goes to the researchers of this project.
