MarkTechPost@AI · 14 hours ago
Sakana AI Introduces Reinforcement-Learned Teachers (RLTs): Efficiently Distilling Reasoning in LLMs Using Small-Scale Reinforcement Learning

Sakana AI has introduced an innovative framework called Reinforcement-Learned Teachers (RLTs) for improving the reasoning abilities of language models (LLMs). Unlike traditional reinforcement learning (RL) approaches, RLTs focus on training smaller models to act as optimized instructors that generate step-by-step explanations rather than solving problems from scratch. This design shift yields notable gains in distillation quality, cost efficiency, and cross-domain transferability without requiring large models. RLTs are prompted with both the problem and its solution and are trained to produce detailed explanations, which makes the learning process far more efficient.

🧠 At the core of RLTs is a redefinition of the teacher-student paradigm. Unlike conventional RL models, RLTs receive a problem together with its solution and are trained to generate step-by-step explanations rather than solve the problem from scratch. This aligns the RL reward with the student's learning outcome.

💡 RLTs use dense, student-aligned rewards. The training objective is built around two key reward terms: a solution score (rSS) and an explanation score (rKL). rSS measures how well the student can reconstruct the correct solution given the explanation and the problem; rKL measures how logically coherent the teacher's explanation is from the student's perspective.

🚀 RLTs excel at distillation. A 7-billion-parameter RLT outperforms much larger LLMs (e.g., 32B+ models) on several challenging datasets, including AIME 2024, MATH 500, and GPQA Diamond. RLT-7B surpasses DeepSeek R1, Bespoke-7B, and even post-processed RL traces.

🌍 RLTs show strong zero-shot transfer. When applied to a new domain, such as the arithmetic-based "Countdown" task, RLT-trained traces enable student models to surpass direct RL on that domain. This suggests that the skill of "explaining a solution" generalizes across tasks more easily than the skill of "solving from scratch."

🛠️ The RLT training pipeline is efficient and scalable. RLTs require no post-processing, format corrections, or verification filters; raw outputs are directly usable. Training is computationally light, runs in a single-node setup, and both the code and pretrained checkpoints are available.

Sakana AI introduces a novel framework for teaching language models (LLMs) to reason, with a focus on efficiency and reusability: Reinforcement-Learned Teachers (RLTs). Traditional reinforcement learning (RL) approaches in LLMs are plagued by sparse reward signals and prohibitively high computational demands. By contrast, RLTs redefine the teacher-student paradigm by training smaller models to act as optimized instructors, producing step-by-step explanations instead of solving problems from scratch. This design shift enables significant gains in distillation quality, cost-efficiency, and transferability across domains, without the need for large model footprints.

Rethinking Reinforcement Learning for Teaching, Not Solving

Conventional RL setups train models to solve problems autonomously using sparse, correctness-based rewards. These models are often repurposed to teach smaller models, generating reasoning traces for distillation. However, the mismatch between the RL objective (solving problems) and the actual downstream use (teaching) results in inefficiencies. RLTs directly address this by prompting models with both the problem and its solution, requiring them only to generate detailed, pedagogical explanations. The reward signal is dense and student-aligned: it measures how well the student model understands the explanation and reproduces the solution.
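To make the contrast concrete, here is a minimal sketch of the two prompting regimes. The template wording and function names are illustrative assumptions, not Sakana AI's actual prompts.

```python
# Illustrative sketch only: prompt wording and function names are
# hypothetical, not Sakana AI's exact templates.

def build_rlt_teacher_prompt(question: str, solution: str) -> str:
    """RLT setup: the teacher sees the problem AND its correct solution,
    so it only needs to explain, not solve."""
    return (
        "You are a teacher. Given the problem and its correct solution, "
        "write a clear step-by-step explanation a student could learn from.\n\n"
        f"Problem:\n{question}\n\n"
        f"Solution:\n{solution}\n\n"
        "Explanation:"
    )


def build_solver_prompt(question: str) -> str:
    """Conventional RL setup: the model must solve the problem from scratch."""
    return f"Problem:\n{question}\n\nAnswer:"
```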

Core Concept: Dense, Student-Aligned Rewards

The RLT training objective is constructed around two key reward terms:

- Solution score (rSS): how reliably the student model reconstructs the correct solution when given the problem together with the teacher's explanation.
- Explanation score (rKL): how logically coherent the teacher's explanation is from the student's perspective.

These are combined into a dense reward signal that encourages explanations which are both instructive and understandable. Importantly, this bypasses the exploration bottleneck of traditional RL, enabling smaller models to train effectively via RL.
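As a rough illustration of how such a dense signal could be computed, the sketch below assumes that rSS is the student's mean log-likelihood of the solution tokens given the question and the teacher's explanation, and that rKL penalizes explanation tokens the student finds much less likely than the teacher does. The paper's exact formulation may differ, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

# Assumed formulation of the dense, student-aligned reward; the paper's
# exact terms may differ. `student` is any causal LM that returns logits
# (e.g., a Hugging Face-style model); all token ids are precomputed tensors.

def solution_score(student, context_ids, solution_ids):
    """rSS: mean log-likelihood the student assigns to the solution tokens
    when conditioned on the question + teacher explanation (context_ids)."""
    input_ids = torch.cat([context_ids, solution_ids], dim=-1)
    with torch.no_grad():
        logits = student(input_ids).logits[:, :-1]        # next-token predictions
    targets = input_ids[:, 1:]
    logp = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logp[:, -solution_ids.size(-1):].mean()        # only the solution span


def explanation_score(teacher_token_logp, student_token_logp):
    """rKL: penalize explanation tokens the student finds far less likely
    than the teacher does (a KL-like, per-token coherence gap)."""
    return -(teacher_token_logp - student_token_logp).clamp(min=0).mean()


def dense_reward(r_ss, r_kl, alpha: float = 1.0):
    # Combined signal: instructive (student reproduces the solution) and
    # understandable (explanation is coherent from the student's view).
    return r_ss + alpha * r_kl
```

Because every explanation token contributes to the score, the signal is dense rather than a single pass/fail check at the end, which is what lets small teachers learn effectively.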

Surprising Efficacy of Small Teachers

Sakana AI demonstrates that a 7B-parameter RLT outperforms much larger LLMs (e.g., 32B+ models) on distillation tasks across multiple challenging datasets, including AIME 2024, MATH 500, and GPQA Diamond. On a 17K-question corpus, RLT-7B surpasses DeepSeek R1, Bespoke-7B, and even post-processed RL traces as a source of training data for students.

The impact is not just parameter efficiency: RLTs achieve better generalization, fewer formatting errors, and higher interpretability.

Cold-Starting Reinforcement Learning with RLTs

Another critical use case is RL cold-starting, where an initial model is bootstrapped with external data before formal RL training. Traces generated by RLTs serve as more effective cold-start material than those from larger RL-trained models. In fact, even without post-processing or external refinement (e.g., via GPT-4.1), RLT-generated explanations yield higher performance gains after RL fine-tuning.
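Because the traces need no refinement, packaging them as cold-start or distillation data can be as simple as the sketch below. The JSONL schema and field names are assumptions for illustration, not the released pipeline.

```python
import json

# Hypothetical data-prep sketch: turn raw RLT teacher outputs into
# prompt/completion pairs for supervised cold-start or distillation.
# The JSONL schema here is an assumption, not Sakana AI's released format.

def make_record(question: str, explanation: str, solution: str) -> dict:
    return {
        "prompt": f"Problem:\n{question}\n\nThink step by step.",
        "completion": f"{explanation}\n\nFinal answer: {solution}",
    }


def write_cold_start_set(traces, path: str = "rlt_traces.jsonl") -> None:
    """`traces` is an iterable of (question, explanation, solution) triples
    produced by the teacher; no filtering or reformatting is applied."""
    with open(path, "w") as f:
        for question, explanation, solution in traces:
            f.write(json.dumps(make_record(question, explanation, solution)) + "\n")
```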

Out-of-Domain Generalization and Zero-Shot Transfer

RLTs also show strong zero-shot transfer capabilities. When applied to a novel domain—such as the arithmetic-based “Countdown” task—the RLT-trained traces enable student models to surpass even direct RL on the new domain. This indicates that the skill of “explaining a solution” generalizes across tasks more easily than the skill of “solving from scratch,” providing evidence for better reusability of teaching-focused RL models.

Training Pipeline: Efficient and Scalable

The training process is computationally lean:

- RL training of the teacher runs in a single-node setup with modest compute.
- The code and pretrained checkpoints are openly available.

Unlike traditional RL pipelines, RLTs do not require post-processing, formatting corrections, or verification filters—raw outputs are directly usable.
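For intuition, one single-node update of the teacher might look like the REINFORCE-style skeleton below. The actual recipe presumably uses a more elaborate RL algorithm, batching, and baselines, so treat this purely as an assumed sketch; `compute_reward` stands for a callable wrapping the dense, student-aligned reward sketched earlier.

```python
import torch

# Minimal REINFORCE-style sketch of one teacher update on a single node.
# `teacher` and `tokenizer` are assumed to be Hugging Face-style objects;
# all names here are illustrative.

def rl_step(teacher, tokenizer, optimizer, prompts, compute_reward,
            max_new_tokens: int = 512):
    teacher.train()
    total_loss = 0.0
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        # Sample an explanation from the current teacher policy.
        gen = teacher.generate(ids, do_sample=True, max_new_tokens=max_new_tokens)
        expl_ids = gen[:, ids.size(-1):]
        # The dense reward is treated as a constant w.r.t. the teacher's parameters.
        reward = float(compute_reward(prompt, expl_ids))
        # Log-probability of the sampled explanation under the teacher.
        logits = teacher(gen).logits[:, :-1]
        logp = torch.log_softmax(logits, dim=-1).gather(
            -1, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
        expl_logp = logp[:, -expl_ids.size(-1):].sum()
        total_loss = total_loss - reward * expl_logp      # REINFORCE objective
    optimizer.zero_grad()
    (total_loss / len(prompts)).backward()
    optimizer.step()
```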

Evaluation Highlights

- A 7B RLT teacher yields distillation data that lifts student models past those trained on traces from 32B+ RL-trained teachers on AIME 2024, MATH 500, and GPQA Diamond.
- RLT-generated traces provide stronger RL cold-start material than traces from much larger models, even without post-processing.
- RLT-trained students transfer zero-shot to the out-of-domain Countdown task, beating direct RL on that domain.

TL;DR (100 words)

Sakana AI introduces Reinforcement-Learned Teachers (RLTs), a lightweight yet powerful framework for teaching LLMs to reason. Unlike traditional RL models that learn by solving tasks from scratch, RLTs are given both the question and its solution and are trained to generate step-by-step explanations. This setup aligns RL rewards with student learning outcomes, enabling 7B parameter RLTs to outperform much larger LLMs in distillation and cold-start scenarios. RLTs are cost-efficient, transferable across domains, and eliminate the need for expensive post-processing—offering a scalable blueprint for building reasoning-capable LLMs using modest compute and open-source tools.


Check out the Paper and Technical details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

The post Sakana AI Introduces Reinforcement-Learned Teachers (RLTs): Efficiently Distilling Reasoning in LLMs Using Small-Scale Reinforcement Learning appeared first on MarkTechPost.
