MarkTechPost@AI · 20 hours ago
Can We Improve Llama 3’s Reasoning Through Post-Training Alone? ASTRO Shows +16% to +20% Benchmark Gains

Researchers at Meta AI and the University of Washington have developed ASTRO, a novel post-training framework designed to strengthen the reasoning ability of Llama-3.1-70B-Instruct. ASTRO teaches the model to perform in-context search, self-reflection, and backtracking, thereby emulating human problem-solving and traditional symbolic search algorithms. The method substantially improves Llama 3's math performance on several competitive benchmarks. At its core, ASTRO uses Monte Carlo Tree Search (MCTS) to generate chains of thought (CoT), rewrites them into natural language for supervised fine-tuning (SFT) and reinforcement learning (RL), and ultimately enables the model to re-evaluate its reasoning and self-correct when necessary.

🧠 ASTRO is a post-training framework that enhances the reasoning ability of Llama-3.1-70B-Instruct without changing the model architecture.

🔍 ASTRO's core technique is "procedure cloning," which converts search trees into chains of thought (CoT) that include failure and recovery paths and are then used for supervised fine-tuning (SFT).

📈 With SFT, the model achieves significant gains on math benchmarks such as MATH 500, AMC 2023, and AIME 2024, showing improvements even without reinforcement learning.

🔄 ASTRO then applies reinforcement learning (RL), initialized from the SFT checkpoint and trained with Group Relative Policy Optimization (GRPO); the CoTs the model generates grow longer, indicating deeper internal exploration.

✅ The ASTRO-RL model achieves results on MATH 500, AMC 2023, and AIME 2024 that match or exceed models with larger parameter counts, demonstrating the importance of ASTRO's search-aware initialization.

📊 The study shows that backtracking frequency is positively correlated with reasoning success; ASTRO-RL exhibits more self-corrective behavior and deeper exploration over the course of training.

Improving the reasoning capabilities of large language models (LLMs) without architectural changes is a core challenge in advancing AI alignment and usability. Researchers at Meta AI and the University of Washington have introduced ASTRO (Autoregressive Search-Taught Reasoner), a novel post-training framework designed to enhance reasoning in Llama-3.1-70B-Instruct. ASTRO is unique in teaching models to perform in-context search, self-reflection, and backtracking, mechanisms often associated with human problem-solving and traditional symbolic search algorithms. Through this approach, ASTRO boosts Llama 3's math performance on several competitive benchmarks, with gains of +16% to +20%.

Search-Guided Chain-of-Thought Generation

ASTRO’s methodology begins with a Monte Carlo Tree Search (MCTS) over mathematical problem-solving trajectories. This search explores both correct and incorrect reasoning paths. The key innovation is procedure cloning: entire search trees are linearized into long chains of thought (CoT) that naturally encode both failures and recoveries via self-reflection and backtracking. These linearized traces are rewritten in natural language and used as the basis for supervised fine-tuning (SFT).
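To make the procedure-cloning step concrete, here is a minimal sketch of how a search tree could be linearized into a single chain of thought that preserves failed branches and the backtracking that recovers from them. The node structure, traversal order, and self-correction phrasing are illustrative assumptions, not the authors' exact format.

```python
# Illustrative sketch of "procedure cloning": linearizing a search tree into one
# natural-language chain of thought that keeps failed branches plus the
# self-reflection/backtracking that recovers from them.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SearchNode:
    step_text: str                                   # natural-language reasoning step
    is_correct: bool = True                          # whether this branch ultimately succeeds
    children: List["SearchNode"] = field(default_factory=list)

def linearize(node: SearchNode) -> List[str]:
    """Depth-first traversal that emits steps in visit order, inserting a
    backtracking phrase whenever a failed branch is abandoned."""
    lines = [node.step_text]
    for child in node.children:
        lines.extend(linearize(child))
        if not child.is_correct:
            # Emulates the self-correction language seen in ASTRO-style traces.
            lines.append("Wait, this doesn't look right. Let's go back to where we set up the equation.")
    return lines

# Tiny example tree: one failed branch, then a recovery.
root = SearchNode("Set up the equation 2x + 3 = 11.", children=[
    SearchNode("Subtract 2 from both sides: 2x = 9.", is_correct=False),
    SearchNode("Subtract 3 from both sides: 2x = 8, so x = 4.", is_correct=True),
])
print("\n".join(linearize(root)))
```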

This results in a model that doesn’t just solve problems step-by-step but reevaluates its trajectory—often backtracking after self-assessment to correct intermediate reasoning mistakes. For instance, the model may interject with phrases like “Let’s go back to where we set up the equation” when its internal confidence drops.

Supervised Fine-Tuning: Injecting Search Priors

ASTRO fine-tunes Llama-3.1-70B-Instruct on 36.1K curated CoT solutions from MATH, AMC/AIME, and AoPS-style datasets. The model trained with ASTRO-SFT posts gains on MATH 500, AMC 2023, and AIME 2024.

These scores are competitive with or exceed those of baseline and SPOC/Step-KTO variants trained without explicit search priors. Importantly, even SFT alone—without reinforcement learning—yields performance boosts by exposing the model to search-structured reasoning data.
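For readers who want a concrete picture of this stage, below is a minimal sketch of supervised fine-tuning on search-structured traces using a standard Hugging Face-style setup. The dataset contents, prompt formatting, and hyperparameters are placeholders and not ASTRO's actual training recipe.

```python
# Minimal SFT sketch: fine-tune a causal LM on (problem, linearized search trace) pairs.
# Everything below (formatting, hyperparameters, example data) is illustrative only.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Llama-3.1-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each example pairs a problem with its linearized, search-structured CoT.
examples = [{"problem": "Solve 2x + 3 = 11.",
             "cot": "Set up the equation... Let's go back to where we set up the equation... so x = 4."}]

def to_features(ex):
    text = f"Problem: {ex['problem']}\nSolution: {ex['cot']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=8192)

train_ds = Dataset.from_list(examples).map(to_features, remove_columns=["problem", "cot"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="astro-sft", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=1e-5, bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```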

Reinforcement Learning with Search-Aware Initialization

ASTRO proceeds to reinforcement learning (RL) by initializing with the SFT checkpoint and running an RL loop using a modified Group Relative Policy Optimization (GRPO). Unlike standard preference-based RL, ASTRO employs verifiable reward signals (+1 for correct, -1 for incorrect) on 8.7K moderately difficult prompts. During training, the model’s CoT generation grows longer—from ~1.8K to ~6K tokens—demonstrating deeper internal exploration.
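The snippet below sketches the two ingredients named here, a verifiable +1/-1 reward and GRPO's group-relative advantage, under simplifying assumptions: the answer-matching logic is a naive placeholder, and the policy-gradient update, KL regularization, and the authors' specific modifications to GRPO are omitted.

```python
# Sketch of GRPO's core idea with verifiable rewards: sample a group of rollouts
# per prompt, score each +1/-1 by checking the final answer, and normalize
# rewards within the group to obtain advantages (no learned value baseline).
from statistics import mean, pstdev
from typing import List

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """+1 if the rollout's last line contains the reference answer, else -1.
    (Answer extraction here is a deliberately naive placeholder.)"""
    predicted = completion.strip().splitlines()[-1]
    return 1.0 if gold_answer in predicted else -1.0

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: 4 rollouts for one prompt, only the first is correct.
rollouts = ["... so x = 4", "... so x = 5", "... so x = 6", "... so x = 7"]
rewards = [verifiable_reward(c, "x = 4") for c in rollouts]
print(group_relative_advantages(rewards))   # the correct rollout gets a positive advantage
```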

The resulting ASTRO-RL model achieves results on MATH 500, AMC 2023, and AIME 2024 that rival or exceed those of models with larger parameter counts, confirming the importance of ASTRO’s search-aware initialization.

Backtracking Behavior Correlates with Reasoning Success

A striking empirical observation is the positive correlation between backtracking frequency and performance. As training progresses, ASTRO-RL exhibits more self-corrective actions and deeper exploration. Pearson correlation coefficients across benchmarks exceed 0.8, indicating that self-reflection and backtracking are not merely cosmetic behaviors but functionally tied to better accuracy.
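As a rough illustration of how such a correlation could be measured, the sketch below counts backtracking phrases in sampled traces and computes a Pearson coefficient against benchmark accuracy; the phrase list and the numbers are invented placeholders, not data from the paper.

```python
# Sketch: count backtracking phrases per checkpoint and correlate with accuracy.
import re
import numpy as np

BACKTRACK_PATTERNS = [r"let'?s go back", r"wait,", r"on second thought"]

def count_backtracks(trace: str) -> int:
    """Number of backtracking/self-reflection phrases in one generated trace."""
    return sum(len(re.findall(p, trace.lower())) for p in BACKTRACK_PATTERNS)

# Per-checkpoint summary statistics (fabricated example values).
mean_backtracks = np.array([0.4, 0.9, 1.6, 2.3, 3.1])   # avg. backtracks per trace
accuracy        = np.array([0.55, 0.60, 0.66, 0.70, 0.73])  # benchmark accuracy

r = np.corrcoef(mean_backtracks, accuracy)[0, 1]          # Pearson correlation
print(f"Pearson r = {r:.2f}")
```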

Comparative Insights and Broader Impact

Control experiments comparing ASTRO with models trained on direct CoT solutions (no search priors) reveal that, even when trained on the same problem sets and search trees, ASTRO consistently outperforms; ASTRO-RL beats its Direct-RL counterpart across the evaluated benchmarks.

Moreover, ASTRO’s outputs can be visualized as directed graphs, with nodes as reasoning steps and edges capturing transitions, reflections, and corrections—facilitating better interpretability.
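A minimal sketch of this kind of visualization, assuming networkx and an invented edge-labeling scheme (step, reflect, backtrack), is shown below; it is illustrative rather than the authors' tooling.

```python
# Sketch: represent a reasoning trace as a directed graph whose nodes are steps
# and whose edges record forward transitions, reflections, and backtracking jumps.
import networkx as nx

steps = [
    (0, "Set up the equation 2x + 3 = 11."),
    (1, "Subtract 2 from both sides: 2x = 9."),                 # mistaken step
    (2, "Let's go back to where we set up the equation."),      # self-reflection
    (3, "Subtract 3 from both sides: 2x = 8, so x = 4."),       # corrected step
]

G = nx.DiGraph()
for idx, text in steps:
    G.add_node(idx, text=text)

G.add_edge(0, 1, kind="step")        # forward reasoning
G.add_edge(1, 2, kind="reflect")     # model doubts the previous step
G.add_edge(2, 0, kind="backtrack")   # jump back to the revisited step
G.add_edge(0, 3, kind="step")        # retry along a corrected branch

for u, v, data in G.edges(data=True):
    print(f"{u} -> {v} [{data['kind']}]")
```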

ASTRO Key Takeaways

- Base model: Llama-3.1-70B-Instruct, improved through post-training alone, with no architectural changes.
- Core technique: procedure cloning, which linearizes MCTS search trees (including incorrect paths) into long CoTs that encode self-reflection and backtracking.
- SFT: 36.1K curated CoT solutions from MATH, AMC/AIME, and AoPS-style datasets; gains appear even before RL.
- RL: modified GRPO initialized from the SFT checkpoint, with verifiable +1/-1 rewards on 8.7K moderately difficult prompts; CoT length grows from ~1.8K to ~6K tokens.
- Results: ASTRO-RL rivals or exceeds larger models on MATH 500, AMC 2023, and AIME 2024, and outperforms Direct-RL variants trained without search priors.
- Analysis: backtracking frequency correlates positively with accuracy (Pearson r > 0.8 across benchmarks).

Conclusion

ASTRO demonstrates that LLMs like Llama 3 can learn to reason more effectively—not through larger models or longer pretraining, but via principled post-training techniques. By mimicking search algorithms in natural language, ASTRO enables models to think before answering, doubt their own steps, and correct themselves mid-reasoning. This framework sets a new benchmark for fine-tuning open LLMs to approach human-like reasoning through search-inspired behaviors.


Check out the Paper. All credit for this research goes to the researchers of this project.

