MarkTechPost@AI · March 4
Accelerating AI: How Distilled Reasoners Scale Inference Compute for Faster, Smarter LLMs

This article presents a new knowledge distillation method for building subquadratic models with strong reasoning abilities, improving efficiency while preserving reasoning capability. With this method, the distilled models outperform their Transformer teachers on the MATH and GSM8K tasks, reaching comparable accuracy with 2.5× lower inference time. This shows that reasoning and mathematical skills can transfer across model architectures while reducing computational cost. The study also examines how dataset choice affects performance and how supervised fine-tuning can further improve the models. Under fixed compute budgets, the distilled Mamba models outperform Transformers, making them well suited for scalable inference.

🚀 The researchers propose a knowledge distillation method for building subquadratic models with strong reasoning abilities, improving efficiency while preserving reasoning capability.

🧠 Experiments show that the distilled models outperform their Transformer teachers on the MATH and GSM8K tasks, reaching comparable accuracy with 2.5× lower inference time.

📊 The researchers evaluate the distilled models' ability to generate multiple chains of thought (CoT) for mathematical problem solving, with a focus on preserving instruction following. Coverage is measured with pass@k, and accuracy is assessed with a reward model based on Llama-3.1 8B.

Improving how large language models (LLMs) handle complex reasoning tasks while keeping computational costs low is a challenge. Generating multiple reasoning chains and selecting the best answer increases accuracy, but the process demands substantial memory and compute. Long reasoning chains and large batches are expensive to process and slow models down, making them inefficient under bounded computational resources. Alternative architectures offer faster processing and lower memory use, but how well they perform on reasoning tasks is unknown. Understanding whether such models can match or exceed existing ones under limited resources is important for making LLMs more efficient.

Current methods for improving reasoning in large language models rely on generating multiple reasoning chains and selecting the best answer with techniques such as majority voting or trained reward models. These methods improve accuracy but require substantial compute, which makes them ill-suited for large-scale use. The compute and memory demands of Transformer models also slow down inference. Recurrent models and linear attention methods process sequences faster, but their effectiveness on reasoning tasks is less established. Knowledge distillation can transfer knowledge from large models to smaller ones, but whether strong reasoning abilities transfer across different model architectures is unclear.
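To make the two selection strategies concrete, here is a minimal Python sketch of majority voting and Best-of-N reranking over sampled answers. The function names and the toy scorer are illustrative assumptions, not code from the paper; in practice the scoring function would be a trained reward model.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the answer that appears most often among the sampled completions."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: List[str], score_fn: Callable[[str], float]) -> str:
    """Return the answer that a scoring function (e.g., a reward model) ranks highest."""
    return max(answers, key=score_fn)


# Hypothetical usage: final answers extracted from several sampled reasoning chains.
sampled_answers = ["42", "41", "42", "42", "37"]
toy_scores = {"42": 0.9, "41": 0.2, "37": 0.1}  # stand-in for reward-model scores

print(majority_vote(sampled_answers))                    # -> "42"
print(best_of_n(sampled_answers, lambda a: toy_scores[a]))  # -> "42"
```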

To mitigate these issues, researchers from the University of Geneva, Together AI, Cornell University, EPFL, Carnegie Mellon University, Cartesia.ai, Meta, and Princeton University proposed a distillation method to create subquadratic models with strong reasoning skills, improving efficiency while preserving reasoning capabilities. The distilled models outperformed their Transformer teachers on the MATH and GSM8K tasks, achieving similar accuracy with 2.5× lower inference time. This demonstrated that reasoning and mathematical skills can transfer across architectures while reducing computational costs.

The framework included two model types: pure Mamba models (Llamba) and hybrid models (MambaInLlama). Llamba used the MOHAWK distillation method, aligning matrices, matching hidden states, and transferring weights while training on an 8B-token dataset. MambaInLlama retained some Transformer attention layers but replaced the others with Mamba layers, using reverse KL divergence for distillation. Experiments showed that dataset choice had a large effect on performance: certain datasets lowered Llamba-1B accuracy by 10% and revealed a poor correlation between general benchmarks and mathematical reasoning, underscoring the importance of better training data.
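The reverse KL objective mentioned above can be sketched in PyTorch as follows. This is a generic illustration of distilling a student against a teacher's next-token distributions with reverse KL, assuming standard (batch, sequence, vocabulary) logit tensors; it is not the authors' training code, and the model names in the usage comment are placeholders.

```python
import torch
import torch.nn.functional as F


def reverse_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(student || teacher), averaged over batch and sequence positions.

    Reverse KL is mode-seeking: the student is pushed to concentrate its mass
    where the teacher already assigns high probability, rather than spreading
    over every mode of the teacher distribution.
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    # Sum over the vocabulary, then average over batch and positions.
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(dim=-1)
    return kl.mean()


# Hypothetical training step (placeholder model names):
# student_logits = student_model(input_ids).logits
# with torch.no_grad():
#     teacher_logits = teacher_model(input_ids).logits
# loss = reverse_kl_loss(student_logits, teacher_logits)
# loss.backward()
```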

Researchers evaluated the distilled models on generating multiple chains of thought (CoTs) for math problem solving, focusing on whether instruction following was retained. They measured coverage with pass@k, which estimates the probability of finding a correct solution among k samples, and assessed accuracy through majority voting and Best-of-N selection with a Llama-3.1 8B-based reward model. Benchmarks showed the distilled models ran up to 4.2× faster than Llama models while maintaining comparable coverage, generated more completions within fixed compute budgets, and outperformed smaller Transformer baselines in both speed and accuracy. Furthermore, supervised fine-tuning (SFT) after distillation improved performance further, confirming the models' effectiveness in structured reasoning tasks such as coding and formal proofs.
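pass@k is commonly reported with the standard unbiased estimator popularized by the HumanEval benchmark: given n sampled completions of which c are correct, it estimates the probability that at least one of k randomly drawn samples is correct. The snippet below is an illustrative implementation of that estimator, not the authors' evaluation code, and the example numbers are hypothetical.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), computed stably as a running product."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Hypothetical example: 64 completions per problem, 20 judged correct.
print(pass_at_k(n=64, c=20, k=8))  # ~0.96
```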

In summary, the distilled Mamba models improved reasoning efficiency, retaining accuracy while cutting inference time and memory consumption. Under fixed compute budgets, they outperformed Transformers, making them well suited for scalable inference. The approach can serve as a basis for future work on training strong reasoning models, improving distillation methods, and building better reward models. Advances in inference scaling would further broaden their use in AI systems that demand faster and more effective reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Recommended Read: LG AI Research Releases NEXUS: An Advanced System Integrating Agent AI System and Data Compliance Standards to Address Legal Concerns in AI Datasets


