MarkTechPost@AI — November 25, 2024
Intel AI Research Releases FastDraft: A Cost-Effective Method for Pre-Training and Aligning Draft Models with Any LLM for Speculative Decoding

Intel has released FastDraft, a framework for training and aligning draft models that accelerates inference for large language models (LLMs). FastDraft uses pre-training and fine-tuning to improve the alignment between a draft model and its target LLM, significantly boosting inference efficiency without degrading output quality. The framework delivers notable gains on tasks such as code generation and text summarization: a Phi-3-mini draft model achieved up to a 3x memory-bound speedup on code tasks, and a Llama-3.1-8B draft model achieved a 2x speedup on text summarization. FastDraft makes deploying LLMs on edge devices considerably more practical and opens a new direction for natural language processing.

🤔 **FastDraft is designed to accelerate large language model (LLM) inference.** It does so by training and aligning draft models that predict the target LLM's output sequences, reducing inference time and memory consumption.

🚀 **FastDraft combines pre-training and fine-tuning to align the draft model with the target LLM.** Pre-training exposes the draft model to large volumes of natural-language and code data; fine-tuning then uses sequence-level knowledge distillation to bring the draft model's outputs closer to the target LLM's.

📊 **FastDraft delivers significant performance gains across multiple tasks.** For example, the Phi-3-mini draft model achieved up to a 3x memory-bound speedup on code tasks, and the Llama-3.1-8B draft model achieved a 2x speedup on text summarization.

⏱️ **FastDraft is highly resource-efficient.** Training a draft model takes less than 24 hours on a single server equipped with 8 Intel® Gaudi® 2 accelerators, making it well suited to resource-constrained environments.

💡 **FastDraft makes deploying LLMs on edge devices more practical.** It directly addresses the inference-speed and resource-consumption challenges of LLMs, laying a foundation for further progress in natural language processing.

Transformer architectures have revolutionized Natural Language Processing (NLP), enabling significant progress in language understanding and generation. Large Language Models (LLMs), which rely on these architectures, have achieved remarkable performance across applications such as conversational systems, content creation, and summarization. However, the efficiency of LLMs in real-world deployment remains a challenge due to their substantial resource demands, particularly in tasks requiring sequential token generation.

A critical issue with LLMs lies in their inference speed, which is constrained by the high memory bandwidth requirements and sequential nature of auto-regressive generation (ARG). These limitations prevent LLMs from being effectively used in time-sensitive applications or on devices with limited computational capacity, such as personal computers or smartphones. As users increasingly demand real-time processing and responsiveness, addressing these bottlenecks has become a priority for researchers and industry practitioners.

One promising solution is Speculative Decoding (SD), a method designed to accelerate LLM inference without compromising generated output quality. SD employs draft models to predict token sequences, which the target model validates in parallel. Despite its potential, the adoption of SD has been hindered by the scarcity of efficient draft models. These models must align with the target LLM’s vocabulary and achieve high acceptance rates, a challenging requirement given the incompatibility issues in existing approaches.
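To make the SD mechanism concrete, here is a minimal, greedy-acceptance sketch of a single speculative-decoding step in Python. It assumes two hypothetical causal LMs with a Hugging Face-style `.logits` interface and batch size 1, and it illustrates the general technique rather than Intel's implementation; the original SD algorithm accepts or rejects drafts via rejection sampling over probabilities, whereas this sketch simply checks greedy agreement.

```python
# Minimal greedy-acceptance sketch of one speculative-decoding step.
# `draft_model` and `target_model` are hypothetical causal LMs that
# return an object with a `.logits` tensor of shape [batch, seq, vocab].
import torch

def speculative_step(draft_model, target_model, prompt_ids, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them with ONE target pass."""
    # 1) Draft model proposes gamma tokens auto-regressively (cheap).
    draft_ids = prompt_ids.clone()
    for _ in range(gamma):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2) Target model scores prompt + all drafted tokens in one forward pass.
    target_logits = target_model(draft_ids).logits

    # 3) Accept drafted tokens until the first disagreement with the target.
    accepted = prompt_ids
    n_prompt = prompt_ids.shape[-1]
    for i in range(gamma):
        target_choice = target_logits[:, n_prompt - 1 + i, :].argmax(dim=-1, keepdim=True)
        drafted = draft_ids[:, n_prompt + i].unsqueeze(-1)
        if torch.equal(target_choice, drafted):
            accepted = torch.cat([accepted, drafted], dim=-1)
        else:
            # On mismatch, keep the target's own token and stop verifying.
            accepted = torch.cat([accepted, target_choice], dim=-1)
            break
    else:
        # All gamma drafts accepted: the target's next token comes for free.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```

The key property is that the target model runs once per step regardless of how many draft tokens are accepted, so a well-aligned draft converts several sequential target passes into one.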

Researchers at Intel Labs introduced FastDraft, an efficient framework for training and aligning draft models compatible with various target LLMs, including Phi-3-mini and Llama-3.1-8B. FastDraft stands out by employing a structured approach to pre-training and fine-tuning. Pre-training focuses on datasets containing up to 10 billion tokens of natural language and code, while fine-tuning uses sequence-level knowledge distillation to improve draft-target alignment. This process ensures that the draft models perform well across diverse tasks.

FastDraft’s architecture imposes minimal requirements, allowing for flexibility in model design while ensuring compatibility with the target LLM’s vocabulary. During pre-training, the draft model predicts the next token in a sequence, using datasets like FineWeb for natural language and The Stack v2 for code. The alignment phase employs synthetic datasets generated by the target model, refining the draft model’s ability to mimic the target model’s behavior. These techniques ensure that the draft model maintains high efficiency and accuracy.
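The alignment phase can be pictured as sequence-level knowledge distillation: the target model generates continuations, and the draft model is then fine-tuned on those continuations with an ordinary next-token loss. Below is a minimal sketch under that reading, using standard Hugging Face APIs; the model paths and the `build_distillation_example` helper are illustrative assumptions, not FastDraft's released training code.

```python
# Sketch of sequence-level knowledge distillation for draft/target alignment.
# Model paths and helper names are hypothetical; only the standard
# transformers APIs (from_pretrained, generate) are real.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")  # shared vocab
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
draft = AutoModelForCausalLM.from_pretrained("path/to/pretrained-draft")

def build_distillation_example(prompt: str, max_new_tokens: int = 128):
    """Let the target model write the continuation; the draft then trains
    on (prompt, target_continuation) pairs with standard causal-LM loss."""
    inputs = tok(prompt, return_tensors="pt")
    generated = target.generate(**inputs, max_new_tokens=max_new_tokens)
    full_ids = generated[0]
    labels = full_ids.clone()
    labels[: inputs["input_ids"].shape[-1]] = -100  # supervise only the continuation
    return {"input_ids": full_ids, "labels": labels}

# The draft model is then fine-tuned on these synthetic examples exactly
# like any causal-LM dataset, teaching it to mimic the target's outputs.
```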

The performance improvements achieved by FastDraft are significant. For instance, the Phi-3-mini draft, trained on 10 billion tokens, achieved a 67% acceptance rate with up to a 3x memory-bound speedup in code tasks. Similarly, the Llama-3.1-8B draft model demonstrated a 2x speedup in summarization and text completion tasks. FastDraft enabled these draft models to be trained on a single server equipped with 8 Intel® Gaudi® 2 accelerators in less than 24 hours. This efficiency makes FastDraft particularly suitable for resource-constrained environments.
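These numbers can be sanity-checked with the standard speculative-decoding expectation: with per-token acceptance rate α and γ drafted tokens per step, each target pass yields on average (1 − α^(γ+1)) / (1 − α) tokens. The short sketch below plugs in the reported 67% acceptance rate; the draft length γ and the draft-to-target cost ratio `c` are assumed values for illustration, not figures from the paper.

```python
# Back-of-the-envelope speedup estimate for speculative decoding, using the
# expected-accepted-tokens formula from the original SD analysis. The values
# of `gamma` and `c` below are assumptions, not numbers from the paper.
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """alpha: per-token acceptance rate; gamma: drafted tokens per step;
    c: cost of one draft forward pass relative to one target pass."""
    expected_tokens = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cost_per_step = 1 + gamma * c  # one target pass + gamma draft passes
    return expected_tokens / cost_per_step

# With the paper's ~67% acceptance rate, gamma = 4 drafts per step, and an
# assumed draft that is ~10x cheaper than the target:
print(expected_speedup(alpha=0.67, gamma=4, c=0.1))  # ≈ 1.9x
```

The estimate of roughly 1.9x under these assumptions is consistent with the ~2x end-to-end speedups reported for the Llama-3.1-8B draft.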

The research also offers valuable insights for future work on draft-model training.

In conclusion, FastDraft addresses the critical limitations of LLM inference by introducing a scalable, resource-efficient framework for training draft models. Its innovative methods of pre-training and alignment significantly enhance performance metrics, making it a practical solution for deploying LLMs on edge devices. FastDraft lays a strong foundation for future developments in NLP technology by demonstrating substantial improvements in inference speed and resource efficiency.


Check out the Paper, Model on Hugging Face, and Code on the GitHub Page. All credit for this research goes to the researchers of this project.
