MarkTechPost@AI · February 15
Layer Parallelism: Enhancing LLM Inference Efficiency Through Parallel Execution of Transformer Layers

This article covers "Layer Parallelism" (LP), a new method proposed by researchers from the University of Geneva, EPFL, and Meta FAIR that speeds up large language model (LLM) inference by executing pairs of Transformer layers in parallel, without any retraining. By modifying the computational graph, the method runs layer pairs concurrently, improving inference speed without significantly degrading model performance. Experiments show substantial speed-ups on both Llama2 7B and Llama3.2 3B, and fine-tuning further recovers the performance loss, offering an efficient solution for deploying large-scale LLMs.

💡 The researchers propose Layer Parallelism (LP), which executes pairs of Transformer layers in parallel to speed up LLM inference without retraining, retaining 95%-99% accuracy on perplexity and In-Context Learning (ICL) benchmarks.

🛠️ The method modifies the computational graph so that consecutive layers run in parallel, reducing model depth while preserving performance; it uses tensor parallelism to distribute computation across GPUs for efficiency and adjusts layer normalization to maintain stability.

📊 Experiments show that applying LP to Llama2 7B and Llama3.2 3B reduces model depth by 21% and 18% and delivers speed-ups of 1.29× and 1.22×, respectively, with fine-tuning recovering 10.8% of the lost accuracy.

🚀 The conclusion is that Transformer layers do not necessarily have to be processed sequentially and that selective parallelization is viable; LP makes LLMs more efficient in production, and future work will explore optimal layer grouping, interaction with quantization, and deeper theoretical insight into layer independence and computational efficiency.

LLMs have demonstrated exceptional capabilities, but their substantial computational demands pose significant challenges for large-scale deployment. While previous studies indicate that intermediate layers in deep neural networks can be reordered or removed without severely impacting performance, these insights have not been systematically leveraged to reduce inference costs. Given the rapid expansion of LLMs, which often contain hundreds of billions of parameters, optimizing inference is critical for improving efficiency, cutting latency, and lowering operational expenses. High-traffic applications relying on cloud-based LLM inference can incur monthly costs in the millions, making efficiency-driven solutions essential. Furthermore, deploying these models on resource-constrained devices requires strategies that maintain performance while minimizing computational overhead. Although modern transformers are architecturally similar to deep residual networks, where layer depth is sometimes partially redundant, these redundancies have not yet been exploited to fully optimize inference efficiency.

Several approaches exist for improving the computational efficiency of LLMs, including pruning, quantization, and parallelization. Pruning eliminates redundant parameters to introduce sparsity, improving memory utilization and processing speed. Quantization, in turn, reduces precision by converting floating-point computations to lower-bit integer formats such as INT8 or INT4, improving hardware efficiency and energy savings. Additionally, parallelization techniques such as tensor and pipeline parallelism distribute workloads across multiple processing units to accelerate inference while managing communication overhead. Recent work has also explored architectural modifications at the layer level, including layer fusion and dynamic recurrent execution, to streamline computational graphs. However, fusing consecutive layers through tensor parallelism has not yet been explored, leaving an open avenue for further optimizing inference.
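
To illustrate the tensor-parallel building block referenced here, the following is a minimal single-process sketch (not tied to the paper or any specific framework's API) that splits a feed-forward weight matrix column-wise into two shards and concatenates the partial outputs; a real deployment would place each shard on its own GPU and gather the results with collective communication.

```python
# Minimal single-process sketch of column-wise tensor parallelism for a linear
# layer: split the weight into two shards, compute partial outputs, concatenate.
# Real systems put each shard on a separate GPU and use NCCL-style collectives.
import torch

torch.manual_seed(0)
d_model, d_ff, batch = 16, 64, 4
x = torch.randn(batch, d_model)
w = torch.randn(d_ff, d_model)      # full weight of a feed-forward projection

w0, w1 = w.chunk(2, dim=0)          # shard the output dimension across two "devices"
y0 = x @ w0.T                       # partial result on shard 0
y1 = x @ w1.T                       # partial result on shard 1
y_parallel = torch.cat([y0, y1], dim=-1)

y_reference = x @ w.T               # single-device reference
print(torch.allclose(y_parallel, y_reference, atol=1e-5))  # True
```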

Researchers from the University of Geneva, EPFL, and Meta FAIR propose a method to reduce the depth of pre-trained LLMs while preserving performance. Modifying the computational graph enables parallel execution of grouped layer pairs, improving inference speed by approximately 1.20× without requiring retraining. Their approach maintains 95%-99% accuracy across perplexity and In-Context Learning (ICL) benchmarks. Additionally, fine-tuning helps recover minor performance losses. This method significantly enhances efficiency for large-scale LLM deployment, demonstrating that structural transformations, such as layer merging and reordering, can optimize computational workload while sustaining model effectiveness.

The study examines the effective depth of LLMs by applying transformations such as shuffling, merging, and pruning layers. Results indicate weak dependencies between intermediary layers, enabling certain layers to be reordered or parallelized with minimal perplexity loss. Running contiguous layers in parallel reduces depth while preserving performance, highlighting layer independence. Further, Layer Parallelism distributes computations across GPUs, optimizing efficiency through tensor parallelism. Modifications to attention and feed-forward networks ensure effective parallel execution. Adjustments to layer normalization help maintain stability. These findings suggest that transformer models can leverage parallelism to enhance computational efficiency without requiring substantial architectural modifications.
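
To make the transformation concrete, here is a toy sketch of the layer-pair idea, assuming (as a simplification of the paper's method) that fusing two consecutive pre-norm residual blocks amounts to feeding both the same input and summing their residual updates; the `ToyBlock` below stands in for a full decoder layer, whereas the actual method also handles attention and the layer-normalization adjustments described above.

```python
# Conceptual sketch of layer-pair parallelism (simplified; not the authors' code).
# Two consecutive residual blocks normally run sequentially; the parallel form
# feeds both the same input and sums their residual updates, so the two branches
# can execute concurrently (e.g., one per GPU via tensor parallelism).
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one pre-norm transformer layer: x + f(norm(x))."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def residual(self, x: torch.Tensor) -> torch.Tensor:
        return self.ff(self.norm(x))

def sequential_pair(x, block_a, block_b):
    # Standard execution: block_b sees the output of block_a.
    x = x + block_a.residual(x)
    return x + block_b.residual(x)

def parallel_pair(x, block_a, block_b):
    # Layer-parallel approximation: both blocks see the same input.
    return x + block_a.residual(x) + block_b.residual(x)

x = torch.randn(2, 8, 32)
a, b = ToyBlock(32), ToyBlock(32)
# The two forms differ slightly, which is why some accuracy is lost and
# fine-tuning helps recover it.
print((sequential_pair(x, a, b) - parallel_pair(x, a, b)).abs().mean())
```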

The study evaluates Layer Parallelism in terms of inference speed, ICL accuracy, and fine-tuning for performance recovery. Experiments use Llama2 7B and Llama3.2 3B on dual A100 GPUs. Layer Parallelism is applied to the merged layers, with Tensor Parallelism used elsewhere. Results show that parallelizing beyond 14 layers for Llama2 7B and 10 for Llama3.2 3B causes ICL accuracy to decline. Speed improves proportionally, reaching a 1.38× boost under aggressive parallelism. Fine-tuning the parallelized layers on RedPajama data significantly restores accuracy, improving MMLU from 83.6% to 94.4% while maintaining the speed gains, demonstrating the viability of Layer Parallelism with targeted adjustments.
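
The recovery step implies that only the fused layers are updated while the rest of the model stays frozen. A hypothetical sketch of that pattern is shown below; `model` is assumed to be a Llama-style checkpoint whose parameters are named `model.layers.<i>...`, and `fused_layer_ids` is an assumed list of merged layer indices, neither of which comes from the paper.

```python
# Hypothetical selective fine-tuning sketch (not the authors' training code):
# freeze everything except the parameters of the fused (parallelized) layers.
import torch

def freeze_except_fused(model, fused_layer_ids):
    for name, param in model.named_parameters():
        # Parameter names like "model.layers.14.self_attn..." are typical for
        # Llama-style checkpoints; adjust the match for other architectures.
        param.requires_grad = any(f"layers.{i}." in name for i in fused_layer_ids)
    return [p for p in model.parameters() if p.requires_grad]

# trainable = freeze_except_fused(model, fused_layer_ids=[14, 15, 16, 17])
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
# ...then run a standard causal-LM training loop on a corpus such as RedPajama.
```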

In conclusion, the study introduces Layer Parallelism (LP), which restructures transformer computation by executing layer pairs in parallel, improving inference speed without retraining. Applied to Llama2 7B and Llama3.2 3B, LP reduced model depth by 21% and 18%, yielding speed-ups of 1.29x and 1.22x, respectively. Fine-tuning recovered 10.8% of lost accuracy, proving its effectiveness. These findings challenge the notion that transformer layers must process sequentially, suggesting selective parallelization is viable. LP enhances LLM efficiency in production, with future work exploring optimal layer grouping, interactions with quantization, and deeper theoretical insights into layer independence and computational efficiency.
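
As a back-of-envelope check (not from the paper), if per-layer latency is roughly uniform and each fused pair executes fully concurrently, the expected speed-up from a given depth reduction is

$$\text{speed-up} \approx \frac{1}{1 - \text{depth reduction}}, \qquad \frac{1}{1 - 0.21} \approx 1.27 \;(\text{reported } 1.29\times), \qquad \frac{1}{1 - 0.18} \approx 1.22 \;(\text{reported } 1.22\times),$$

which matches the measured figures to within a few percent, consistent with the gains coming directly from the reduced effective depth.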


Check out the Paper. All credit for this research goes to the researchers of this project.


