MarkTechPost@AI 19小时前
This AI Paper from ByteDance Introduces MegaScale-Infer: A Disaggregated Expert Parallelism System for Efficient and Scalable MoE-Based LLM Serving
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

ByteDance和北京大学的研究人员推出了MegaScale-Infer,这是一个创新的系统,旨在提高基于Mixture-of-Experts (MoE)的大型语言模型(LLM)的推理效率。该系统通过解耦注意力模块和前馈网络(FFN)模块,并采用定制的并行策略和高效的通信库,解决了MoE模型在推理过程中GPU利用率低下的问题。MegaScale-Infer在多种MoE模型上实现了显著的性能提升,降低了延迟,并优化了资源使用,为大规模LLM的部署提供了更高效、更经济的解决方案。

🧠 MegaScale-Infer的核心在于将MoE模型的注意力模块和FFN模块分离,允许针对每个模块定制扩展和并行策略。注意力模块因其内存密集型特性而被复制以聚合请求,而FFN模块则使用专家并行进行扩展,从而优化了资源利用。

🔄 为了进一步优化性能,MegaScale-Infer采用了乒乓流水线并行策略,将请求批次分解为较小的微批次,在注意力模块和FFN模块之间交替处理,确保每个组件都保持忙碌状态。系统根据计算时间、通信延迟和硬件设置来确定维持高利用率所需的最佳微批次数。

🚀 MegaScale-Infer集成了高性能M2N通信库,该库避免了不必要的GPU到CPU的数据复制,降低了延迟和不稳定性。该库用更高效的发送者-接收者模型取代了传统的All-to-All路由,专为MoE的token调度模式设计。

📈 在实验中,MegaScale-Infer在NVIDIA Ampere GPU上,与vLLM和TensorRT-LLM相比,将每个GPU的解码吞吐量提高了高达2.56倍和1.28倍。在异构集群上,该系统实现了高达3.24倍和1.86倍的每美元吞吐量提升。

Large language models are built on transformer architectures and power applications like chat, code generation, and search, but their growing scale with billions of parameters makes efficient computation increasingly challenging. Scaling such systems while maintaining low latency and high throughput puts pressure on algorithm design and system-level optimization. Effectively serving these models now requires careful orchestration of memory, communication, and compute resources.

A critical challenge in this space is how sparsity, introduced through Mixture-of-Experts (MoE) models, affects inference performance. These models selectively activate a subset of feed-forward networks per input, reducing computational load. However, this selective activation leads to underutilization of hardware. During inference, attention modules become bottlenecks due to frequent memory access to key-value caches, while the FFN modules become idle because each receives a small fraction of tokens. As a result, GPU utilization drops significantly, especially during decoding, creating inefficiencies and inflating operational costs.

While some methods like vLLM and TensorRT-LLM have attempted to address inference scaling through parallelism and optimized kernels, these solutions remain constrained. They process the model holistically, meaning they cannot independently adjust scaling for different components. As MoE models grow in size and sparsity, this approach leads to smaller active batches per expert, weakening the benefits of batching for FFNs. Moreover, tensor and pipeline parallelism approaches add communication overhead, especially across nodes, which becomes a limiting factor in multi-GPU environments.

ByteDance and Peking University researchers have introduced MegaScale-Infer, a system that rethinks the architecture of MoE serving. Instead of serving the model as a monolithic block, the researchers disaggregate the attention and FFN modules, deploying them on separate GPUs. This separation enables customized scaling and parallelism strategies tailored to the specific needs of each module. Attention modules, which are memory-intensive, are replicated to aggregate requests, while FFN modules are scaled using expert parallelism. The system also supports heterogeneous GPU deployment, assigning cost-effective memory-heavy GPUs to attention tasks and compute-optimized GPUs to FFNs. This disaggregation dramatically improves resource usage and flexibility in deployment.

To further optimize performance, MegaScale-Infer employs a ping-pong pipeline parallelism strategy. The idea is to break down batches of requests into smaller micro-batches that alternate between attention and FFN modules, ensuring that neither component sits idle. The system determines the optimal number of micro-batches required to maintain high utilization, considering compute time, communication latency, and hardware setup. For example, if the communication time is less than half the compute time, at least three micro-batches are used. Further, the system integrates a high-performance M2N communication library that avoids unnecessary GPU-to-CPU data copies, reducing latency and instability. This library replaces the traditional All-to-All routing with a more efficient sender-receiver model designed specifically for MoE’s token dispatch pattern.

MegaScale-Infer was tested on multiple large-scale MoE models, including Mixtral 8×22B, DBRX, and a scaled custom model with 317 billion parameters. In experiments on homogeneous setups using NVIDIA Ampere GPUs, MegaScale-Infer improved per-GPU decoding throughput by up to 2.56× compared to vLLM and 1.28× over TensorRT-LLM. The scaled model achieved a 7.11× gain over vLLM and a 1.90× gain over TensorRT-LLM. On heterogeneous clusters with H20 GPUs for attention and L40S for FFNs, the system achieved up to 3.24× and 1.86× higher throughput per dollar than the baselines. Its M2N communication library delivered up to 4.2× higher throughput and 68.2% lower latency than NCCL.

This paper presents a clear problem of underutilized GPUs during MoE inference and offers a practical solution by modularizing the architecture. The proposed disaggregation strategy, combined with micro-batch pipelining and a custom communication protocol, substantially impacts serving efficiency and cost.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

[Register Now] miniCON Virtual Conference on OPEN SOURCE AI: FREE REGISTRATION + Certificate of Attendance + 3 Hour Short Event (April 12, 9 am- 12 pm PST) + Hands on Workshop [Sponsored]

The post This AI Paper from ByteDance Introduces MegaScale-Infer: A Disaggregated Expert Parallelism System for Efficient and Scalable MoE-Based LLM Serving appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

MegaScale-Infer MoE模型 LLM推理 GPU优化 ByteDance
相关文章