Large Model Systems Organization
Deploying DeepSeek on GB200 NVL72 with PD and Large Scale EP (Part I): 2.7x Higher Decoding Throughput

This post shares early results from running DeepSeek 671B on the GB200 NVL72, focusing on prefill-decode disaggregation and large-scale expert parallelism. By leveraging Blackwell-specific features to enhance existing components, SGLang reaches 7,583 tokens per second per GPU for decoding on the GB200 NVL72, 2.7x faster than the H100 (for a 2,000-token input length). Across various sequence lengths, the GB200 NVL72 is 2.5-3.4x faster than the H100, thanks to higher memory bandwidth and compute, larger memory capacity, and lower NVLink latency. Future work includes optimizing the prefill stage, latency-oriented optimizations, and further kernel optimization.

🚀 The GB200 NVL72 gains come from components optimized for the Blackwell architecture: Blackwell DeepGEMM, DeepEP, FlashInfer Blackwell FMHA, Blackwell CUTLASS MLA, and Blackwell Mooncake.

💡 Compared to the H100, the GB200 NVL72 achieves a 2.5-3.4x decoding speedup, mainly thanks to higher memory bandwidth and compute, larger memory capacity, and lower NVLink latency, which together allow larger batch sizes.

🔬 Experiments show that larger batch sizes increase throughput, and the GB200 NVL72 also outperforms the H100 at the same batch size. Prefill-decode disaggregation and large-scale expert parallelism contribute substantially as well.

🔮 Future work will focus on optimizing the prefill stage, latency-oriented optimizations, kernel optimizations, communication overlap, and multi-token prediction (MTP).

The GB200 NVL72 is the world's most advanced hardware for AI training and inference. In this blog post, we're excited to share early results from running DeepSeek 671B with prefill-decode disaggregation and large-scale expert parallelism on the GB200 NVL72. By leveraging Blackwell-specific features to enhance existing components, SGLang achieves 7,583 tokens per second per GPU for decoding on the GB200 NVL72, a 2.7x per-GPU speedup over the H100 (link) at a 2,000-token input length. Performance is expected to improve further with ongoing optimizations. You can find reproduction instructions here.

Methods

As preliminary work, we integrated the following components into SGLang:

    Blackwell DeepGEMM: A high-performance General Matrix Multiplication (GEMM) library tailored for FP8 precision, rewritten to fully exploit the Blackwell architecture. Quantization and packing of input scales are introduced in the new API, and the newly introduced UMMA feature is used for fast matrix multiplication (a conceptual sketch of such input-scale quantization follows this list).
    Blackwell DeepEP: A communication library designed to shuffle tokens to routed experts in Mixture of Experts (MoE). The new NVLink-only environment is supported by mapping remote GPU memory into the local virtual address space, and we also improved DeepEP performance by roughly 15% (see the toy dispatch/combine sketch after this list).
    FlashInfer Blackwell FMHA: A high-performance Fused Multi-Head Attention (FMHA) kernel for DeepSeek prefilling, rewritten to support the Blackwell architecture.
    Blackwell CUTLASS MLA: A Multi-Head Latent Attention (MLA) kernel optimized for the Blackwell architecture. It leverages the new UMMA feature and enables 2-SM cluster mode for TMA, reducing L2 read traffic on the KV cache.
    Blackwell Mooncake: A transfer engine used for Key-Value (KV) cache transfer in prefill-decode disaggregation. It employs techniques similar to DeepEP's to support NVLink.
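To make the input-scale idea concrete, here is a minimal sketch, assuming a per-128-channel FP8 quantization scheme of the kind a DeepSeek-style FP8 GEMM consumes; the helper name and block size are illustrative and are not the DeepGEMM API.

```python
# Minimal sketch of per-128-channel FP8 activation quantization; quantize_per_block
# is a hypothetical helper, not part of DeepGEMM's API.
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn

def quantize_per_block(x: torch.Tensor, block: int = 128):
    """Quantize [M, K] activations to FP8 with one scale per 128-wide channel block."""
    m, k = x.shape
    assert k % block == 0
    xb = x.view(m, k // block, block).float()
    scales = xb.abs().amax(dim=-1).clamp(min=1e-4) / FP8_MAX        # [M, K/block]
    x_fp8 = (xb / scales.unsqueeze(-1)).to(torch.float8_e4m3fn).view(m, k)
    return x_fp8, scales  # the GEMM kernel rescales partial sums with these

x_fp8, scales = quantize_per_block(torch.randn(4, 256, dtype=torch.bfloat16))
print(x_fp8.shape, scales.shape)  # torch.Size([4, 256]) torch.Size([4, 2])
```

Similarly, the token shuffling DeepEP performs for routed experts can be illustrated with a toy, single-process dispatch/combine pair; this is a conceptual sketch only and does not reflect DeepEP's actual NVLink-based implementation.

```python
# Toy single-process illustration of EP dispatch/combine: group tokens by routed
# expert so each expert sees a contiguous batch, then restore the original order.
import torch

def dispatch(tokens: torch.Tensor, expert_ids: torch.Tensor, num_experts: int):
    """tokens: [N, H]; expert_ids: [N] top-1 routed expert per token (for brevity)."""
    order = torch.argsort(expert_ids)                       # group tokens by expert
    counts = torch.bincount(expert_ids, minlength=num_experts)
    return tokens[order], counts, order

def combine(expert_out: torch.Tensor, order: torch.Tensor):
    out = torch.empty_like(expert_out)
    out[order] = expert_out                                 # scatter back to original order
    return out

x, eid = torch.randn(8, 16), torch.randint(0, 4, (8,))
xd, counts, order = dispatch(x, eid, num_experts=4)
y = combine(xd * 2.0, order)                                # pretend each expert doubles its tokens
assert torch.allclose(y, x * 2.0)
```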

Experiments

End-to-end Performance

To assess the decode performance of DeepSeek on the GB200 NVL72, we conducted end-to-end experiments and compared against the H100. Since we are provided 14 nodes of a GB200 NVL72, we use 12 nodes for decode and the remaining two for prefill. This roughly mimics a real-world deployment in which users with 18 nodes dedicate 6 to prefill and 12 to decode. To ensure consistency, the experimental setup mirrors that of our previous blog post, and the baselines are reused directly from it. The slow-down API from the previous blog is also used to keep the decode nodes saturated. The yellow baseline bar indicates simulated MTP from the previous blog; the other bars do not enable MTP, so the speedup would be even higher than displayed under an identical setup.
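As a back-of-the-envelope sense of scale, the aggregate decode throughput implied by the per-GPU number can be estimated, assuming 4 Blackwell GPUs per GB200 NVL72 compute node (18 nodes x 4 GPUs = 72 GPUs in a full rack); this is illustrative arithmetic, not a measured figure.

```python
# Rough aggregate decode throughput implied by the reported per-GPU number,
# assuming 4 Blackwell GPUs per GB200 NVL72 compute node.
per_gpu_tps = 7583                      # tokens/s/GPU reported for 2,000-token inputs
decode_nodes, gpus_per_node = 12, 4     # 12 of the 14 available nodes serve decode
decode_gpus = decode_nodes * gpus_per_node
print(f"{decode_gpus} decode GPUs -> ~{per_gpu_tps * decode_gpus:,} output tokens/s")
# 48 decode GPUs -> ~363,984 output tokens/s
```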

The results demonstrate a 2.5-3.4x performance speedup across various sequence lengths on the GB200 NVL72 compared to the H100, driven by several key factors:

    Enhanced Memory Bandwidth and Computation Power: The GB200 NVL72 offers higher memory bandwidth and more compute FLOPS than the H100, accelerating kernel execution.
    Larger Memory Capacity to Allow Larger Batch Sizes: The increased memory capacity allows a larger KV cache, supporting larger batch sizes and thus improving kernel efficiency. Combined with the kernel speedups mentioned above, the system is still able to meet Inter-Token Latency (ITL) requirements similar to the H100's (a rough illustration follows this list).
    Larger NVLink Domain: Unlike the H100 cluster, which relies on RDMA for cross-node communication, the pure-NVLink solution in the GB200 NVL72 significantly reduces communication latency. This allows two-batch overlap to be disabled, which both speeds up kernels and avoids the waste incurred when the overlapped communication takes longer than the computation.
    PD and Large-scale EP: Compared to a vanilla TP16 baseline, PD disaggregation enables flexible decoupling of the prefill and decode stages, optimizing resource utilization. Large-scale EP enhances MoE performance by reducing memory access pressure.
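A rough illustration of the batch-size point above: the decode batch is bounded by how many tokens of KV cache fit in the HBM left over after weights and activations. The per-token KV footprint and memory headroom below are assumed, illustrative values, not measurements from this post.

```python
# Illustrative only: how KV-cache headroom bounds the decode batch size.
GIB = 1024**3

def max_concurrent_tokens(free_hbm_gib: float, kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit in the HBM left after weights and activations."""
    return int(free_hbm_gib * GIB // kv_bytes_per_token)

kv_bytes_per_token = 70 * 1024          # assumed per-token KV footprint for an MLA model
for free_gib in (40, 120):              # smaller vs. larger memory headroom per GPU
    tokens = max_concurrent_tokens(free_gib, kv_bytes_per_token)
    print(f"{free_gib} GiB free -> ~{tokens:,} cached tokens -> batch ~{tokens // 2100} at 2,000-in/100-out")
```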

Ablation Study: Batch Size

To better understand the impact of batch size on the system, we conducted an ablation study across a range of batch sizes, although the system has not been optimized for small batch sizes. The input and output lengths are set to 2,000 and 100 tokens, respectively. The results show that larger batch sizes boost throughput, and that the GB200 NVL72 is also faster than the H100 at the same batch size.

Future Work

Our preliminary results already demonstrate a 2.5-3.4x speedup, but there is still significant potential for further improvements:

    Other Hardware and Parallelism Configurations: Execution without large-scale EP, such as small batch sizes on a single node, which is common on hardware like the B200 and RTX 5090, has not been optimized yet, so performance there is expected to be highly suboptimal.
    Prefill Performance Enhancements: While our current focus has been decode performance, the next phase will prioritize optimizing the prefill stage.
    Latency-oriented Optimizations: While this blog focuses on throughput, minimizing latency is a future direction.
    Kernel Optimizations: Many kernels have yet to fully saturate the GB200's memory bandwidth or computational capabilities.
    Communication Overlap: Given the change in communication hardware in the GB200 NVL72, communication can be overlapped with computation, using techniques similar to or different from those used on the H100, to further reduce latency and improve throughput.
    Multi-Token Prediction (MTP): Predicting multiple tokens in one forward pass is beneficial, especially when the batch size is too small for kernels to reach full performance (see the rough arithmetic after this list).
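To give a rough sense of the MTP upside mentioned above, the expected number of tokens emitted per forward pass can be estimated under a simple chain-acceptance model; the draft length and acceptance rate below are assumptions for illustration, not measured values.

```python
# Illustrative expected tokens per decode step with MTP, assuming each successive
# draft token is accepted with probability `accept_rate` (chain acceptance).
def tokens_per_step(num_draft: int, accept_rate: float) -> float:
    return 1.0 + sum(accept_rate ** i for i in range(1, num_draft + 1))

baseline = tokens_per_step(0, 0.0)       # vanilla decoding: 1 token per forward pass
with_mtp = tokens_per_step(2, 0.7)       # e.g. 2 draft tokens, 70% acceptance -> ~2.19
print(f"expected per-step speedup when memory-bound: ~{with_mtp / baseline:.2f}x")
```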

Acknowledgement

We would like to express our heartfelt gratitude to the following teams and collaborators:

SGLang Core Team and Community Contributors — Jingyi Chen, Baizhou Zhang, Jiexin Liang, Qiaolin Yu, Yineng Zhang, Ke Bao, Liangsheng Yin and many others.

FlashInfer Team — Zihao Ye, Yong Wu, Yaxing Cai — for Blackwell FMHA kernel optimizations.

Mooncake Team — Shangming Cai, Feng Ren, Teng Ma, Mingxing Zhang, and colleagues — for their collaboration on PD disaggregation in SGLang.

NVIDIA Team — including members from Hardware (Juan Yu), DevTech (Yingcan Wei, Shifang Xu, Hui Wang, Kai Sun), DGX Cloud (Paul Abruzzo, Mathew Wicks, Lee Ditiangkin, Carl Nygard) and Enterprise Product (Pen Li, Trevor Morris, Elfie Guo, Kaixi Hou, Kushan Ahmadian, Pavani Majety) — for their contributions to Blackwell DeepGEMM and DeepEP, support with GB200 NVL72 operations and DeepSeek kernel optimizations.

Dynamo Team — Kyle Kranen, Vikram Sharma Mailthody, and colleagues - for extra support on PD disaggregation in SGLang.
