Large Model Systems Organization · July 18, 06:19
Accelerating SGLang with Multiple Token Prediction

SGLang has launched the industry's first open-source inference framework to combine Multiple Token Prediction (MTP) with large-scale Expert Parallelism (EP) and Prefill-Decode disaggregation. Through a new decoding paradigm, better parallelism strategies, and more efficient resource utilization, SGLang achieves up to 60% higher output throughput without sacrificing generation quality. The framework now ships MTP as a plug-and-play feature, bringing a significant performance leap to models such as DeepSeek V3 and enabling AI practitioners to easily scale inference and unlock the reasoning potential of advanced language models at peak performance.

✨ The SGLang framework integrates Multiple Token Prediction (MTP), an advanced speculative decoding method that accelerates text generation by predicting multiple future tokens with a lightweight draft model and verifying them in parallel with the full model. Compared with traditional token-by-token generation, MTP turns serial decode steps into parallel verification, significantly improving GPU utilization and overall inference throughput.

🚀 On performance, with MTP enabled SGLang achieves up to 60% higher output throughput for DeepSeek V3 in a small-scale deployment (from a baseline of 51 tokens/sec to 82 tokens/sec), with an average acceptance length of 2.44 tokens. Even in a large-scale cluster deployment, MTP still delivers a 14.2% throughput increase, demonstrating its effectiveness across different scales and workloads.

💡 MTP achieves its speedup by increasing the "acceptance length" (the number of tokens accepted per verification pass) without affecting generation quality or determinism: every predicted token is verified by the full model, so the output is identical to standard decoding. SGLang also integrates data-parallel attention, expert-parallel load balancing, prefill-decode disaggregation, and other advanced features that further optimize LLM inference performance.

🛠️ SGLang offers best-practice guidance for MTP: setting `draft_token_num` to 2 or 3 is generally recommended to balance performance against system load, and users can tune this parameter based on available GPU headroom and the observed token acceptance rate. Going forward, SGLang will continue to optimize large-scale deployment performance and plans to integrate MTP with overlap scheduling for further gains.

TL;DR

SGLang is the first and only open-source serving framework to support Multiple Token Prediction (MTP) in combination with Large-Scale Expert Parallelism (EP) and Prefill-Decode disaggregation. This integration delivers up to 60% higher output throughput through a new decoding paradigm, better parallelism, and more efficient resource utilization, without sacrificing generation quality. If you are serving models such as DeepSeek V3, SGLang now supports MTP as a plug-and-play feature, unlocking immediate performance gains. You can find instructions for reproduction here.

SGLang’s inference framework running on NVIDIA GPUs enables AI practitioners to easily deliver inference at scale, empowering end users to “think smart” and harness the reasoning capabilities of state-of-the-art language models at the highest performance.

Introduction

While large language models continue to grow in capability, their token-by-token decoding process remains fundamentally sequential, creating a critical bottleneck for inference throughput. This limitation becomes especially apparent in high-demand applications, where maximizing GPU utilization is crucial for achieving high performance and cost-efficient deployment.

To address this, SGLang brings Multiple Token Prediction (MTP), an advanced speculative decoding technique, to the open-source inference ecosystem: it accelerates generation by predicting multiple draft tokens with a lightweight draft model and verifying them in parallel using a single pass of the full model. In our benchmarks, MTP unlocks up to 60% higher output throughput for DeepSeek V3 without any loss in generation quality. With MTP now fully integrated, SGLang continues to push the frontier of open-source LLM serving, offering advanced decoding capabilities previously confined to proprietary systems, and making them accessible and production-ready.

What is Multiple Token Prediction (MTP)?

Traditional autoregressive decoding generates one token at a time, depending on all previous tokens. This serial process limits parallelism and speed.

MTP is a form of speculative decoding that accelerates generation by using a lightweight draft model to rapidly propose multiple future tokens, which are then verified in parallel by the full target model in a single pass.

MTP works by dividing the generation into two stages:

- Drafting: The lightweight draft model predicts one or more short candidate sequences of n tokens using a few fast draft-model passes. Here we use a single candidate sequence as an example.
(1) “Today is a sunny” is the current prefix produced by the target model.
(2) “day” is first generated by the target model's extend/prefill stage.
(3) “and” is the first draft token, generated by the draft model's extend/prefill stage.
(4) “it’s so hot” are the three extra draft tokens generated by the draft model's decoding iterations; in this example, n = 4 for “and it’s so hot”.

- Verification: The full target model then verifies all n draft tokens in parallel, accepting the longest prefix that matches its own output and resampling the rest if needed.
Let’s walk through an example with n = 4:

    The target model first generates the initial token after the extend/prefill stage: “day”.
    The draft model then speculates the next token after its extend/prefill pass and 3 more tokens after autoregressive decoding: “and it’s so hot”.
    The target model verifies the full sequence:
    → it agrees with “and it’s”
    → it rejects “so hot”, and instead resamples “very”
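
To make the two stages concrete, below is a minimal, framework-agnostic Python sketch of one draft-and-verify round. It illustrates the general speculative-decoding recipe rather than SGLang's actual implementation; `draft_step` and `target_verify` are hypothetical stand-ins for the draft and target models.

    # Toy sketch of one draft-and-verify round (illustration only, not SGLang internals).
    def speculative_round(prefix, draft_step, target_verify, n=4):
        """Propose n draft tokens, then keep the longest prefix the target agrees with."""
        # Drafting: the lightweight draft model proposes n tokens autoregressively.
        ctx, draft = list(prefix), []
        for _ in range(n):
            tok = draft_step(ctx)   # one cheap draft-model step
            draft.append(tok)
            ctx.append(tok)

        # Verification: in a real system this is ONE batched forward pass of the
        # target model scoring every draft position at once; here it simply returns
        # the target's own choice for each of the n positions.
        target_choices = target_verify(prefix, draft)

        # Accept draft tokens while they match, then take the target's token at the
        # first mismatch (the "resample") and stop.
        accepted = []
        for d, t in zip(draft, target_choices):
            if d == t:
                accepted.append(d)
            else:
                accepted.append(t)
                break
        return accepted

    if __name__ == "__main__":
        prefix = ["Today", "is", "a", "sunny", "day"]   # "day" came from the target's prefill
        drafted = ["and", "it's", "so", "hot"]          # what the draft model will propose
        draft_step = lambda ctx: drafted[len(ctx) - len(prefix)]
        target_verify = lambda p, d: ["and", "it's", "very", "warm"]  # target's choices
        print(speculative_round(prefix, draft_step, target_verify))
        # -> ['and', "it's", 'very']: three tokens from a single verification pass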

Why MTP is Fast

The key to MTP’s speedup is parallelism. Crucially, the verification step is fully parallelized on GPUs, replacing n sequential decode steps with a single parallel verification pass.

The effectiveness of MTP depends on how many draft tokens are accepted per verification step, a metric known as the acceptance length. For instance, an average acceptance length of 2.4 means that each verification pass yields 2.4 tokens on average, instead of the single token produced by a standard decode step, resulting in a substantial cumulative speedup over long sequences.
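
As a rough back-of-the-envelope model (a simplification for intuition, not SGLang's actual cost model): if drafting and verification were free, the decode speedup would be bounded above by the average acceptance length; real overheads are why an acceptance length of 2.44 translates into roughly a 1.6x throughput gain in the small-scale benchmark below rather than 2.44x.

    # Back-of-the-envelope model (an assumption for intuition, not SGLang's cost model):
    # each verification round yields `acceptance_length` tokens but costs one target-model
    # pass plus extra drafting/verification work, expressed here as a relative overhead.
    def estimated_speedup(acceptance_length, relative_overhead=0.0):
        return acceptance_length / (1.0 + relative_overhead)

    print(estimated_speedup(2.44))       # ~2.44x: ideal upper bound with free drafting
    print(estimated_speedup(2.44, 0.5))  # ~1.63x: with 50% extra cost per round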

MTP does not compromise on generation quality or determinism. Every speculative token is still verified and approved by the same full model, ensuring identical outputs to standard decoding without any approximation or fine-tuning.

This new capability is fully integrated with SGLang’s advanced features, including:

    Data-Parallel Attention (DP Attention)
    Expert Parallelism Load Balancer (EPLB)
    DeepEP MoE
    Two Batch Overlap
    Prefill-Decode (PD) Disaggregation
    CUDA Graph
    Various Attention Backends

Performance Evaluation

We present a comprehensive evaluation of the performance gains enabled by fully integrating MTP into the SGLang serving framework, using the DeepSeek V3 model as the testbed. This analysis includes two case studies designed to highlight improvements under both small-scale and large-scale deployment settings.

Deployment Scenarios and Design Motivation

The small-scale deployment configuration was selected based on production requirements from a high-profile generative AI company that operates a latency-sensitive, developer-facing product. Specifically, the company required a minimum sustained output of 60.4 tokens/sec per rank to meet application-level service-level agreements (SLAs). This requirement guided the configuration used in our first case study. To assess scalability under heavier loads, we also evaluate MTP in a large-scale cluster setup.

Case Study 1: Small-Scale Deployment

In this scenario, we deploy two decoding nodes across a total of 16 H200 GPUs, running 2 concurrent requests per rank with an input sequence length of 65,536 tokens and an output sequence length of 4,096 tokens. As a baseline, we tested the case with no MTP and no overlap scheduling; the system achieves an output throughput of 51 tokens/sec per rank. Using overlap scheduling alone, a feature introduced in SGLang v0.4, we achieved 60.4 tokens/sec per rank, meeting the production threshold without the need for MTP. When MTP is enabled, the system significantly surpasses this benchmark:

    With a 3-token MTP window and topk=1, the system achieves a throughput of 81.5 tokens/sec per rank, with an average acceptance length of 2.18 tokens.
    With a 4-token MTP window and topk=1, throughput increases to 82.0 tokens/sec per rank, with an average acceptance length of 2.44 tokens.

These results represent a +60% improvement in output throughput compared to the baseline (i.e., no overlap scheduling and no MTP). This case demonstrates that MTP yields substantial performance gains even in smaller cluster settings with modest concurrency levels, allowing for scalable performance even within constrained GPU resource budgets.

| Overlap Scheduling | MTP | Throughput (tokens/sec) per rank |
|---|---|---|
| ❌ | ❌ | 51.0 (baseline) |
| ✅ | ❌ | 60.4 (+18.4% ↑) |
| ❌ | ✅ 3-token | 81.5 (+59.8% ↑) |
| ❌ | ✅ 4-token | 82.0 (+60.8% ↑) |

Case Study 2: Large-Scale Deployment

To evaluate scalability, as well as to demonstrate MTP support with large-scale EP and Prefill-Decode disaggregation, we expand to a 16-node cluster comprising 128 H200 GPUs with 4 prefill nodes and 12 decoding nodes, running 128 concurrent requests per rank with an input sequence length of 2,000 tokens and an output sequence length of 100 tokens. In this high-throughput environment, we configure decoding with topk = 1, step size = 1, and draft_token_num = 2.

When comparing MTP-enabled decoding to the baseline (i.e., no overlap scheduling and no MTP), we observe a +14.2% increase in output throughput, confirming that MTP provides measurable performance gains even at large scale under production workloads.

[Figure: Large-scale throughput graph (https://lmsys.org/images/blog/mtp/large_scale_throughput_hr.png)]

MTP Best Practices

To get started with Multiple Token Prediction in SGLang, enable it in your configuration and set draft_token_num to 2, a balanced, low-risk choice that provides reliable performance gains across most workloads. For setups with available GPU headroom, you can increase draft_token_num to 4 or even larger to further boost throughput, though the returns may taper off depending on how well the system maintains token acceptance rates. On the other hand, if your GPUs are already handling large batches or running near capacity, keeping the draft size at 2 or 3 is generally more efficient and avoids introducing additional load.
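
As a concrete starting point, the sketch below shows one way to enable speculative decoding when constructing an offline SGLang engine. The argument names mirror the `--speculative-*` server flags in recent SGLang releases and the topk = 1, draft_token_num = 2 settings used in the case studies above, but they may differ across versions, so treat this as a sketch to check against the documentation of your installed release.

    import sglang as sgl

    # Minimal sketch (not an official recipe): serve a DeepSeek V3 checkpoint with
    # MTP-style speculative decoding. Argument names mirror the --speculative-*
    # server flags in recent SGLang releases and may change between versions.
    llm = sgl.Engine(
        model_path="deepseek-ai/DeepSeek-V3",
        trust_remote_code=True,
        tp_size=8,                       # adjust to the GPUs available per node
        speculative_algorithm="EAGLE",   # DeepSeek's MTP draft head is handled via the EAGLE-style path
        speculative_num_steps=1,         # draft decode iterations per round
        speculative_eagle_topk=1,        # topk = 1, as in the benchmarks above
        speculative_num_draft_tokens=2,  # the recommended starting draft_token_num
    )

    print(llm.generate("Today is a sunny day", {"temperature": 0.0, "max_new_tokens": 32}))

Raising speculative_num_draft_tokens (for example to 3 or 4) is the knob the advice above refers to when it suggests increasing draft_token_num on GPUs with spare headroom.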

You can monitor acceptance rates in logs to fine-tune this parameter over time. If you're seeing average acceptance lengths consistently above 2, there's room to experiment with longer drafts. But if acceptance begins to drop, consider dialing it back to stay within your system’s comfort zone.

Future Work

- Large-Scale Optimization: We are continuing to optimize performance for large-scale MTP deployments, focusing on scheduling efficiency and memory bandwidth utilization across multi-node systems.

- Overlap Scheduling Compatibility: The current MTP implementation does not yet support overlap scheduling. We anticipate additional performance gains once MTP and overlap scheduling are integrated. Development on this feature is ongoing.

Acknowledgment

We would like to express our heartfelt gratitude to the following teams and collaborators. In particular, we extend our sincere thanks to the NVIDIA DGX Cloud team for providing powerful GPUs and for their exceptional support in ensuring operational excellence:

Eigen AI Team - Jinglei Cheng, Jiaqi Gu, Yipin Guo, Di Jin, Uill Liu, Zhijian Liu, Zilin Shen, Ryan Hanrui Wang, Wei-Chen Wang, Junyao Zhang and many others.

SGLang Team and Community - Kavio Yu, Qiaolin Yu, Boxin Zhang, Jinfu Deng, Yineng Zhang and many others.

xAI Team - Sehoon Kim, Ying Sheng, Lianmin Zheng, Sangbin Cho, Hanming Lu, Byron Hsu, Pranjal Shankhdhar, Cheng Wan and many others.

NVIDIA Team - Pen Chung Li from Enterprise Products, Carl Nygard, Lee Ditiangkin, Nathan Fisher from DGX Cloud and many others.
