NVIDIA Developer, February 16
Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding

This article introduces Qwen2.5-Coder, the family of code-generation large language models released by the Qwen team, and explores how NVIDIA TensorRT-LLM optimizes their inference to improve coding efficiency. It focuses on the lookahead decoding optimization in TensorRT-LLM, which accelerates Qwen2.5-Coder inference without any additional training or draft model and enables multiline code completion. Experiments show significant throughput gains for the Qwen2.5-Coder models with lookahead decoding on NVIDIA H100 and H200 GPUs, and the article walks through the steps to run lookahead decoding in TensorRT-LLM.

🚀 Qwen2.5-Coder is a family of advanced code-generation large language models released by the Qwen team, focused on code generation, reasoning, and fixing across popular programming languages.

💡 NVIDIA TensorRT-LLM optimizes the 1.5B, 7B, and 32B versions of Qwen2.5-Coder for high throughput and low latency, using techniques such as dynamic in-flight batching, KV caching, and lookahead decoding.

🔍 Lookahead decoding is a speculative decoding technique that generates multiple tokens in parallel, making full use of the GPU's parallel processing power to address the slow autoregressive decoding of LLMs. Each decoding step is split into a lookahead branch that generates n-grams and a verification branch that validates them.

⚙️ Lookahead decoding performance depends on three key parameters: window size (W), n-gram size (N), and verification set size (G). By tuning these parameters, Qwen2.5-Coder throughput can be boosted by up to 3.6x on NVIDIA H100 and H200 GPUs.

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation.

To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder, a family of advanced LLMs for code generation, reasoning, and fixing across popular programming languages. This post explores the benefits of the inference optimizations for Qwen2.5-Coder models supported in NVIDIA TensorRT-LLM, and the ease of deployment with NVIDIA NIM for transformative potential and coding efficiency.

Qwen2.5-Coder models

The Qwen2.5-Coder models have achieved state-of-the-art performance across popular academic benchmarks. NVIDIA TensorRT-LLM has optimized three popular models from the Qwen2.5-Coder family (the 1.5B, 7B, and 32B versions) for high throughput and low latency. TensorRT-LLM is a library for fast, efficient LLM inference and includes optimizations such as dynamic in-flight batching, KV caching, KV cache reuse, and several speculative decoding techniques. These optimizations help deliver performance improvements for the Qwen2.5-Coder models on popular programming languages such as Python, C++, Java, Bash, JavaScript, TypeScript, and Go, reaching a wider range of developers.

This post explores the lookahead decoding optimization and the performance boost it helps achieve. Without any additional training or additional draft models, developers can use the TensorRT-LLM high-level API to speed up Qwen2.5-Coder inference and generate multiline code completions.

Lookahead decoding

Lookahead decoding is a speculative decoding technique that addresses the slow autoregressive nature of LLMs. Each autoregressive decoding step generates only one token at a time, leaving the massive parallel processing power of NVIDIA GPUs underused and leading to low GPU utilization and lower throughput. We have previously discussed the throughput boost achievable with draft-target speculative decoding; here we discuss the benefits of the TensorRT-LLM lookahead decoding implementation, using the Qwen2.5-Coder models as an example.

Unlike the single-token generation of autoregressive decoding, lookahead decoding generates multiple tokens simultaneously, making full use of the parallel processing capabilities of the GPU and trading computation (FLOPs) for latency reduction. Moreover, lookahead decoding does not require the separate draft model needed for draft-target speculative decoding. Each decoding step is divided into two parallel branches, the lookahead branch and the verification branch (a toy sketch of this generate-and-verify step appears after Figure 1 below). Using the Jacobi iteration method, a classic nonlinear systems solver, the lookahead branch performs parallel decoding for future tokens by generating n-grams. The verification branch selects and verifies the promising n-gram candidates generated by the lookahead branch.

The lookahead algorithm is configured using three key parameters: window size (W), n-gram size (N), and verification set size (G).

Window size (W): The lookahead window size, which determines how many future tokens the algorithm attempts to predict in each step. A larger window lets the model look further ahead, helping generate more tokens in a single pass, which improves throughput while using GPU FLOPs efficiently.

N-gram size (N): The size of the n-grams used in the lookahead process. For example, a 5-gram is a contiguous sequence of 5 future tokens. Together with the window size, it defines the fixed-size, 2D window from which the lookahead branch generates n-grams along the Jacobi iteration trajectory.

Verification set size (G): The maximum number of speculations, or candidate n-grams, that the algorithm considers for verification in each step. It balances the trade-off between computational efficiency and exploring more possibilities.

Figure 1. Lookahead decoding workflow with (W, N, G) = (5, 3, 2). Image credit: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
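To make the generate-and-verify split concrete, below is a minimal toy sketch in Python. It is not the TensorRT-LLM implementation: greedy_next() is a fake "base model" that deterministically continues a fixed token pattern, and the candidate n-grams are hard-coded rather than produced by a lookahead branch. It only shows how a verification step can accept the longest candidate prefix that agrees with what the base model would have generated greedily, so several tokens land in a single decoding step.

def greedy_next(context):
    # Stand-in for one greedy autoregressive step of the base model.
    pattern = ["def", " fib", "(", "n", ")", ":", " ..."]
    return pattern[len(context) % len(pattern)]

def verify_step(context, candidate_ngrams):
    """Verification branch, conceptually: accept the longest candidate prefix
    that matches what the base model would have produced greedily."""
    best = []
    for ngram in candidate_ngrams:
        accepted, ctx = [], list(context)
        for tok in ngram:
            if tok != greedy_next(ctx):
                break
            accepted.append(tok)
            ctx.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # Fall back to plain autoregressive decoding: at least one token per step.
    return best or [greedy_next(list(context))]

context = ["def", " fib"]
# In real lookahead decoding the candidates come from the lookahead branch's
# Jacobi-iteration trajectory; here they are hard-coded for illustration.
candidates = [["(", "n", ")"], ["(", "x", ")"]]
print(verify_step(context, candidates))  # ['(', 'n', ')'] -- three tokens accepted in one step

In the actual algorithm, candidate generation and verification run in parallel on the GPU within the same forward pass, which is where the throughput gain comes from.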
Lookahead performance greatly depends on the base model, hardware, batch size, sequence length, and dataset. It is recommended to profile various configurations to find the best (W, N, G) configuration for a given setup. An optimal (W, N, G) tuple enables lookahead decoding to deliver improved throughput without any additional training, fine-tuning, or draft models. In our experiments sweeping (W, N, G) configuration values, we achieved 3.6x and 1.6x throughput speedups for the Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models, respectively. These speedups are measured in throughput (tokens/second) relative to the baseline without lookahead speculative decoding on NVIDIA H100 Tensor Core GPUs, as shown in Figure 2.

Figure 2. Qwen2.5-Coder models throughput boost on NVIDIA DGX H100 with TensorRT-LLM lookahead decoding. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H100; TP=1 and (W, N, G) = (8, 8, 8) for Qwen2.5-Coder 7B Instruct; TP=2 and (W, N, G) = (15, 15, 15) for Qwen2.5-Coder 32B Instruct; batch size = 1; TensorRT-LLM version 0.15.0.

Similar throughput speedups are achieved on NVIDIA H200 Tensor Core GPUs. Their higher memory bandwidth also raises the baseline throughput, leading to slightly lower speedups compared to H100 GPUs (Figure 3).

Figure 3. Qwen2.5-Coder models throughput boost on NVIDIA DGX H200 with TensorRT-LLM lookahead decoding. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H200; TP=1 and (W, N, G) = (8, 8, 8) for Qwen2.5-Coder 7B Instruct; TP=2 and (W, N, G) = (15, 15, 15) for Qwen2.5-Coder 32B Instruct; batch size = 1; TensorRT-LLM version 0.15.0.
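When sweeping (W, N, G) values, it can help to know the maximum draft length each configuration implies, since the engine build in the example script below derives max_draft_len from the chosen (W, N, G) using the relation quoted in its comment: (W + G - 1) * (N - 1) + (N <= 1 ? 0 : N - 2). The helper below is a convenience sketch, not part of the TensorRT-LLM API.

# Convenience sketch (not a TensorRT-LLM API): max draft length implied by a
# lookahead (W, N, G) configuration, per the relation in the example script below.
def lookahead_max_draft_len(w: int, n: int, g: int) -> int:
    return (w + g - 1) * (n - 1) + (0 if n <= 1 else n - 2)

# Configurations used for the measurements in Figures 2 and 3:
print(lookahead_max_draft_len(8, 8, 8))     # 111 -> Qwen2.5-Coder 7B Instruct runs
print(lookahead_max_draft_len(15, 15, 15))  # 419 -> Qwen2.5-Coder 32B Instruct runs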
Steps to run lookahead decoding with TensorRT-LLM

To reproduce these performance gains using lookahead speculative decoding within TensorRT-LLM, follow the steps below.

First, install TensorRT-LLM. The commands below are for Linux; refer to the TensorRT-LLM docs for Windows.

sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade setuptools && pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

Then run lookahead decoding in TensorRT-LLM using the high-level API.

# Command for Qwen2.5-Coder-7B-Instruct
from tensorrt_llm.llmapi import (LLM, BuildConfig, KvCacheConfig,
                                 LookaheadDecodingConfig, SamplingParams)


def main():
    # The end user can customize the build configuration with the BuildConfig class.
    # Max draft length is based on the (W, N, G) values and is calculated as:
    # (W + G - 1) * (N - 1) + (N <= 1 ? 0 : N - 2)
    build_config = BuildConfig(max_batch_size=128,
                               max_input_len=2048,
                               max_seq_len=4096,
                               max_num_tokens=16384,
                               max_draft_len=111)
    build_config.plugin_config.reduce_fusion = True
    build_config.plugin_config.use_paged_context_fmha = True
    build_config.plugin_config.multiple_profiles = True

    # The configuration for lookahead decoding: (W, N, G) = (8, 8, 8)
    lookahead_config = LookaheadDecodingConfig(max_window_size=8,
                                               max_ngram_size=8,
                                               max_verification_set_size=8)

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)

    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",
              kv_cache_config=kv_cache_config,
              build_config=build_config,
              speculative_config=lookahead_config)

    prompt = """Write a C++ program to find the nth Fibonacci number using recursion. Now we define a sequence of numbers in which each number is the sum of the three preceding ones. The first three numbers are 0, -1, -1. Write a program to find the nth number."""

    sampling_params = SamplingParams(lookahead_config=lookahead_config)

    output = llm.generate(prompt, sampling_params=sampling_params)
    print(output)


if __name__ == '__main__':
    main()

Summary

Lookahead speculative decoding enables a throughput boost on LLMs without any additional training, fine-tuning, or draft models. We presented benchmarked performance improvements on the Qwen2.5-Coder models. Visit build.nvidia.com to try the Qwen2.5-Coder models optimized with NVIDIA TensorRT-LLM for free. Qwen2.5-Coder models optimized with TensorRT-LLM have also been packaged as downloadable NVIDIA NIM microservices for ease of deployment.

Acknowledgments

We would like to thank Liwei Ma, Fanrong Li, Nikita Korobov, and Martin Marciniszyn Mehringer for their efforts in supporting this post.

