NVIDIA Developer, February 16
Optimizing Qwen2.5-Coder Throughput with NVIDIA TensorRT-LLM Lookahead Decoding

This article introduces Qwen2.5-Coder, the family of code-generation large language models released by the Qwen team, and explores how NVIDIA TensorRT-LLM optimizes their inference to improve coding efficiency. It focuses on the lookahead decoding optimization in TensorRT-LLM, which accelerates Qwen2.5-Coder inference without any additional training or draft model and enables multiline code completion. Experiments show significant throughput gains for the Qwen2.5-Coder models with lookahead decoding on NVIDIA H100 and H200 GPUs, and the article walks through the steps to run lookahead decoding in TensorRT-LLM.

🚀 Qwen2.5-Coder is a family of advanced code-generation large language models released by the Qwen team, focused on code generation, reasoning, and fixing across popular programming languages.

💡 NVIDIA TensorRT-LLM optimizes the 1.5B, 7B, and 32B versions of Qwen2.5-Coder for high throughput and low latency, using techniques such as dynamic in-flight batching, KV caching, and lookahead decoding.

🔍 Lookahead decoding is a speculative decoding technique that generates multiple tokens in parallel, making full use of the GPU's parallel processing power to address the slow autoregressive decoding of LLMs. Each decoding step is split into a lookahead branch that generates n-grams and a verification branch that validates them.

⚙️ Lookahead decoding performance depends on three key parameters: window size (W), n-gram size (N), and verification set size (G). By tuning these parameters, Qwen2.5-Coder throughput can be boosted by up to 3.6x on NVIDIA H100 and H200 GPUs.

Large language models (LLMs) that specialize in coding have been steadily adopted into developer workflows. From pair programming to self-improving AI agents, these models assist developers with various tasks, including enhancing code, fixing bugs, generating tests, and writing documentation.

To promote the development of open-source LLMs, the Qwen team recently released Qwen2.5-Coder, a family of advanced LLMs for code generation, reasoning, and fixing across popular programming languages. This post explores the benefits of the inference optimizations for Qwen2.5-Coder models supported in NVIDIA TensorRT-LLM, and the ease of deployment with NVIDIA NIM for transformative potential and coding efficiency.

Qwen2.5-Coder models

The Qwen2.5-Coder models have achieved state-of-the-art performance across popular academic benchmarks. NVIDIA TensorRT-LLM has optimized three popular models from the Qwen2.5-Coder family (the 1.5B, 7B, and 32B versions) for high throughput and low latency. TensorRT-LLM is a library for fast, efficient LLM inference and includes optimizations such as dynamic in-flight batching, KV caching, KV cache reuse, and several speculative decoding techniques. These optimizations help deliver performance improvements for the Qwen2.5-Coder models on popular programming languages such as Python, C++, Java, Bash, JavaScript, TypeScript, and Go, reaching a wider range of developers.

This post explores the lookahead decoding optimization and the performance boost it helps achieve. Without any additional training or additional draft models, developers can use the TensorRT-LLM high-level API to speed up Qwen2.5-Coder inference and generate multiline code completions.

Lookahead decoding

Lookahead decoding is a speculative decoding technique that addresses the slow autoregressive nature of LLMs. Each autoregressive decoding step generates only one token at a time, leaving the massive parallel processing power of NVIDIA GPUs underused and leading to low GPU utilization and lower throughput. We have previously discussed the throughput boost achievable with draft-target speculative decoding; here we discuss the benefits of the TensorRT-LLM lookahead decoding implementation, using the Qwen2.5-Coder models as an example.

Unlike the single-token generation of autoregressive decoding, lookahead decoding generates multiple tokens simultaneously, making full use of the parallel processing capabilities of the GPU and trading computation (FLOPs) for latency reduction. Moreover, lookahead decoding does not require the separate draft model needed for draft-target speculative decoding. Each decoding step is divided into two parallel branches, the lookahead branch and the verification branch (a toy sketch of this generate-and-verify step appears after Figure 1 below). Using the Jacobi iteration method, a classic nonlinear systems solver, the lookahead branch performs parallel decoding for future tokens by generating n-grams. The verification branch selects and verifies the promising n-gram candidates generated by the lookahead branch.

The lookahead algorithm is configured using three key parameters: window size (W), n-gram size (N), and verification set size (G).

Window size (W): The lookahead window size, which determines how many future tokens the algorithm attempts to predict in each step. A larger window lets the model look further ahead, helping generate more tokens in a single pass, which improves throughput while using GPU FLOPs efficiently.

N-gram size (N): The size of the n-grams used in the lookahead process. For example, a 5-gram is a contiguous sequence of 5 future tokens. Together with the window size, it defines the fixed-size, 2D window from which the lookahead branch generates n-grams along the Jacobi iteration trajectory.

Verification set size (G): The maximum number of speculations, or candidate n-grams, that the algorithm considers for verification in each step. It balances the trade-off between computational efficiency and exploring more possibilities.

Figure 1. Lookahead decoding workflow with (W, N, G) = (5, 3, 2). Image credit: Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
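To make the generate-and-verify split concrete, below is a minimal toy sketch in Python. It is not the TensorRT-LLM implementation: greedy_next() is a fake "base model" that deterministically continues a fixed token pattern, and the candidate n-grams are hard-coded rather than produced by a lookahead branch. It only shows how a verification step can accept the longest candidate prefix that agrees with what the base model would have generated greedily, so several tokens land in a single decoding step.

def greedy_next(context):
    # Stand-in for one greedy autoregressive step of the base model.
    pattern = ["def", " fib", "(", "n", ")", ":", " ..."]
    return pattern[len(context) % len(pattern)]

def verify_step(context, candidate_ngrams):
    """Verification branch, conceptually: accept the longest candidate prefix
    that matches what the base model would have produced greedily."""
    best = []
    for ngram in candidate_ngrams:
        accepted, ctx = [], list(context)
        for tok in ngram:
            if tok != greedy_next(ctx):
                break
            accepted.append(tok)
            ctx.append(tok)
        if len(accepted) > len(best):
            best = accepted
    # Fall back to plain autoregressive decoding: at least one token per step.
    return best or [greedy_next(list(context))]

context = ["def", " fib"]
# In real lookahead decoding the candidates come from the lookahead branch's
# Jacobi-iteration trajectory; here they are hard-coded for illustration.
candidates = [["(", "n", ")"], ["(", "x", ")"]]
print(verify_step(context, candidates))  # ['(', 'n', ')'] -- three tokens accepted in one step

In the actual algorithm, candidate generation and verification run in parallel on the GPU within the same forward pass, which is where the throughput gain comes from.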
Lookahead performance greatly depends on the base model, hardware, batch size, sequence length, and dataset. It is recommended to profile various configurations to find the best (W, N, G) configuration for a given setup. An optimal (W, N, G) tuple enables lookahead decoding to deliver improved throughput without any additional training, fine-tuning, or draft models. In our experiments sweeping (W, N, G) configuration values, we achieved 3.6x and 1.6x throughput speedups for the Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models, respectively. These speedups are measured in throughput (tokens/second) relative to the baseline without lookahead speculative decoding on NVIDIA H100 Tensor Core GPUs, as shown in Figure 2.

Figure 2. Qwen2.5-Coder models throughput boost on NVIDIA DGX H100 with TensorRT-LLM lookahead decoding. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H100; TP=1 and (W, N, G) = (8, 8, 8) for Qwen2.5-Coder 7B Instruct; TP=2 and (W, N, G) = (15, 15, 15) for Qwen2.5-Coder 32B Instruct; batch size = 1; TensorRT-LLM version 0.15.0.

Similar throughput speedups are achieved on NVIDIA H200 Tensor Core GPUs. Their higher memory bandwidth also raises the baseline throughput, leading to slightly lower speedups compared to H100 GPUs (Figure 3).

Figure 3. Qwen2.5-Coder models throughput boost on NVIDIA DGX H200 with TensorRT-LLM lookahead decoding. Data measured on 01/30/2025. Inference throughput (output tokens/second) speedups of Qwen2.5-Coder 7B Instruct and Qwen2.5-Coder 32B Instruct models. DGX H200; TP=1 and (W, N, G) = (8, 8, 8) for Qwen2.5-Coder 7B Instruct; TP=2 and (W, N, G) = (15, 15, 15) for Qwen2.5-Coder 32B Instruct; batch size = 1; TensorRT-LLM version 0.15.0.
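When sweeping (W, N, G) values, it can help to know the maximum draft length each configuration implies, since the engine build in the example script below derives max_draft_len from the chosen (W, N, G) using the relation quoted in its comment: (W + G - 1) * (N - 1) + (N <= 1 ? 0 : N - 2). The helper below is a convenience sketch, not part of the TensorRT-LLM API.

# Convenience sketch (not a TensorRT-LLM API): max draft length implied by a
# lookahead (W, N, G) configuration, per the relation in the example script below.
def lookahead_max_draft_len(w: int, n: int, g: int) -> int:
    return (w + g - 1) * (n - 1) + (0 if n <= 1 else n - 2)

# Configurations used for the measurements in Figures 2 and 3:
print(lookahead_max_draft_len(8, 8, 8))     # 111 -> Qwen2.5-Coder 7B Instruct runs
print(lookahead_max_draft_len(15, 15, 15))  # 419 -> Qwen2.5-Coder 32B Instruct runs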
Steps to run lookahead decoding with TensorRT-LLM

To reproduce these performance gains using lookahead speculative decoding within TensorRT-LLM, follow the steps below.

First, install TensorRT-LLM. The commands below are for Linux; refer to the TensorRT-LLM docs for Windows.

sudo apt-get -y install libopenmpi-dev && pip3 install --upgrade setuptools && pip3 install tensorrt_llm --extra-index-url https://pypi.nvidia.com

Then run lookahead decoding in TensorRT-LLM using the high-level API.

# Command for Qwen2.5-Coder-7B-Instruct
from tensorrt_llm.llmapi import (LLM, BuildConfig, KvCacheConfig,
                                 LookaheadDecodingConfig, SamplingParams)


def main():
    # The end user can customize the build configuration with the BuildConfig class.
    # Max draft length is based on the (W, N, G) values and is calculated as:
    # (W + G - 1) * (N - 1) + (N <= 1 ? 0 : N - 2)
    build_config = BuildConfig(max_batch_size=128,
                               max_input_len=2048,
                               max_seq_len=4096,
                               max_num_tokens=16384,
                               max_draft_len=111)
    build_config.plugin_config.reduce_fusion = True
    build_config.plugin_config.use_paged_context_fmha = True
    build_config.plugin_config.multiple_profiles = True

    # The configuration for lookahead decoding: (W, N, G) = (8, 8, 8)
    lookahead_config = LookaheadDecodingConfig(max_window_size=8,
                                               max_ngram_size=8,
                                               max_verification_set_size=8)

    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.4)

    llm = LLM(model="Qwen/Qwen2.5-Coder-7B-Instruct",
              kv_cache_config=kv_cache_config,
              build_config=build_config,
              speculative_config=lookahead_config)

    prompt = """Write a C++ program to find the nth Fibonacci number using recursion. Now we define a sequence of numbers in which each number is the sum of the three preceding ones. The first three numbers are 0, -1, -1. Write a program to find the nth number."""

    sampling_params = SamplingParams(lookahead_config=lookahead_config)

    output = llm.generate(prompt, sampling_params=sampling_params)
    print(output)


if __name__ == '__main__':
    main()

Summary

Lookahead speculative decoding enables a throughput boost on LLMs without any additional training, fine-tuning, or draft models. We presented benchmarked performance improvements on the Qwen2.5-Coder models. Visit build.nvidia.com to try the Qwen2.5-Coder models optimized with NVIDIA TensorRT-LLM for free. Qwen2.5-Coder models optimized with TensorRT-LLM have also been packaged as downloadable NVIDIA NIM microservices for ease of deployment.

Acknowledgments

We would like to thank Liwei Ma, Fanrong Li, Nikita Korobov, and Martin Marciniszyn Mehringer for their efforts in supporting this post.

