MarkTechPost@AI November 14, 2024
Researchers from Snowflake and CMU Introduce SuffixDecoding: A Novel Model-Free Approach to Accelerating Large Language Model (LLM) Inference through Speculative Decoding

 

Large language models (LLMs) play an increasingly important role in consumer and enterprise applications, but their token generation speed has long been a bottleneck. To address this, researchers have proposed SuffixDecoding, a model-free acceleration method. SuffixDecoding uses suffix tree indices built from previous outputs and the current inference request, avoiding the complexity of relying on draft models or additional decoding heads. The approach excels in complex multi-stage LLM pipelines, such as the AgenticSQL dataset, achieving up to 2.9x higher output throughput and 3x lower time-per-token latency. By scaling the reference corpus rather than relying on a draft model, SuffixDecoding offers an effective route to more efficient speculative decoding and helps unlock the full potential of large language models in real-world applications.

🤔 **SuffixDecoding is a model-free method for accelerating large language model inference, aimed at the problem of slow token generation.** It uses suffix tree indices to avoid the complexity of relying on draft models or additional decoding heads, improving efficiency and reducing GPU overhead.

🚀 **SuffixDecoding excels in complex multi-stage LLM pipelines.** For example, on the AgenticSQL dataset it achieves up to 2.9x higher output throughput and 3x lower time-per-token latency, significantly improving LLM performance.

💡 **SuffixDecoding improves speculative decoding efficiency by scaling the reference corpus rather than relying on a draft model.** This lets it draw on a much broader set of previous outputs to predict subsequent tokens more accurately, improving the reliability and efficiency of speculative decoding.

📊 **SuffixDecoding also performs well on open-ended tasks such as chat and code generation.** Compared with the SpecInfer baseline, it achieves up to 1.4x higher throughput and 1.1x lower time-per-token latency, demonstrating its applicability across different scenarios.

🔍 **SuffixDecoding generates more accurate speculated tokens, maximizing the speedup from speculative decoding.** It achieves a higher average number of accepted speculated tokens per verification step, indicating that its model-free suffix tree structure enables more effective speculative token generation.

Large language models (LLMs) have rapidly become a foundational component of today’s consumer and enterprise applications. However, fast token generation remains a persistent challenge, often becoming a bottleneck in emerging applications. For example, the recent trend of inference-time scaling uses much longer outputs to perform search and other complex algorithms, while multi-agent and pipelined LLM systems aim to enhance accuracy and reliability; both often suffer from long response times because they must wait on multiple processing stages. Addressing this need for accelerated token generation is crucial for the continued advancement and widespread adoption of LLM-powered applications.

Existing model-based speculative decoding methods have limitations that hinder their ability to effectively address the challenge of accelerating token generation in LLMs. First, these methods rely heavily on the size and quality of the draft model, which may not always be available, requiring costly training or fine-tuning to create a suitable model. Second, the integration of draft models and LLMs on GPUs can lead to complications and inefficiencies, such as conflicts between the draft model’s memory usage and the LLM’s key-value cache. To address these issues, recent work has explored incorporating additional decoding heads directly within the LLM to perform speculative decoding. However, these approaches still face similar challenges, as the additional heads require fine-tuning for each LLM and consume significant GPU memory. Overcoming these limitations is crucial for developing more robust and efficient techniques to accelerate LLM inference.

Researchers from Snowflake AI Research and Carnegie Mellon University introduce SuffixDecoding, a robust model-free approach that avoids the need for draft models or additional decoding heads. Instead of relying on separate models, SuffixDecoding utilizes efficient suffix tree indices built upon previous output generations and the current ongoing inference request. The process begins by tokenizing each prompt-response pair using the LLM’s vocabulary, extracting all possible suffixes (subsequences from any position to the end) to construct the suffix tree structure. Each node in the tree represents a token, and the path from the root to any node corresponds to a subsequence that appeared in the reference data. This model-free approach eliminates the complications and GPU overhead associated with integrating draft models or additional decoding heads, presenting a more efficient alternative for accelerating LLM inference.
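To make the data structure concrete, here is a minimal, illustrative sketch of how such a token-level suffix tree could be built in Python. The class names, the `max_depth` cap, and the method signatures are assumptions made for exposition, not the authors' implementation.

```python
from collections import defaultdict

class SuffixTreeNode:
    """One node per token; `count` records how often the root-to-node path occurs."""
    def __init__(self):
        self.children = defaultdict(SuffixTreeNode)  # token id -> child node
        self.count = 0

class SuffixTree:
    """Indexes every suffix of every tokenized document (prompt + response)."""
    def __init__(self, max_depth=64):
        self.root = SuffixTreeNode()
        self.max_depth = max_depth  # cap inserted suffix length to bound memory

    def add_document(self, token_ids):
        # Insert every suffix (from position i to the end) of the tokenized document.
        for i in range(len(token_ids)):
            node = self.root
            for tok in token_ids[i : i + self.max_depth]:
                node = node.children[tok]
                node.count += 1
```

Inserting all suffixes this way makes any subsequence that appears in the corpus reachable as a root-to-node path, which is the property the pattern-matching step described next relies on.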

For each new inference request, SuffixDecoding constructs a separate per-request suffix tree from the current prompt tokens. This design is crucial for tasks where the LLM output is expected to reference or reuse content from the input prompt, such as document summarization, question-answering, multi-turn chat conversations, and code editing. The suffix tree maintains frequency counts at each node to track how often different token sequences occur, enabling efficient pattern matching. Given any sequence of recent tokens from the current generation, SuffixDecoding can quickly traverse the tree to find all possible continuations that appeared in the prompt or previous outputs. At each inference step, SuffixDecoding selects the best subtree(s) of continuation tokens based on frequency statistics and empirical probability. These speculated tokens are then passed to the LLM for verification, which is carried out in a single forward pass thanks to a tree attention operator with a topology-aware causal mask.
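As a rough illustration of the lookup step, the sketch below walks the suffix tree with progressively shorter suffixes of the current context until a match is found, then returns that node's children as candidate continuations with their frequency counts. The function name and the longest-suffix-first matching heuristic are assumptions on our part; the paper's exact matching procedure may differ.

```python
def match_continuations(tree, context_tokens, max_pattern_len=32):
    """Find candidate continuations for the current prompt + generated tokens.

    Tries the longest suffix of `context_tokens` first, backing off to shorter
    suffixes until one is found in the tree; returns the matched node's children,
    i.e. possible next tokens with their frequency counts.
    """
    start_lo = max(0, len(context_tokens) - max_pattern_len)
    for start in range(start_lo, len(context_tokens)):
        node = tree.root
        for tok in context_tokens[start:]:
            child = node.children.get(tok)
            if child is None:
                node = None
                break
            node = child
        if node is not None:
            return node.children  # dict: token id -> SuffixTreeNode (with counts)
    return {}
```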

Similar to prior work like LLMA and Prompt Lookup Decoding, SuffixDecoding is a model-free approach that sources candidate sequences from a reference corpus. However, unlike previous methods that only considered small reference texts such as a handful of snippets or just the current prompt, SuffixDecoding is designed to utilize a much larger-scale corpus, consisting of hundreds or even thousands of previously generated outputs.

By operating on this larger reference corpus, SuffixDecoding can utilize frequency statistics in a more principled fashion to select likely candidate sequences. To enable fast production of these candidate sequences, SuffixDecoding builds a suffix tree over its reference corpus. The root node of the tree represents the beginning of a suffix from any document in the corpus, where a document is an output of a previous inference or the prompt and output of the current ongoing inference. The path from the root to each node represents a subsequence that appears in the reference corpus, and each child node represents a possible token continuation.

SuffixDecoding uses this suffix tree structure to perform efficient pattern matching. Given the prompt plus generated tokens of the current inference, it identifies a pattern sequence and walks the suffix tree to find all possible continuations that appeared in the reference corpus. While this can produce a large set of candidate sequences, SuffixDecoding employs a greedy expansion and scoring procedure to build a smaller, more likely speculation tree, which is then used in the final tree-based speculative decoding step.
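One way to read the greedy expansion and scoring step is as a best-first expansion over the matched subtree, where each candidate path is scored by its empirical probability (the product of conditional frequencies along the path) and expansion stops at a fixed token budget. The sketch below follows that reading and builds on the earlier hypothetical `SuffixTree` classes; it is an interpretation, not the authors' exact procedure.

```python
import heapq

def build_speculation_tree(continuations, budget=16):
    """Greedily select up to `budget` speculated token paths from the matched subtree.

    Each path is scored by its empirical probability, estimated as the product of
    conditional frequencies (child count / sibling total) along the path.
    """
    total = sum(child.count for child in continuations.values()) or 1
    # Min-heap keyed on negative score, so the highest-probability path pops first.
    heap = [(-child.count / total, (tok,), child) for tok, child in continuations.items()]
    heapq.heapify(heap)

    speculated = []  # list of (token path, score) forming the speculation tree
    while heap and len(speculated) < budget:
        neg_score, path, node = heapq.heappop(heap)
        speculated.append((path, -neg_score))
        sibling_total = sum(c.count for c in node.children.values()) or 1
        for tok, child in node.children.items():
            # Child score = parent score * conditional empirical probability.
            heapq.heappush(heap, (neg_score * child.count / sibling_total, path + (tok,), child))
    return speculated
```

Because the selected paths share prefixes, they naturally form a small tree that can be handed to the LLM for a single tree-attention verification pass, as described above.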

The end-to-end experimental results demonstrate the strengths of the SuffixDecoding approach. On the AgenticSQL dataset, which represents a complex, multi-stage LLM pipeline, SuffixDecoding achieves up to 2.9x higher output throughput and up to 3x lower time-per-output-token (TPOT) latency compared to the SpecInfer baseline. For more open-ended tasks like chat and code generation, SuffixDecoding still delivers strong performance, with up to 1.4x higher throughput and 1.1x lower TPOT latency than SpecInfer.

The evaluation also examines the effectiveness of SuffixDecoding’s speculative decoding capabilities. SuffixDecoding can achieve a significantly higher average number of accepted speculated tokens per verification step compared to the draft-model-based SpecInfer approach. This indicates SuffixDecoding’s model-free suffix tree structure enables more accurate and reliable speculative token generation, maximizing the potential speedup from speculative decoding without the overhead of maintaining a separate draft model.

This work presents SuffixDecoding, a model-free approach to accelerating LLM inference by utilizing suffix trees built from previous outputs. SuffixDecoding achieves competitive speedups against existing model-based speculative decoding methods across diverse workloads while being particularly well-suited for complex, multi-stage LLM pipelines. By scaling the reference corpus rather than relying on draft models, SuffixDecoding demonstrates a robust direction for improving speculative decoding efficiency and unlocking the full potential of large language models in real-world applications.


Check out the details here. All credit for this research goes to the researchers of this project.
