Fast and Expressive LLM Inference with RadixAttention and SGLang

SGLang is a Structured Generation Language designed for large language models (LLMs), aimed at making complex LLM tasks faster to execute and more controllable. It achieves this by co-designing the backend runtime system and the frontend language. On the backend, SGLang introduces RadixAttention, a technique that automatically and efficiently reuses the KV cache across multiple LLM generation calls, eliminating redundant memory use and computation. On the frontend, SGLang provides a flexible domain-specific language embedded in Python for controlling the generation process, which can run in either interpreter mode or compiler mode. Together, these components improve the execution and programming efficiency of complex LLM programs. SGLang has been used to implement common LLM workloads, including agent, reasoning, extraction, chat, and few-shot learning tasks, evaluated with the Llama-7B and Mixtral-8x7B models on NVIDIA A10G GPUs. The results show that SGLang achieves up to 5x higher throughput than existing systems such as Guidance and vLLM.

🚀 **RadixAttention: Automatic KV Cache Reuse** In complex LLM programs, different prompts often share the same prefix, so the corresponding intermediate KV cache can be reused to avoid redundant memory use and computation. SGLang introduces RadixAttention, which manages KV cache reuse with a data structure called a radix tree. A radix tree can compactly store and look up prefixes of varying lengths, and it is paired with an LRU eviction policy to keep the cache hit rate high. RadixAttention handles a wide range of KV cache reuse patterns automatically, without manual configuration. For example, in few-shot learning, many questions share the same examples, so the KV cache for those examples can be reused; in multi-turn chat, the history of earlier turns can be reused, reducing the computation required for each new generation. RadixAttention is also compatible with existing techniques such as continuous batching and paged attention, and it can be easily extended to handle image tokens in multi-modal models.
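To make the reuse opportunity concrete, here is a minimal Python sketch with hypothetical prompts: a batch of few-shot questions that all start with the same example block, so the KV cache for that block only needs to be computed once. (The real runtime matches token sequences rather than characters; this is an illustration only.)

```python
# Hypothetical illustration of the sharing pattern RadixAttention exploits:
# a batch of few-shot prompts that all begin with the same example block,
# so the KV cache computed for that block can be reused across the batch.

FEW_SHOT_EXAMPLES = (
    "Q: 2 + 2 = ?\nA: 4\n"
    "Q: 3 + 5 = ?\nA: 8\n"
)

questions = ["7 + 6 = ?", "9 + 1 = ?", "4 + 4 = ?"]
prompts = [FEW_SHOT_EXAMPLES + f"Q: {q}\nA:" for q in questions]

def shared_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix (the runtime does this on token IDs)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Every prompt shares at least the few-shot block, so only the question and
# answer portions require fresh prefill computation.
assert all(shared_prefix_len(prompts[0], p) >= len(FEW_SHOT_EXAMPLES) for p in prompts)
```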

💻 **SGLang Frontend: Simplified LLM Programming** SGLang is a domain-specific language embedded in Python. It provides an easy-to-use set of APIs for expressing advanced prompting techniques, control flow, multi-modality, decoding constraints, and external interaction. SGLang programs can run on various backends, such as OpenAI, Anthropic, Gemini, and local models. SGLang offers the following key primitives (see the sketch after this list):
- **fork:** creates multiple parallel copies of a prompt for parallel generation.
- **gen:** invokes an LLM generation call and stores the result in a variable. The call is non-blocking, so multiple generation calls can run simultaneously.
- **[variable_name]:** retrieves a stored generation result.
- **choices:** imposes constraints on the generated output.
- **run:** executes an SGLang function with its arguments.
An SGLang program can also be traced as a dataflow graph and executed with a graph executor, which opens the door to compiler optimizations such as code movement, instruction selection, and auto-tuning.
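As a rough illustration of how these primitives fit together, here is a sketch in the spirit of the essay-judge example from the original post. It assumes the `sglang` package with `sgl.function`, `sgl.gen`, `fork`, `choices`, `run`, and `sgl.set_default_backend`; exact argument names may differ across versions, so treat it as a sketch rather than a definitive implementation.

```python
import sglang as sgl

@sgl.function
def essay_judge(s, essay):
    s += "Please read the following essay.\n" + essay + "\n"

    # fork: create three parallel copies of the prompt, one per dimension.
    dims = ["clarity", "structure", "grammar"]
    forks = s.fork(len(dims))
    for f, dim in zip(forks, dims):
        # gen: non-blocking generation; the result is stored under the given name.
        f += f"Judge the {dim} of the essay. " + sgl.gen("judgment", max_tokens=128)

    # Merge the parallel judgments back into the main prompt.
    for f, dim in zip(forks, dims):
        s += f"{dim}: " + f["judgment"] + "\n"

    s += "Summarize the judgments above. " + sgl.gen("summary", max_tokens=128)
    # choices: constrain decoding to one of the listed options.
    s += "Final grade: " + sgl.gen("grade", choices=["A", "B", "C", "D"])

# run: execute the SGLang function against a configured backend
# (an OpenAI backend here; a local SGLang runtime endpoint works the same way).
sgl.set_default_backend(sgl.OpenAI("gpt-3.5-turbo"))
state = essay_judge.run(essay="...")
print(state["grade"])   # [variable_name]-style retrieval of a stored result
```

The forked branches issue their `gen` calls concurrently, which is where the intra-program parallelism mentioned above comes from; the main prompt simply waits for each branch's result when it is merged back in.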

🌟 **Performance Gains** SGLang was evaluated on a range of LLM workloads, including MMLU, HellaSwag, a ReAct agent, tree-of-thought, JSON decoding, chat (short), chat (long), DSPy RAG, and LLaVA Bench. Across these workloads it achieved up to 5x higher throughput than existing systems such as Guidance and vLLM, showing that SGLang substantially improves the execution efficiency of LLM programs. The SGLang code and a technical report have been publicly released.

by: Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, Jan 17, 2024

Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, control flow, and interaction with external environments. However, there is a notable deficiency in efficient systems for programming and executing these applications. To address this gap, we introduce SGLang, a Structured Generation Language for LLMs. SGLang enhances interactions with LLMs, making them faster and more controllable by co-designing the backend runtime system and the frontend language. On the backend, we propose RadixAttention, a technique for automatic and efficient KV cache reuse across multiple LLM generation calls. On the frontend, we develop a flexible domain-specific language embedded in Python to control the generation process. This language can be executed in either interpreter mode or compiler mode. These components work synergistically to enhance the execution and programming efficiency of complex LLM programs.

We use SGLang to implement common LLM workloads, including agent, reasoning, extraction, chat, and few-shot learning tasks, employing the Llama-7B and Mixtral-8x7B models on NVIDIA A10G GPUs. Figures 1 and 2 below demonstrate that SGLang achieves up to 5 times higher throughput compared to existing systems, namely Guidance and vLLM. We have released the code and a tech report.

Figure 1: Throughput of Different Systems on LLM Tasks (Llama-7B on A10G, FP16, Tensor Parallelism=1)

Figure 2: Throughput of Different Systems on LLM Tasks (Mixtral-8x7B on A10G, FP16, Tensor Parallelism=8)

In this blog post, we will begin by introducing the key optimizations we implemented in the backend, then move on to explaining the frontend APIs.

**Backend: Automatic KV Cache Reuse with RadixAttention**

During the development of the SGLang runtime, we identified a crucial optimization opportunity for complex LLM programs, which are poorly handled by current systems: KV cache reuse. KV cache reuse means different prompts with the same prefix can share the intermediate KV cache and avoid redundant memory and computation. In a complex program that involves multiple LLM calls, there can be various KV cache reuse patterns. Figure 3 below illustrates four such patterns, which are common in LLM workloads. While some systems are capable of handling KV cache reuse in certain scenarios, this often necessitates manual configurations and ad-hoc adjustments. Moreover, no existing system can automatically accommodate all scenarios, even with manual configurations, due to the diversity of possible reuse patterns.

Figure 3: KV cache sharing examples. Blue boxes are shareable prompt parts, green boxes are non-shareable parts, and yellow boxes are non-shareable model outputs. Shareable parts include few-shot learning examples, questions in self-consistency, chat history in multi-turn chat, and search history in tree-of-thought.

To systematically exploit these reuse opportunities, we introduce RadixAttention, a novel technique for automatic KV cache reuse during runtime. Instead of discarding the KV cache after finishing a generation request, our approach retains the KV cache for both prompts and generation results in a radix tree. This data structure enables efficient prefix search, insertion, and eviction.
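To make the data structure concrete, here is a small, self-contained Python sketch of a radix tree keyed by token sequences, with prefix matching, edge splitting on insert, and LRU eviction of leaf nodes. It stores no real KV tensors and omits many details of the actual system (paged GPU storage, reference counting for in-flight requests, cache-aware scheduling), so read it as an illustration of the idea rather than SGLang's implementation.

```python
import time
from typing import Dict, List, Tuple


def _shared_len(a, b) -> int:
    """Length of the longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class RadixNode:
    def __init__(self, key: Tuple[int, ...] = ()):
        self.key = tuple(key)                       # tokens labeling the edge into this node
        self.children: Dict[int, "RadixNode"] = {}  # first token of a child edge -> child
        self.last_access = time.monotonic()         # used by the LRU eviction policy
        # A real implementation would attach the KV cache pages for `key` here.


class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens already have cached KV entries."""
        node, i = self.root, 0
        while i < len(tokens) and tokens[i] in node.children:
            child = node.children[tokens[i]]
            m = _shared_len(child.key, tokens[i:])
            i += m
            if m < len(child.key):
                break                               # diverged in the middle of an edge
            child.last_access = time.monotonic()
            node = child
        return i

    def insert(self, tokens: List[int]) -> None:
        """Insert a token sequence, splitting an edge where the sequence diverges."""
        node, i = self.root, 0
        while i < len(tokens):
            first = tokens[i]
            if first not in node.children:
                node.children[first] = RadixNode(tokens[i:])
                return
            child = node.children[first]
            m = _shared_len(child.key, tokens[i:])
            if m < len(child.key):
                # Split the edge: keep the shared part, push the remainder down.
                mid = RadixNode(child.key[:m])
                child.key = child.key[m:]
                mid.children[child.key[0]] = child
                node.children[first] = mid
                child = mid
            i += m
            node = child
            node.last_access = time.monotonic()

    def evict_one_leaf(self) -> None:
        """Evict the least recently used leaf, freeing its (hypothetical) KV pages."""
        best = None                                 # (parent, edge_token, leaf)
        stack = [self.root]
        while stack:
            n = stack.pop()
            for tok, child in n.children.items():
                if not child.children:              # leaf node
                    if best is None or child.last_access < best[2].last_access:
                        best = (n, tok, child)
                else:
                    stack.append(child)
        if best is not None:
            del best[0].children[best[1]]


# Example: two chat prompts sharing a system-prompt prefix reuse its cache.
cache = RadixCache()
cache.insert([1, 2, 3, 10, 11])               # system prompt + first conversation
print(cache.match_prefix([1, 2, 3, 20, 21]))  # -> 3 tokens reusable for a second chat
```

In the real runtime, the tree structure lives on the CPU while the KV pages live in GPU memory, and only nodes whose cache is not referenced by running requests are eligible for eviction, as described below.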
We implement a Least Recently Used (LRU) eviction policy, complemented by a cache-aware scheduling policy, to enhance the cache hit rate.

A radix tree is a data structure that serves as a space-efficient alternative to a trie (prefix tree). Unlike typical trees, the edges of a radix tree can be labeled not just with single elements, but also with sequences of elements of varying lengths. This feature boosts the efficiency of radix trees. In our system, we utilize a radix tree to manage a mapping between sequences of tokens, which act as the keys, and their corresponding KV cache tensors, which serve as the values. These KV cache tensors are stored on the GPU in a paged layout, where the size of each page is equivalent to one token. Since GPU memory capacity is limited, we cannot retain an unbounded number of KV cache tensors, which necessitates an eviction policy. To tackle this, we implement an LRU eviction policy that recursively evicts leaf nodes.

Furthermore, RadixAttention is compatible with existing techniques like continuous batching and paged attention. For multi-modal models, RadixAttention can be easily extended to handle image tokens.

The figure below illustrates how the radix tree is maintained when processing several incoming requests. The frontend always sends full prompts to the runtime, and the runtime automatically performs prefix matching, reuse, and caching. The tree structure is stored on the CPU, and the maintenance overhead is small.

Figure 4. Examples of RadixAttention operations with an LRU eviction policy, illustrated across nine steps.

Figure 4 demonstrates the dynamic evolution of the radix tree in response to various requests. These requests include two chat sessions, a batch of few-shot learning inquiries, and self-consistency sampling. Each tree edge carries a label denoting a substring or a sequence of tokens. The nodes are color-coded to reflect different states: green for newly added nodes, blue for cached nodes accessed during the time point, and red for nodes that have been evicted.

In step (1), the radix tree is initially empty. In step (2), the server processes an incoming user message "Hello" and responds with the LLM output "Hi". The system prompt "You are a helpful assistant", the user message "Hello!", and the LLM reply "Hi!" are consolidated into the tree as a single edge linked to a new node. In step (3), a new prompt arrives and the server finds the prefix of the prompt (i.e., the first turn of the conversation) in the radix tree and reuses its KV cache. The new turn is appended to the tree as a new node. In step (4), a new chat session begins. Node "b" from (3) is split into two nodes to allow the two chat sessions to share the system prompt. In step (5), the second chat session continues. However, due to the memory limit, node "c" from (4) must be evicted. The new turn is appended after node "d" from (4). In step (6), the server receives a few-shot learning query, processes it, and inserts it into the tree. The root node is split because the new query does not share any prefix with existing nodes. In step (7), the server receives a batch of additional few-shot learning queries. These queries share the same set of few-shot examples, so we split node "e" from (6) to enable sharing. In step (8), the server receives a new message from the first chat session. It evicts all nodes from the second chat session (nodes "g" and "h") as they are least recently used.
In step (9), the server receives a request to sample more answers for the questions in node "j" from (8), likely for self-consistency prompting. To make space for these requests, we evict nodes "i", "k", and "l" from (8).

In the future, we envision that advanced multi-layer storage strategies and eviction policies can be developed.

**Frontend: Easy LLM Programming with SGLang**

On the frontend, we introduce SGLang, a domain-specific language embedded in Python. It allows you to express advanced prompting techniques, control flow, multi-modality, decoding constraints, and external interaction easily. An SGLang function can be run through various backends, such as OpenAI, Anthropic, Gemini, and local models.

Figure 5. The implementation of a multi-dimensional essay judge in SGLang.

Figure 5 shows a concrete example. It implements a multi-dimensional essay judge utilizing the branch-solve-merge prompting technique. This function uses LLMs to evaluate the quality of an essay from multiple dimensions, merges the judgments, generates a summary, and assigns a final grade. The highlighted regions illustrate the use of SGLang APIs:

(1) fork creates multiple parallel copies of a prompt.
(2) gen invokes an LLM generation and stores the result in a variable. The call is non-blocking, so multiple generation calls can run simultaneously in the background.
(3) [variable_name] retrieves the result of a generation.
(4) choices imposes constraints on the generation.
(5) run executes an SGLang function with its arguments.

Given such an SGLang program, we can either execute it eagerly through an interpreter, or trace it as a dataflow graph and run it with a graph executor. The latter case opens room for potential compiler optimizations, such as code movement, instruction selection, and auto-tuning. You can find more code examples in our GitHub repo and the details of compiler optimizations in our tech report.

The syntax of SGLang is largely inspired by Guidance. However, we additionally introduce new primitives and handle intra-program parallelism and batching. All of these new features contribute to the great performance of SGLang. You can find more examples in our GitHub repo.

**Benchmark**

We tested our system on the following common LLM workloads and report the achieved throughput:

- MMLU: A 5-shot, multi-choice, multi-task benchmark.
- HellaSwag: A 20-shot, multi-choice sentence completion benchmark.
- ReAct Agent: An agent task using prompt traces collected from the original ReAct paper.
- Tree-of-Thought: A custom tree-search-based prompt for solving GSM-8K problems.
- JSON Decode: Extracting information from a Wikipedia page and outputting it in JSON format.
- Chat (short): A synthetic chat benchmark where each conversation includes 4 turns with short LLM outputs.
- Chat (long): A synthetic chat benchmark where each conversation includes 4 turns with long LLM outputs.
- DSPy RAG: A retrieval-augmented generation pipeline from the DSPy tutorial.
- LLaVA Bench: Running LLaVA v1.5, a vision-language model, on the LLaVA-in-the-wild benchmark.

We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vllm v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.

As shown in Figures 1 and 2, SGLang outperformed the baseline systems in all benchmarks, achieving up to 5 times higher throughput. It also excelled in terms of latency, particularly first-token latency, where a prefix cache hit can be significantly beneficial.
These improvements are attributed to the automatic KV cache reuse with RadixAttention, the intra-program parallelism enabled by the interpreter, and the co-design of the frontend and backend systems. Additionally, our ablation study revealed no noticeable overhead even in the absence of cache hits, leading us to always enable the RadixAttention feature in the runtime. The benchmark code is available here.

**Adoption**

SGLang has been used to power the serving of the LLaVA online demo. It has also been integrated as a backend in DSPy. Please let us know if you have any interesting use cases!

**Conclusion**

As LLMs continue to evolve, they have the potential to be seamlessly integrated into complex software stacks, revolutionizing software development practices. LLMs can effectively function as intelligent library functions. To ensure their speed, flexibility, reliability, and controllability, it is crucial to co-design both the programming interfaces and the runtime systems for LLM-based functions and programs. SGLang represents our initial step towards achieving this goal. We invite the community to try SGLang and provide us with feedback.

**Links**

Code: https://github.com/sgl-project/sglang/
Paper: https://arxiv.org/abs/2312.07104

**Acknowledgement**

This project would not have been possible without the incredible open-source community. We gained insights from the designs and even reused some code from the following projects: Guidance, vLLM, LightLLM, FlashInfer, Outlines, LMQL. We thank Zihao Ye, Haotian Liu, Omar Khattab, Christopher Chou, and Wei-Lin Chiang for their early feedback.

**Citation**

@misc{zheng2023efficiently,
  title={Efficiently Programming Large Language Models using SGLang},
  author={Lianmin Zheng and Liangsheng Yin and Zhiqiang Xie and Jeff Huang and Chuyue Sun and Cody Hao Yu and Shiyi Cao and Christos Kozyrakis and Ion Stoica and Joseph E. Gonzalez and Clark Barrett and Ying Sheng},
  year={2023},
  eprint={2312.07104},
  archivePrefix={arXiv},
  primaryClass={cs.AI}
}
