MarkTechPost@AI — July 28, 2024
SGLang: A Structured Generation Language for Efficient Execution of Complex Language Model Programs

SGLang is a structured generation language for the efficient execution of complex language model programs, proposed by researchers from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University. SGLang targets the non-determinism and inefficiency of existing approaches to executing language model programs: it provides a language front end and a runtime back end that together simplify programming and accelerate execution. SGLang supports parallelism, handles multiple input modalities, and offers primitives for controlling parallelism and generation, so developers can easily build complex prompting workflows.

🚀 **SGLang addresses two main problems with existing ways of executing language model programs: non-determinism and inefficiency.** First, SGLang's language front end simplifies programming language model programs, making them easier to write and understand. Second, SGLang's runtime back end exploits the multi-call structure of language model programs to significantly improve execution efficiency, reducing redundant computation and further optimizing performance with techniques such as RadixAttention, compressed finite state machines, and API speculative execution.

💡 **SGLang provides several primitives for language model programs, including fork and join for controlling parallelism, and extend, gen, and select for generation.** These primitives let developers easily build complex prompting workflows and control how their programs execute.

📊 **SGLang outperforms existing programming and inference systems such as Guidance, vLLM, and LMQL in throughput across a range of workloads, models, and hardware configurations.** The researchers evaluated SGLang on NVIDIA A10G and A100 GPUs and observed significant performance gains across multiple benchmarks.

📈 **Future research directions for SGLang include: supporting more output modalities, extending RadixAttention across levels of the memory hierarchy, enabling RadixAttention with fuzzy semantic matching, adding higher-level primitives to SGLang, fixing the starvation problem in cache-aware scheduling, and improving the SGLang compiler's scheduling and memory planning, along with other advanced static optimizations.**

Recent advancements in LLM capabilities have increased their usability by enabling them to handle a broader range of general tasks autonomously. The existing methods for expressing and running LM programs, although widely used, remain inefficient. There are two main obstacles to effective LM program use. First, the non-deterministic character of LLMs makes programming LM programs tedious and complex: incorporating parallelism mechanisms, dealing with multiple input modalities, brittle output parsing, experimental tuning of prompts, and substantial string manipulation are commonplace in LM software development, and this complexity greatly diminishes the readability of even the most basic applications. Second, and most crucially, LM program execution wastes memory and computational resources on redundant calculations.

A group of researchers from Stanford University, UC Berkeley, Shanghai Jiao Tong University, and Texas A&M University introduced SGLang, a Structured Generation Language for LLMs, to take on these problems. The basic premise is to exploit the multi-call structure of LM programs in a systematic way to speed up their execution. The system comprises a front-end language and a back-end runtime. While the runtime speeds up the execution of LM programs, the front end makes them easier to write. Both components can operate separately or in tandem for optimal performance. Primitives are provided for controlling parallelism (fork and join) and generation (extend, gen, and select). Because SGLang interoperates with Python's libraries and control flow, users can build sophisticated prompting workflows with natural syntax, as in the sketch below.
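The following sketch, adapted from the paper's tip-expansion example, shows the front-end primitives in use; exact API details (module layout, argument names) may vary across SGLang versions and should be treated as illustrative.

```python
import sglang as sgl

@sgl.function
def expand_tips(s, topic):
    # Appending a string extends the prompt state (the `extend` primitive).
    s += f"Here are two tips for {topic}: 1. Balanced Diet. 2. Regular Exercise.\n"

    # `fork` creates parallel continuations that share the prefix above.
    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i + 1} into a paragraph:\n"
        f += sgl.gen("detail", max_tokens=128, stop="\n\n")  # free-form `gen`

    # Joining: the parent state collects the forked results.
    s += "Tip 1: " + forks[0]["detail"] + "\nTip 2: " + forks[1]["detail"] + "\n"

    # `select`-style constrained generation: pick one of a fixed set of options.
    s += "The more important tip is " + sgl.gen("winner", choices=["Tip 1", "Tip 2"])
```

After binding a backend (e.g., `sgl.set_default_backend(...)`), the program runs with `expand_tips.run(topic="staying healthy")`; the runtime batches and parallelizes the forked branches automatically.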

The team also presented an interpreter and a compiler for SGLang. The interpreter manages the prompt state as a stream and submits primitive operations to it for asynchronous execution, properly handling synchronization and intra-program parallelism. Further optimizations can be achieved by tracing and compiling the SGLang program. On the runtime side, the researchers propose several new optimizations to speed up the execution of SGLang programs. The first technique, RadixAttention, enables automatic KV cache reuse across multiple generation calls. Current inference engines wastefully discard a request's KV cache once processing finishes, which makes it impossible to reuse the cache for subsequent calls and drastically slows down execution. Instead, SGLang retains the KV cache of all requests in a radix tree and evicts entries with an LRU policy. The radix tree supports efficient matching, insertion, and eviction, so the KV cache is handled like a conventional cache, and a cache-aware scheduling approach lets the runtime exploit different reuse patterns.
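A minimal sketch of the idea behind RadixAttention follows; it is not SGLang's actual implementation. For brevity it uses a trie (one token per edge) rather than a true radix tree, which compresses single-child chains into multi-token edges, and it omits the interaction between eviction and in-flight requests.

```python
import time
from typing import Dict, List, Optional


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}     # next token id -> child node
        self.kv_handle: Optional[object] = None   # handle to cached KV tensors (hypothetical)
        self.last_used: float = 0.0               # timestamp for LRU eviction


class RadixKVCache:
    def __init__(self) -> None:
        self.root = Node()

    def match_prefix(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens` whose KV can be reused."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.last_used = time.monotonic()  # touch on the way down for LRU
            matched += 1
        return matched

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        """Record that the KV cache for `tokens` is now resident."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            node.last_used = time.monotonic()
        node.kv_handle = kv_handle

    # Eviction (omitted): free leaves in least-recently-used order until enough
    # KV memory is reclaimed, never evicting nodes on the path of a running request.
```

A new request first calls `match_prefix` on its token IDs; only the unmatched suffix needs a prefill pass, which is where the speedup on shared-prefix workloads (few-shot prompts, multi-turn chat, forked branches) comes from.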

The second method is a compressed finite state machine, which speeds up constrained decoding of structured outputs. Current systems enforce constraints only one token ahead, masking the probabilities of forbidden tokens, so they can decode just a single token per forward pass. SGLang instead analyzes the constraints and builds a compressed finite-state machine over them, merging multi-token single-path segments into one transition wherever feasible. This lets the runtime emit multiple tokens in a single step, decoding faster.
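A toy sketch of the compression step, under the assumption that the decoding FSM is given as a per-state map from legal token IDs to successor states (the names and representation are illustrative, not SGLang's):

```python
from typing import Dict, List, Tuple

# transitions[state][token_id] -> next_state
FSM = Dict[int, Dict[int, int]]


def compress_fsm(transitions: FSM) -> Dict[int, Tuple[List[int], int]]:
    """Map each state to (forced_token_run, landing_state).

    While a state has exactly one legal token, that token is forced, so the
    whole single-path chain can be emitted at once instead of one token per
    forward pass.
    """
    compressed: Dict[int, Tuple[List[int], int]] = {}
    for state in transitions:
        run: List[int] = []
        cur, seen = state, {state}
        while cur in transitions and len(transitions[cur]) == 1:
            token, nxt = next(iter(transitions[cur].items()))
            run.append(token)
            if nxt in seen:  # guard against cycles of forced tokens
                break
            seen.add(nxt)
            cur = nxt
        compressed[state] = (run, cur)
    return compressed
```

In JSON decoding, for example, a schema-mandated span such as `{"summary": "` is fully determined, so its tokens form a single-path chain the decoder skips through in one step.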

Finally, an API-only model, such as OpenAI's GPT-4, can be optimized for multi-call programs using SGLang. For this, the researchers present a third technique called API speculative execution. Agent control, reasoning, retrieval-augmented generation pipelines, JSON decoding, multi-turn chat, multi-modality processing, and few-shot learning benchmarks are some of the LLM applications created using SGLang.
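A heavily simplified sketch of the speculative idea for API models: when a program contains two consecutive `gen` calls separated by fixed template text, the runtime can over-generate on the first call and, if the model reproduces the bridging text, reuse the surplus as the second generation, saving a round trip. `call_api` is a hypothetical stand-in for a completions-API wrapper, not a real SDK function.

```python
def call_api(prompt: str, max_tokens: int) -> str:
    """Hypothetical wrapper around an API-only model such as GPT-4."""
    raise NotImplementedError


def two_gens_speculative(prefix: str, bridge: str, max_a: int, max_b: int):
    # Speculate: request enough tokens to cover BOTH generations in one call,
    # deliberately ignoring the first gen's natural stopping point.
    text = call_api(prefix, max_tokens=max_a + max_b)
    if bridge in text:
        # The model reproduced the template text between the two gens, so the
        # surplus can serve as the second generation: one round trip, not two.
        first, second = text.split(bridge, 1)
        return first, second
    # Speculation failed: keep the first generation and call again for the second.
    first = text[:max_a]  # crude cut; a real runtime applies proper stop rules
    second = call_api(prefix + first + bridge, max_tokens=max_b)
    return first, second
```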

On NVIDIA A10G and A100 GPUs, the team evaluated performance with various models, including Llama-7B/70B, Mixtral-8x7B, LLaVA-v1.5-7B (image), and LLaVA-NeXT-34B (video). The experimental results show that SGLang outperforms existing programming and inference systems, such as Guidance, vLLM, and LMQL, achieving up to 6.4× higher throughput across various workloads, models, and hardware configurations.

Even though SGLang has come a long way, certain limitations still point to promising directions for future research. These include: adding support for more output modalities to SGLang, extending RadixAttention across levels of the memory hierarchy (e.g., DRAM and disk), enabling RadixAttention with fuzzy semantic matching, adding higher-level primitives to SGLang, fixing the starvation problem in cache-aware scheduling, and improving the SGLang compiler's scheduling and memory planning, among other advanced static optimizations.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
