Large Model Systems Organization · December 4, 2024
SGLang v0.4: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs

SGLang v0.4 has been released, bringing significant performance improvements and new features. This version introduces a zero-overhead batch scheduler that raises throughput; implements a cache-aware load balancer that substantially improves cache hit rate and throughput; optimizes attention with data parallelism for DeepSeek models, speeding up decoding; and integrates xgrammar, greatly accelerating structured outputs. These optimizations make SGLang more efficient for LLM inference, giving users faster responses and a smoother experience. The release also ships usage instructions and benchmarks so users can get started and evaluate performance quickly.

🤔 **Zero-overhead batch scheduler:** By overlapping CPU scheduling with GPU computation, CPU overhead is reduced to nearly nothing, yielding a 1.1x throughput increase; the effect is most pronounced for small models and large tensor-parallelism sizes. This feature is enabled by default and requires no manual configuration.

🚀 **Cache-aware load balancer:** Predicts the cache hit rate of each worker and sends requests to the worker with the highest expected hit rate, maximizing cache utilization, improving throughput by up to 1.9x and raising the cache hit rate by up to 3.8x. Multi-node deployments and distributed setups are supported.

💡 **Data parallelism attention for DeepSeek models:** Tailored to the characteristics of DeepSeek models, attention is processed with data parallelism, which shrinks the KV cache, allows larger batch sizes, and improves decoding throughput by 1.9x. Currently only DeepSeek models are supported.

📊 **Faster structured outputs with xgrammar:** xgrammar is integrated as a new grammar backend, speeding up JSON decoding by up to 10x. Enable it by adding the `--grammar-backend xgrammar` flag.

We're excited to release SGLang v0.4, featuring significant performance improvements and new features:

- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with a 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.

This blog provides a walkthrough of these updates. We welcome your feedback and contributions!

**Zero-Overhead Batch Scheduler**

While LLM inference runs on GPUs, a substantial amount of work must also be done by the CPU, such as batch scheduling, memory allocation, and prefix matching. An unoptimized inference engine can spend as much as half of its time on this CPU overhead. SGLang has been known for its efficient batch scheduler from the start. In this new version, we pushed it to the extreme and achieved a near zero-overhead batch scheduler. The idea is simple and has been proposed in NanoFlow: overlap the CPU scheduling with the GPU computation. The scheduler runs one batch ahead and prepares all the metadata required for the next batch. This keeps the GPUs busy at all times and hides expensive overheads such as the radix cache operations. The related code is here. The implementation details involve resolving dependencies by creating future tokens and carefully scheduling CUDA events and synchronization. Below is an illustration of the overlapped CPU scheduler and GPU worker.

We verified the zero-overhead claim using the Nsight profiling system. In the figure below, there are 5 consecutive decoding batches, and there is no idle time on the GPU. (NOTE: This profile was obtained with the Triton attention backend; there is still a minor gap with the FlashInfer backend, which will be resolved in the next FlashInfer release.)

With this optimization, SGLang v0.4 squeezes the last bit of performance out of the GPU, achieving a 1.1x speedup over its previous version and a 1.3x speedup over other state-of-the-art baselines. The speedup is most significant for small models and large tensor parallelism sizes.

Usage: It is turned on by default, so you do not need to change anything!

Reproduce benchmark:

    # zero-overhead batch scheduler (v0.4)
    python3 -m sglang.launch_server --model meta-llama/Llama-3.2-3B-Instruct
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048

    # old batch scheduler (v0.3)
    python3 -m sglang.launch_server --model meta-llama/Llama-3.2-3B-Instruct --disable-overlap
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 500 --random-input 4096 --random-output 2048
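To make the overlap concrete, here is a minimal, self-contained sketch of the one-batch-ahead pattern. It only illustrates the idea and is not the actual SGLang scheduler (which coordinates CUDA events, future tokens, and radix-cache bookkeeping); `cpu_prepare_batch` and `gpu_forward` are hypothetical stand-ins that merely sleep.

```python
# Illustrative sketch: overlap CPU scheduling of batch k+1 with GPU compute of batch k.
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_prepare_batch(step):
    # Stand-in for CPU-side work: batch scheduling, memory allocation, prefix matching.
    time.sleep(0.002)
    return {"step": step, "metadata": f"batch-{step}"}

def gpu_forward(batch):
    # Stand-in for one GPU decoding step.
    time.sleep(0.005)
    return f"tokens for {batch['metadata']}"

def run_naive(num_steps):
    # The GPU idles while the CPU prepares each batch.
    for step in range(num_steps):
        gpu_forward(cpu_prepare_batch(step))

def run_overlapped(num_steps):
    # Run one batch ahead: while the GPU executes batch k,
    # a background thread prepares the metadata for batch k+1.
    with ThreadPoolExecutor(max_workers=1) as cpu:
        next_batch = cpu.submit(cpu_prepare_batch, 0)
        for step in range(num_steps):
            batch = next_batch.result()
            next_batch = cpu.submit(cpu_prepare_batch, step + 1)
            gpu_forward(batch)

for fn in (run_naive, run_overlapped):
    start = time.perf_counter()
    fn(200)
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f}s")
```

On a toy workload like this, the overlapped loop's runtime approaches the GPU time alone, which is the same effect the profiler trace above demonstrates for the real scheduler.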
**Cache-Aware Load Balancer**

SGLang v0.4 introduces a cache-aware load balancer for LLM inference engines. The load balancer predicts the prefix KV cache hit rate on each worker and selects the worker with the highest expected match rate. Testing shows up to a 1.9x throughput increase and a 3.8x hit rate improvement, with the benefits growing as the worker count increases. The figure below shows how a cache-aware load balancer differs from a naive round-robin load balancer for data parallelism. The cache-aware load balancer maintains an approximate copy of the radix trees on the workers. The tree is lazily updated with almost no overhead.

Here are some benchmark results. The new cache-aware router significantly improves throughput.

|                      | SGLang v0.4 | SGLang v0.3 |
|----------------------|-------------|-------------|
| Throughput (token/s) | 158596      | 82665       |
| Cache hit rate       | 75%         | 20%         |

The benchmark is conducted on a workload with multiple long prefix groups, where each group is perfectly balanced. Performance may vary with the characteristics of the workload, but the cache hit rate should improve significantly.

The key features of this router include:

- Multi-node support: Deploy workers across multiple machines and connect a single router to the distributed workers, allowing easy horizontal scaling while preserving cache awareness in a distributed setup.
- Cache-aware routing: Requests are sent to workers with a higher expected hit rate, and load balancing is performed to avoid imbalance (a minimal sketch of this idea follows at the end of this section).
- Communication-free design: No worker synchronization is required for cache state; instead, the router uses the information it already has to maintain an "approximate tree".
- High-performance implementation: Built in pure Rust for high concurrency with a low-overhead design, offering a 2x speedup compared to Python-based alternatives.
- Standalone package: Published as "sglang-router", includes Python bindings, and features a CLI for easy usage.

Usage:

Installation:

    pip install sglang-router

Co-launch workers and router (a drop-in replacement for the existing --dp-size parameter):

    python -m sglang_router.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
        --dp-size 8

Router-only launch (ideal for multi-node distributed processing):

    python -m sglang_router.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
        --dp-size 8

    python -m sglang_router.launch_router \
        --worker-urls http://worker1:8000 http://worker2:8000

Reproduce benchmark:

    # Hardware: 8x A100 80GB GPUs

    # Run benchmark
    python bench_serving.py \
        --host 127.0.0.1 \
        --port 30000 \
        --dataset-name generated-shared-prefix

    # Launch with router
    python -m sglang_router.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
        --dp-size 8

    # Launch without router (baseline)
    python -m sglang.launch_server \
        --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
        --dp-size 8

Learn more by reading the code. There is also a related paper, Preble (with a different design and implementation), which is also built on top of SGLang.
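As a rough illustration of the routing logic described in the feature list above, here is a small Python sketch. The real sglang-router is implemented in Rust with an approximate radix tree, lazy updates, and load balancing, none of which is reproduced faithfully here; the classes, worker URLs, and prompts below are made up for illustration.

```python
# Illustrative sketch of cache-aware routing: send each request to the worker
# whose previously routed prompts share the longest prefix with it.
class TrieNode:
    def __init__(self):
        self.children = {}

class ApproxPrefixTree:
    """A character-level trie approximating the prefixes cached on one worker."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, text):
        node = self.root
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())

    def match_len(self, text):
        # Length of the longest previously inserted prefix of `text`.
        node, length = self.root, 0
        for ch in text:
            if ch not in node.children:
                break
            node = node.children[ch]
            length += 1
        return length

class CacheAwareRouter:
    def __init__(self, workers):
        self.trees = {w: ApproxPrefixTree() for w in workers}

    def route(self, prompt):
        # Pick the worker with the longest shared prefix (a real router would
        # also balance load to avoid overloading a single worker).
        best = max(self.trees, key=lambda w: self.trees[w].match_len(prompt))
        self.trees[best].insert(prompt)  # lazily record what was sent where
        return best

router = CacheAwareRouter(["http://worker1:8000", "http://worker2:8000"])
print(router.route("You are a helpful assistant. Summarize this article: ..."))
print(router.route("You are a helpful assistant. Translate this article: ..."))  # shared prefix, same worker
```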
**Data Parallelism Attention for DeepSeek Models**

The most common parallelism strategy for inference is tensor parallelism. However, it might not be the most efficient strategy for certain models. For example, DeepSeek models use MLA and have only one KV head. If we use tensor parallelism on 8 GPUs, it leads to duplicated KV caches and wasted memory.

To overcome this, we have implemented data parallelism (DP) for the multi-head latent attention (MLA) mechanism to improve throughput for DeepSeek models. By adopting DP for the attention component, the KV cache is significantly reduced, allowing for larger batch sizes. In our DP attention implementation, each DP worker handles different types of batches (prefill, decode, idle) independently. The attention-processed data is all-gathered among all workers before entering the Mixture-of-Experts (MoE) layer, and after the MoE, the data is redistributed back to each worker. The figure below illustrates this idea.

Here are the benchmark results on 8x H100 80GB GPUs. With this optimization, SGLang v0.4 achieves 1.9x the decoding throughput of SGLang v0.3. We are working on further improving throughput by integrating expert parallelism for the MoE layers. You can check out the related PRs for data parallelism and expert parallelism.

Usage: Add the --enable-dp-attention option to turn on this feature. Currently, it is only supported for DeepSeek models.

Reproduce benchmark:

    # Hardware: 8x H100 80GB GPUs
    # If you see out-of-memory errors, try reducing --mem-fraction-static to a smaller value such as 0.75.

    # SGLang w/ DP attention (v0.4)
    python3 -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --enable-dp-attention --mem-fraction-static 0.78
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000

    # SGLang w/o DP attention (v0.3)
    python3 -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --disable-radix-cache --trust-remote-code --tp 8 --mem-fraction-static 0.78
    python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 1 --random-output 512 --random-range-ratio 1 --num-prompts 10000
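For intuition about the memory savings, here is a back-of-the-envelope sketch comparing per-GPU KV cache usage under tensor parallelism (where the single MLA KV head ends up replicated on every GPU) and under DP attention (where each worker caches only its own requests). The bytes-per-token size and the number of tokens in flight are illustrative assumptions, not measured values.

```python
# Rough arithmetic sketch; the constants below are assumptions for illustration only.
GPUS = 8
BYTES_PER_TOKEN = 1152        # assumed size of one compressed MLA KV entry, in bytes
TOKENS_IN_FLIGHT = 2_000_000  # assumed total tokens across all running requests

tp_per_gpu = TOKENS_IN_FLIGHT * BYTES_PER_TOKEN         # TP: every GPU holds a full copy
dp_per_gpu = TOKENS_IN_FLIGHT * BYTES_PER_TOKEN / GPUS  # DP attention: only this worker's share

print(f"TP attention : {tp_per_gpu / 1e9:.2f} GB of KV cache per GPU")
print(f"DP attention : {dp_per_gpu / 1e9:.2f} GB of KV cache per GPU")
# The memory freed by DP attention can be spent on a larger decode batch,
# which is where the decoding throughput gain comes from.
```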
**Fast Structured Outputs with XGrammar**

SGLang has been the fastest inference engine for JSON decoding thanks to its compressed finite state machine. With this new release, it becomes even faster by integrating a faster grammar backend, xgrammar. According to the benchmark results, SGLang + xgrammar can be up to 10x faster than other open-source solutions on JSON decoding tasks. You can learn more in the xgrammar blog post: https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar.

Usage: Add --grammar-backend xgrammar when launching the server.

    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --grammar-backend xgrammar

You can then query it with the OpenAI-compatible API. See an example at https://sgl-project.github.io/backend/openai_api_completions.html#JSON.
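Once the server above is running, you can request schema-constrained output through its OpenAI-compatible endpoint. The example below is a hedged sketch: the response_format shape follows the OpenAI json_schema convention, and the base URL assumes the default local port 30000; consult the linked documentation for the exact fields the server accepts.

```python
# Sketch of a JSON-constrained request against a locally launched SGLang server.
import openai

client = openai.OpenAI(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
}

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Give the capital of France and its population as JSON."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "capital_info", "schema": schema},
    },
    temperature=0,
)
print(response.choices[0].message.content)  # constrained decoding should yield JSON matching the schema
```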
**Acknowledgment**

The work in this blog post was mainly contributed by Byron Hsu, Ke Bao, Lianmin Zheng, Yineng Zhang, and Ziyi Xu. We thank Zhiqiang Xie, Liangsheng Yin, Shuo Yang, and Yilong Zhao for their discussions on the zero-overhead scheduler; Ying Sheng, Yichuan Wang, and Shiyi Cao for their discussions on the cache-aware load balancer; Jiashi Li for the discussion on data parallelism attention; and Yixin Dong for the amazing xgrammar library.

**Roadmap**

It has been a great year, and we delivered many features following our roadmap. The community is also growing healthily, with more developers and broader adoption. The focus of the next release will be on disaggregated prefill-decode, speculative decoding, multi-level radix cache, sequence parallelism, and more!