October 2, 2024
SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision

 

SGLang v0.3 has been released, bringing significant performance improvements and expanded support for new model architectures. Key updates include higher throughput for DeepSeek Multi-head Latent Attention (MLA), lower latency with torch.compile on small batch sizes, multimodal support for LLaVA-OneVision, and interleaved window attention support for Gemma-2, along with an overview of the related optimizations and collaborations.

🎯 SGLang v0.3 applies multiple optimizations to DeepSeek Multi-head Latent Attention (MLA), delivering 3x to 7x higher throughput across a range of models; the benchmarks were run in BF16 and FP8 on H100 GPUs, and notes on reproducing them are in the appendix.

🔥 Torch.compile has been integrated into SGLang to optimize the linear/norm/activation layers, with speedups observed at small batch sizes (1 to 32); SGLang is faster than gpt-fast at batch size 1, supports all online serving features, and the team is actively collaborating with the relevant upstream teams on further optimizations.

👀 LLaVA-OneVision support was integrated into SGLang v0.3 in collaboration with the LLaVA team, delivering strong results on single-image, multi-image, and video tasks; the server can be queried through the OpenAI-compatible vision API, model accuracy was validated on selected datasets with reported benchmark results, and the speedup is substantial.

💻 Gemma-2's interleaved window attention is fully supported in SGLang v0.3; with an optimized window-attention kernel and a refined KV cache manager, the full 8K context length is supported, whereas libraries lacking this feature can only run a 4K context length.

We're excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates:

- Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA)
- Up to 1.5x lower latency with torch.compile on small batch sizes
- Support for interleaved text and multi-image/video in LLaVA-OneVision
- Support for interleaved window attention and 2x longer context length in Gemma-2

In this blog post, we'll walk you through these key features. Please do not hesitate to report any issues or contribute ideas and code.

DeepSeek Multi-head Latent Attention (MLA) Throughput Optimizations

Multi-head Latent Attention (MLA) is a new attention variant introduced by the DeepSeek team to improve inference efficiency. Because it differs from standard attention mechanisms, existing open-source libraries have not fully optimized this operation. In SGLang v0.3, we implemented various optimizations for MLA, including weight absorption, grouped decoding kernels, FP8 batched MatMul, and FP8 KV cache quantization. Benchmark results show that SGLang v0.3 with MLA optimizations achieves 3x to 7x higher throughput than the baseline system. The benchmark measures the peak output throughput of these models with BF16 and FP8 on H100 GPUs (tensor-parallelism=1 for lite models and tensor-parallelism=8 for big models) on the ShareGPT dataset. Reproducible instructions are in the appendix, and a minimal numerical sketch of the weight-absorption idea follows it. While encouraging, there is still much room for improvement. We are actively working on more optimizations to fully reproduce the results from the DeepSeek paper. Related PRs: #905, #1060, #1138, #469, #1285, #1286.

Torch.compile Latency Optimizations

Torch.compile is a major feature of PyTorch 2.0. On NVIDIA GPUs, it performs aggressive fusion and generates highly efficient Triton kernels. We've integrated torch.compile into SGLang for linear/norm/activation layers, combining it with FlashInfer attention and sampling kernels. We turn on torch.compile for batch sizes 1 to 32, where we observed the most acceleration. With this combination, SGLang is faster than gpt-fast at batch size 1 and supports all online serving features, including continuous batching and RadixAttention for prefix caching. We are actively collaborating with the torch.compile and torchao teams to incorporate their latest optimizations into SGLang. To use torch.compile in SGLang, add --enable-torch-compile when launching the server. SGLang with torch.compile yields up to a 1.5x speedup in this benchmark. Reproducible instructions are in the appendix.

LLaVA-OneVision Support with Interleaved Text, Multi-Image, and Video

LLaVA-OneVision is the first open model to achieve state-of-the-art performance in three important computer vision scenarios: single-image, multi-image, and video tasks. We collaborated with the LLaVA team to integrate these capabilities into SGLang v0.3. You can launch a server and query it using the OpenAI-compatible vision API, which supports interleaved text, multi-image, and video formats. Usage details are available here. The authors validated the model's accuracy and reported benchmark results on the VideoDetailDescriptions and LLaVA-in-the-wild datasets (see #1123). SGLang achieves up to a 4.5x speedup over the authors' original implementation in HuggingFace/transformers.
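
For example, once a LLaVA-OneVision server is running, a multi-image request can be sent with the standard OpenAI Python client pointed at the local endpoint. The snippet below is only a minimal sketch: the model path, port, image URLs, and exact launch flags are assumptions for illustration, so adjust them to your deployment (see the usage details linked above).

# Assumed launch command (model path and port are placeholders):
#   python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port 30000
import openai

# SGLang exposes an OpenAI-compatible endpoint, so the stock client works.
client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                # Interleaved multi-image + text content in OpenAI vision format.
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/first.jpg"}},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/second.jpg"}},
                {"type": "text",
                 "text": "Describe what changed between these two images."},
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
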
Gemma-2 Support with Interleaved Window Attention

Google's Gemma-2 model uses interleaved window attention to reduce computational complexity for long contexts, alternating between local sliding window attention (4K context length) and global attention (8K context length) in every other layer. We enhanced SGLang v0.3 to fully support the 8K context length by leveraging the optimized window attention kernel from FlashInfer (which skips computation instead of masking) and refining our KV cache manager. Other libraries that lack this feature can only run with a 4K context length. You can launch the model with:

python3 -m sglang.launch_server --model-path google/gemma-2b

A toy sketch of the alternating attention-mask pattern also follows the appendix.

Acknowledgment

The DeepSeek MLA optimizations were contributed by Ke Bao and Yineng Zhang. The torch.compile optimizations were contributed by Liangsheng Yin. The LLaVA-OneVision contributions were made by Kaichen Zhang and Bo Li. The interleaved window attention was contributed by Ying Sheng. We also thank all 90+ open-source contributors.

Appendix

Benchmark Instructions for DeepSeek MLA

# DeepSeekCoder-V2-Lite (BF16)
## Launch a server
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --enable-mla --disable-radix --trust-remote-code
python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct --disable-log-requests --trust-remote-code --max-model-len 4096
## Run benchmark
python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --num-prompts 5000

# DeepSeekCoder-V2 (BF16)
## Launch a server
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --disable-radix --tp 8 --trust-remote-code --enable-mla
python3 -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-Coder-V2-Instruct --disable-log-requests --tensor-parallel-size 8 --trust-remote-code --max-model-len 4096
## Run benchmark
python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --num-prompts 5000

# DeepSeekCoder-V2 (FP8)
## Launch a server
python3 -m sglang.launch_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --enable-mla --quantization fp8 --kv-cache-dtype fp8_e5m2 --disable-radix --tp 8 --trust-remote-code
python3 -m vllm.entrypoints.openai.api_server --model neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --quantization fp8 --disable-log-requests --tensor-parallel-size 8 --trust-remote-code --max-model-len 4096
## Run benchmark
python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --num-prompts 5000

Benchmark Instructions for torch.compile

# SGLang
## Launch a server
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3-8B --enable-torch-compile
## Run benchmark
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input-len 128 --random-output-len 512 --random-range-ratio 1 --num-prompts 1

# vLLM
## Launch a server
python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-8B --disable-log-requests
## Run benchmark
python3 -m sglang.bench_serving --backend vllm --dataset-name random --random-input-len 128 --random-output-len 512 --random-range-ratio 1 --num-prompts 1
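
As referenced in the MLA section above, here is a minimal numerical sketch of the weight-absorption idea. It is illustrative only — the shapes and names are made up and this is not SGLang's kernel — but it shows why folding the key up-projection into the query lets attention scores be computed directly against the compressed KV latents, without materializing full-size keys at decode time.

# Weight absorption, in one-head miniature (illustrative sketch only).
import torch

torch.manual_seed(0)
seq_len, d_latent, d_head = 16, 512, 128

c_kv = torch.randn(seq_len, d_latent, dtype=torch.float64)  # compressed KV latent cache
W_uk = torch.randn(d_latent, d_head, dtype=torch.float64)   # up-projection: latent -> key
q = torch.randn(d_head, dtype=torch.float64)                # query for one head

# Naive path: decompress the latents into full keys, then score the query.
k = c_kv @ W_uk                  # (seq_len, d_head)
scores_naive = k @ q             # (seq_len,)

# Absorbed path: fold W_uk into the query once and score in latent space.
q_absorbed = W_uk @ q            # (d_latent,)
scores_absorbed = c_kv @ q_absorbed

print(torch.allclose(scores_naive, scores_absorbed))  # True
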
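
And a toy sketch of the interleaved window attention pattern described in the Gemma-2 section: alternating layers restrict attention to a local sliding window while the remaining layers attend over the full causal context. The sizes are scaled down, the choice of which layer parity is local is arbitrary here, and this only illustrates the masking pattern, not SGLang's FlashInfer-based kernel (which skips the masked computation entirely).

# Interleaved local/global causal masks, scaled down (illustrative only).
import torch

def layer_mask(seq_len: int, layer_idx: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                         # every layer is causal
    if layer_idx % 2 == 0:
        # "Local" layer: attend only to the previous `window` tokens.
        return causal & (i - j < window)
    # "Global" layer: plain causal attention over the full context.
    return causal

seq_len, window = 16, 4  # stand-ins for the 8K context and 4K window
for layer_idx in range(2):
    m = layer_mask(seq_len, layer_idx, window)
    print(f"layer {layer_idx}: keys visible to the last query =", int(m[-1].sum()))
# layer 0 (local): 4, layer 1 (global): 16
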
