Large Model Systems Organization
Deploying Kimi K2 with PD Disaggregation and Large-Scale Expert Parallelism on 128 H200 GPUs

 

This post describes how the trillion-parameter open-source MoE model Kimi K2 was deployed on 128 H200 GPUs using the OME and SGLang stack. By combining Prefill-Decode (PD) Disaggregation with Large-Scale Expert Parallelism (EP), the deployment decouples compute-intensive prefill from latency-sensitive decode and optimizes expert routing and load balancing for the MoE model. The approach delivers significantly higher throughput and cost efficiency (roughly $0.21 per million output tokens) and, through model-driven deployment, simplifies model management in production, providing a solid foundation for large-scale, low-latency, maintainable AI inference infrastructure.

🚀 Kimi K2 overview and deployment challenges: Kimi K2 is an advanced open-source MoE model released by Moonshot AI in 2025, with 1 trillion total parameters, 32 billion activated parameters, and long-context support. Deploying it at scale runs into computational sparsity, cross-node communication overhead, and load imbalance, and requires careful system design and deployment workflows to fully exploit the hardware and reduce cost.

💡 How OME and SGLang work together: OME (Open Model Engine) is a model-driven deployment framework that simplifies deploying complex models by abstracting model design, parallelism strategy, scaling, and runtime configuration. Combined with SGLang's PD Disaggregation and Large-Scale EP, it enables independent scaling of prefill and decode with low-latency serving, and tunes MoE performance through dynamic routing and load balancing.

⚙️ Key implementation details: The deployment registers the Kimi K2 model and the SGLang Serving Runtime with OME and rolls them out via Kubernetes. PD Disaggregation separates prefill (compute-intensive, large batches) from decode (latency-sensitive, high throughput); Large-Scale EP (e.g., 96 redundant experts and NUMA-aware GPU grouping) addresses MoE load balancing and GPU utilization while optimizing NVLink and PCIe performance.

📊 Performance and cost: Benchmarks on 128 H200 GPUs show 896k tokens/sec prefill throughput and 384k tokens/sec decode throughput. Compared with DeepSeek R1, Kimi K2 sustains higher per-node throughput despite being a larger model with more complex routing. At an H200 price of $2.3/hour, the cost is roughly $0.21 per million output tokens, which is very attractive for short-context scenarios.

🤝 Open collaboration and outlook: Every component of this deployment is open source and reproducible, the result of collaboration between the Mooncake and SGLang communities and NVIDIA DGX Cloud. Future work focuses on long-context scenarios, since agentic workloads for Kimi K2 routinely involve inputs of tens of thousands of tokens, to further unlock its advanced reasoning capabilities.

1️⃣ Introduction: Deploying the Most Advanced Open-Source MoE Model

Kimi K2 is currently the most advanced open-source Mixture-of-Experts (MoE) model available.

Released by Moonshot AI in 2025, it features:

    - 1 trillion total parameters
    - 32 billion activated parameters per token
    - 384 experts with dynamic routing
    - Multi-head Latent Attention (MLA) for long context support

Kimi K2 achieves strong performance in frontier knowledge, math, and coding, and is optimized for agentic tasks—not just answering questions but taking multi-step actions.

Moonshot AI open-sourced two versions:

    - Kimi-K2-Base: The foundation model for research and fine-tuning
    - Kimi-K2-Instruct: A post-trained model for general-purpose chat and agentic applications

For more details, please refer to the official Kimi K2 release.


Why Large-Scale Deployment Matters

Given the model's architecture, large-scale deployment is what fully leverages the hardware and drives down cost:

    - Serve More Requests, Faster: Higher throughput, lower latency, more concurrent sessions, and shorter queues.
    - Lower $/Token: Saturate hardware and amortize model load; efficiency improves at scale.

However, the large-scale deployment of trillion-scale MoE models presents unique challenges:

    - Computational sparsity in MoE layers necessitates large batch sizes to make matrix operations compute-intensive. Large-scale Expert Parallelism (EP) scales the parallelism strategy across more GPUs, aggregates requests from multiple devices, reduces per-GPU memory pressure, and frees up VRAM for larger KV caches, effectively increasing the batch size (see the sketch after this list).
    - Cross-node communication takes a significant share of step time and requires optimization.
    - Sparse expert activation leads to load imbalance across GPUs.
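To make the first point concrete, here is a minimal back-of-the-envelope sketch. It assumes 384 routed experts and 8 experts selected per token (the top-k value is taken from the public Kimi K2 configuration, not from this post) and uses illustrative batch sizes; under uniform routing it shows how aggregating tokens across many GPUs raises the work per expert:

```python
# Back-of-the-envelope: average tokens routed to each expert as the aggregate
# batch grows. 384 experts is from the model card; 8 experts per token is an
# assumption based on the public Kimi K2 config. Batch sizes are illustrative.
NUM_EXPERTS = 384
TOP_K = 8

def avg_tokens_per_expert(batch_tokens: int) -> float:
    """Expected tokens per expert under perfectly uniform routing."""
    return batch_tokens * TOP_K / NUM_EXPERTS

for batch in (32, 480, 480 * 16):  # one GPU's worth, one decode group, aggregated cluster
    print(f"batch={batch:6d} tokens -> ~{avg_tokens_per_expert(batch):6.1f} tokens/expert")
```

At small batches most experts see less than one token per step, so the expert matrix multiplications stay memory-bound; aggregating requests across devices with large-scale EP is what pushes them into the compute-bound regime.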

Efficient deployment of Kimi K2 on 128 H200 GPUs requires rethinking both system design and deployment workflows.

In this blog, we explain how we solved this problem using OME and SGLang.


2️⃣ Background: From DeepSeek R1 to Kimi K2

In May 2025, we published Deploying DeepSeek R1 with PD Disaggregation and Large-Scale EP, where we demonstrated:

    - Prefill-Decode (PD) Disaggregation to separate compute-heavy and latency-sensitive tasks
    - Large-Scale Expert Parallelism (EP) to handle MoE routing across 96 GPUs
    - 5× throughput improvement compared to vanilla tensor parallelism on H100s

At the same time, our OME blog introduced model-driven deployment, solving the operational gap between:

    - ML Engineers, who design complex serving strategies
    - Production Engineers, who need simple and reliable deployments

The OME insight—the model should drive deployment, not vice-versa—proved productive for scaling to Kimi K2’s 1T-parameter architecture. This transition required adapting DeepSeek’s PD Disaggregation and EP to Kimi K2’s 384 experts while maintaining high performance.


3️⃣ Our Solution: OME + SGLang PD Disaggregation + Large-Scale Expert Parallelism

For Kimi K2, we combined the strengths of OME and SGLang to create an optimized, scalable deployment pipeline.

Model-Driven Deployment with OME

OME (Open Model Engine) simplifies the deployment of advanced models like Kimi K2 by abstracting away the complexity of parallelism, sharding, scaling, and runtime configuration. With a declarative configuration model, OME enables production teams to deploy and manage large models without manual tuning or custom scripting.

OME Installation

Install OME directly from the OCI registry using the following commands:

helm upgrade --install ome-crd oci://ghcr.io/moirai-internal/charts/ome-crd --namespace ome --create-namespace
helm upgrade --install ome oci://ghcr.io/moirai-internal/charts/ome-resources --namespace ome

For detailed setup instructions, refer to the official OME installation guide.

Registering the Kimi K2 Model

To enable OME to manage the Kimi K2 model family, apply the following ClusterBaseModel resource:

kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/models/moonshotai/Kimi-K2-Instruct.yaml

Note: You may download the YAML file and customize the path field to specify where the model should be stored locally. OME will download the model directly from Hugging Face with optimized parallelism and automatically verify the artifact checksum to ensure integrity.

Installing the Latest SGLang Serving Runtime for Kimi K2

kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/runtimes/srt/kimi-k2-pd-rt.yaml

Deploying the Model

Once the model and runtime are registered, deploy the inference endpoint using:

kubectl apply -f https://raw.githubusercontent.com/sgl-project/ome/refs/heads/main/config/samples/isvc/moonshotai/kimi-k2-pd.yaml

With these declarative resources in place, OME will automatically handle model downloading, runtime orchestration, and endpoint provisioning—enabling scalable, production-grade inference for the Kimi K2 model family.

Interacting with the Model

This command forwards local port 8080 to the model service on port 80:

kubectl port-forward -n kimi-k2-instruct service/kimi-k2-instruct 8080:80

Leave this running in one terminal. It will route your local http://localhost:8080 to the SGLang router. After the port-forward is active, run the following in a second terminal:

curl -s -X POST http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer None' \
  -d '{
    "text": "The future of AI is",
    "max_new_tokens": 50,
    "temperature": 0.7
  }'
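If you prefer to script against the endpoint, a minimal Python equivalent of the curl call above might look like the following. It assumes the same port-forward is active and simply mirrors the request body shown above; the only dependency is the requests package:

```python
import requests

# Assumes `kubectl port-forward ... 8080:80` from the previous step is running.
URL = "http://localhost:8080/generate"

payload = {
    "text": "The future of AI is",
    "max_new_tokens": 50,
    "temperature": 0.7,
}

resp = requests.post(
    URL,
    json=payload,
    headers={"Authorization": "Bearer None"},  # matches the curl example above
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```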

OME Advantages & PD + DeepEP + Router Insights

OME (Open Model Engine) offers a declarative, production-ready framework for deploying large models like Kimi K2. It abstracts the complexities of GPU topology, distributed configuration, and runtime tuning—eliminating the need for custom orchestration logic. With a single ClusterServingRuntime definition, teams can launch optimized multi-node inference workloads at scale.

This configuration demonstrates a powerful setup leveraging Prefill-Decode (PD) disaggregation and Large Scale EP, enabling:

    - Disaggregated scaling of prefill and decode workloads with independent resource control
    - Low-latency decode via deepep-mode=low_latency and token-aware dispatch tuning
    - Advanced expert routing with ep-dispatch-algorithm=dynamic and enable-eplb
    - RDMA acceleration for high-throughput KV cache transfer (see the illustrative sketch after this list)
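As a rough illustration of how these options map onto SGLang server arguments, the sketch below assembles separate argument lists for the prefill and decode roles. This is a simplified sketch rather than the actual ClusterServingRuntime referenced above: the flag names come from the feature list and SGLang's PD disaggregation support, while the model path, the prefill dispatch mode, and anything else not named in this post are assumptions.

```python
# Illustrative only: how the PD + large-scale EP options above could surface as
# SGLang server arguments. The real deployment is defined by the
# ClusterServingRuntime YAML applied earlier; values not named in this post
# (e.g. the prefill deepep mode) are assumptions.
COMMON = [
    "--model-path", "moonshotai/Kimi-K2-Instruct",
    "--ep-dispatch-algorithm", "dynamic",  # dynamic expert routing (from the list above)
    "--enable-eplb",                       # expert-parallel load balancer
]

PREFILL_ARGS = COMMON + [
    "--disaggregation-mode", "prefill",    # compute-bound stage, large batches
    "--deepep-mode", "normal",             # assumption: throughput-oriented dispatch for prefill
]

DECODE_ARGS = COMMON + [
    "--disaggregation-mode", "decode",     # latency-sensitive stage
    "--deepep-mode", "low_latency",        # low-latency decode (from the list above)
]

print("prefill:", " ".join(PREFILL_ARGS))
print("decode: ", " ".join(DECODE_ARGS))
```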

The deployment is orchestrated by a lightweight SGLang Router, which provides:

    - Dynamic service discovery for prefill and decode nodes via label selectors
    - Auto-scaling capabilities independent of engine and decoder workloads
    - Least-privilege routing model, ideal for secure production environments
    - Optimized load balancing tailored for disaggregated serving patterns

Together, OME and the SGLang Router form a robust foundation for large-scale, low-latency, and maintainable inference infrastructure.

Prefill-Decode Disaggregation

We separate inference into two independent components:

| Stage | Role |
| --- | --- |
| Prefill | Handles large prompt ingestion (e.g., 2000-token inputs). This is compute-bound and benefits from large batch parallelism. |
| Decode | Handles autoregressive generation (e.g., 100-token outputs). This is latency-sensitive and optimized for high-throughput outputs. |

Prefill and Decode are deployed as independent services, each scaled and optimized separately.
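Conceptually, the split looks like two cooperating services: prefill consumes the whole prompt and produces the KV cache, which is then handed to decode (over RDMA in the real deployment) for token-by-token generation. The toy sketch below only illustrates this division of responsibilities; it is not how SGLang implements disaggregation:

```python
from dataclasses import dataclass

# Toy illustration of the prefill/decode split. In the real system the KV cache
# is a set of GPU tensors transferred between services over RDMA, not a Python object.

@dataclass
class KVCache:
    num_tokens: int  # context tokens the cache currently covers

def prefill_service(prompt_tokens: list[int]) -> KVCache:
    """Compute-bound: ingest the whole prompt in one pass and build the KV cache."""
    return KVCache(num_tokens=len(prompt_tokens))

def decode_service(kv: KVCache, max_new_tokens: int) -> list[int]:
    """Latency-sensitive: generate one token at a time against the cached context."""
    generated = []
    for _ in range(max_new_tokens):
        next_token = 0        # placeholder for the sampled token
        generated.append(next_token)
        kv.num_tokens += 1    # the cache grows as decoding proceeds
    return generated

kv = prefill_service(list(range(2000)))        # e.g., a 2000-token prompt
output = decode_service(kv, max_new_tokens=100)
print(f"cache covers {kv.num_tokens} tokens, generated {len(output)} output tokens")
```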


Large-Scale Expert Parallelism (EP)

Kimi K2 activates only a small subset of its 384 experts for each token. We implemented:

    - 96 redundant experts on decode nodes to balance MoE routing
    - NUMA-aware GPU grouping for optimal NVLink and PCIe utilization on H200 clusters

This design minimizes load imbalance and ensures even GPU utilization across the 128-card cluster.
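To illustrate the intuition behind redundant experts, the sketch below uses a simple greedy heuristic: given a synthetic load profile over 384 experts, it replicates the hottest experts into a budget of 96 redundant slots so that each replica serves a smaller share of traffic. This is only a conceptual sketch of the load-balancing idea, not SGLang's EPLB algorithm, and the load numbers are made up:

```python
import heapq
import random

# Conceptual sketch of redundant-expert placement: replicate the hottest experts
# so each replica carries less traffic. Synthetic loads; not SGLang's actual EPLB.
NUM_EXPERTS = 384
REDUNDANT_SLOTS = 96

random.seed(0)
load = [random.paretovariate(1.5) for _ in range(NUM_EXPERTS)]  # skewed expert popularity

replicas = {e: 1 for e in range(NUM_EXPERTS)}
heap = [(-load[e], e) for e in range(NUM_EXPERTS)]  # max-heap via negation
heapq.heapify(heap)

for _ in range(REDUNDANT_SLOTS):
    _, hottest = heapq.heappop(heap)     # expert with the highest load per replica
    replicas[hottest] += 1               # give it one more replica
    heapq.heappush(heap, (-load[hottest] / replicas[hottest], hottest))

before = max(load)
after = max(load[e] / replicas[e] for e in range(NUM_EXPERTS))
print(f"hottest replica load: {before:.2f} -> {after:.2f} "
      f"({sum(replicas.values())} replicas for {NUM_EXPERTS} experts)")
```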


4️⃣ Performance: 2000-Input, 100-Output Benchmark

We benchmarked Kimi K2 using a typical LLM serving workload on 128 H200 GPUs in a 1P1D configuration (one prefill group of 4 nodes and one decode group of 12 nodes):

| Metric | Value |
| --- | --- |
| Input Length | 2000 tokens |
| Output Length | 100 tokens |
| Decode Batch Size | 480 |

We use the same benchmark setup as in the DeepSeek R1 deployment blog. Longer outputs for agentic scenarios are left as future work.

Note: The prefill-to-decode ratio is workload-dependent. We prioritized decode nodes to maximize the KV Cache pool size, which is critical for scaling batch size to 480.


Cluster-Level Performance (128 × H200 GPUs)

| Metric | Value |
| --- | --- |
| Prefill Throughput | 896k tokens/sec |
| Decode Throughput | 384k tokens/sec |
| Cost per 1M Output Tokens | ~$0.21 (H200 at $2.30/hour) |
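The cost figure follows directly from the decode throughput and the assumed GPU price; the short check below reproduces it. It ignores prefill, input tokens, and any utilization overhead, so it is only a sanity check of the headline number:

```python
# Sanity check of the ~$0.21 per 1M output tokens figure.
GPUS = 128
GPU_PRICE_PER_HOUR = 2.3          # USD, H200 price assumed in the post
DECODE_TPS = 384_000              # cluster-wide output tokens per second

cluster_cost_per_hour = GPUS * GPU_PRICE_PER_HOUR        # $294.4 / hour
tokens_per_hour = DECODE_TPS * 3600                      # ~1.38e9 tokens / hour
cost_per_million = cluster_cost_per_hour / (tokens_per_hour / 1e6)
print(f"${cost_per_million:.3f} per 1M output tokens")    # ~$0.213
```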

Comparison to DeepSeek R1 Deployment

| Model | Experts | GPUs | Prefill Throughput (tokens/sec) | Decode Throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| DeepSeek R1 | 256 | 96 × H100 | 52.3k / node | 22.3k / node |
| Kimi K2 | 384 | 128 × H200 | 56k / node | 24k / node |
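The per-node numbers for Kimi K2 appear to be the cluster totals divided by the total node count (128 H200 GPUs = 16 nodes of 8 GPUs each); a quick check:

```python
# Per-node throughput derived from the cluster totals above.
NODES = 128 // 8                   # 16 nodes of 8 x H200
print(f"prefill: {896_000 / NODES / 1000:.0f}k tokens/sec per node")  # 56k
print(f"decode:  {384_000 / NODES / 1000:.0f}k tokens/sec per node")  # 24k
```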

Despite Kimi K2’s larger MoE and more complex routing, our deployment achieves:

    - Balanced expert activation, using the expert-parallel load balancer (EPLB)
    - High throughput per GPU, by applying SGLang's optimizations for the DeepSeek V3 architecture on H200

The next step involves evaluating and optimizing long-context scenarios. As K2 is a model designed for agentic tasks, it has been reported that the average input length in such scenarios can range from 30,000 to 50,000 tokens.


5️⃣ Conclusion: Trillion-Scale Inference at Scale

By combining OME, SGLang, PD Disaggregation, and Large-Scale Expert Parallelism, we deployed Kimi K2 on 128 H200 GPUs, achieving:

    - Cost-effective large-scale inference (~$0.21 per 1M output tokens on H200) for short-context scenarios, with ongoing work to optimize long-context scenarios
    - Simplified deployment workflows with model-driven configuration

All components of this deployment are fully open-source and reproducible. We welcome the community to build on this work.

This deployment was made possible not only by open collaboration between Mooncake and the SGLang community, but also through generous infrastructure support from NVIDIA DGX Cloud. NVIDIA provided the SGLang team with access to 128 H200 GPUs via DGX Cloud, enabling us to move Kimi K2 from model release to production-grade inference quickly. As a result, organizations can now leverage SGLang to serve Kimi K2 at scale, unlocking advanced reasoning capabilities with state-of-the-art performance.


Acknowledgments

We would like to express our heartfelt gratitude to the following teams and collaborators:

    - Mooncake Team: Boxin Zhang, Shangming Cai, Mingxing Zhang, and colleagues.
    - SGLang Team and community: Simo Lin, Jingyi Chen, Qiaolin Yu, Yanbo Yang, Yineng Zhang, and many others.

We extend our thanks to the MoonshotAI Team—including Shaowei Liu, Zhengtao Wang, Weiran He, Xinran Xu, and others—for their support in tuning the big beautiful model K2.

