Latent Space · January 19
Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)

This post covers sponsorships and applications for the AI Engineer Summit in NYC, then turns to DeepSeek v3's performance and its LM Arena ranking: DeepSeek v3 currently sits 7th in the world, making it the standout among open models. It also notes the trend of Chinese labs releasing very large open models, and the challenges those models pose at deployment time. Baseten was the first to bring DeepSeek v3 online, thanks to its H200 clusters, its collaboration with the DeepSeek team, and its support for SGLang. Finally, the post digs into the three pillars of mission-critical inference: performance at the model level, at the cluster level, and at the region level, with an emphasis on multi-region scaling.

🚀 DeepSeek v3: ranked 7th in the world on LM Arena and currently the best open-weights model, outperforming several well-known models.

🏢 The three pillars of mission-critical inference: model-level optimization, cluster-level horizontal scaling, and cross-region deployment; each is necessary but not sufficient, and together they support AI applications at scale.

🛠️ Baseten's technical edge: H200 clusters, close collaboration with the DeepSeek team, and early support for SGLang let it serve DeepSeek v3 first, and serve it efficiently.

🌐 Why multi-region scaling matters: to get around single-region GPU capacity limits, model replicas must be deployed across regions to meet high-concurrency demand and keep the service stable.

Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you.


Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 is rated the 7th best model in the world with a score of 1319, right under the full o1 model, Gemini 2, and 4o latest, and above o1-mini, Grok 2, Gemini 1.5 Pro, and Claude 3.5 Sonnet. This makes it the best open weights model in the world in January 2025.

There has been a big recent trend of Chinese labs releasing very large open-weights models, with Tencent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B parameters in size. However, these extra-large language models are very difficult to serve.

DeepSeek is slow to serve, per Artificial Analysis

Baseten was the first of the inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team, and their early support of SGLang, a new vLLM alternative out of UC Berkeley that is also used at frontier labs like xAI. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that you can use eight H200s in a node to inference DeepSeek v3 in FP8, taking into account KV cache needs.
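As a back-of-the-envelope check on that sizing (a sketch, not Baseten's actual capacity math): DeepSeek v3 has roughly 671B total parameters, so FP8 weights take about 671 GB, while an eight-GPU H200 node offers 8 × 141 = 1128 GB of VRAM.

```python
# Rough FP8 sizing for DeepSeek v3 on one 8x H200 node.
# Public figures: ~671B total parameters, 141 GB of HBM per H200.
PARAMS_B = 671            # total parameters, in billions
BYTES_PER_PARAM = 1       # FP8 stores one byte per parameter
H200_VRAM_GB = 141        # per-GPU memory
GPUS_PER_NODE = 8

weights_gb = PARAMS_B * BYTES_PER_PARAM              # ~671 GB of weights
node_vram_gb = H200_VRAM_GB * GPUS_PER_NODE          # 1128 GB per node
kv_cache_headroom_gb = node_vram_gb - weights_gb     # ~457 GB left for KV cache

print(f"weights: ~{weights_gb} GB, node VRAM: {node_vram_gb} GB, "
      f"KV-cache headroom: ~{kv_cache_headroom_gb} GB")
```

That headroom is what makes single-node FP8 serving workable: at BF16 the weights alone would be roughly 1.3 TB and would not fit on the node at all.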

We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together!

Since then, Philip Kiely also led a well-attended workshop on TensorRT-LLM at the 2024 World's Fair.

We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running Mission Critical Inference workloads at scale for some of the largest AI products in the world.

The Three Pillars of Mission Critical Inference

“I think it takes three things, and each of them individually is necessary but not sufficient:

    Performance at the model level: how fast are you running this one model on a single GPU, let's say. The framework that you use there can matter. The techniques that you use there can matter: the MLA technique, for example, that Yineng mentioned, or the CUDA kernels that are being used. But there are also techniques being used at a higher level, things like speculative decoding with draft models or with Medusa heads [sketched after this quote]. These are implemented in the different frameworks, or you can even implement them yourself, but they're not necessarily tied to a single framework. Using speculative decoding gets you massive upside when it comes to being able to handle high throughput. But that's not enough. Invariably, that one model running on a single GPU is going to get more traffic than it can handle.

    Performance at the cluster level: at that point, you need to horizontally scale it. That's not an ML problem. That's not a PyTorch problem. That's an infrastructure problem. How quickly do you go from a single replica of that model to 5, to 10, to 100? That's the second pillar that is necessary for running these mission-critical inference workloads.

    And what does it take to do that? Some people say, oh, you just need Kubernetes, and Kubernetes has an autoscaler, and that just works. That doesn't work for these kinds of mission-critical inference workloads, and you end up catching yourself wanting, bit by bit, to rebuild those infrastructure pieces from scratch. This has been our experience.

    Performance at the region level: and then going even a layer beyond that, Kubernetes runs in a single cluster, tied to a single region. When it comes to inference workloads and needing more and more GPUs, we're seeing that you cannot meet the demand inside of a single cloud's single region. In other words, when a single model wants to horizontally scale up to 200 replicas, each of which is, let's say, 2 H100s or 4 H100s or even a full node, you run into the limits of capacity inside that one region. What we had to build to get around that was the ability for a single model to have replicas across different regions [a toy placement sketch follows this quote]. So there are models on Baseten today that have 50 replicas in GCP East, 80 replicas in AWS West, and Oracle in London, etc.

The final one is wrapping the power of the first two pillars in a very good developer experience, to be able to afford certain workflows like the ones that I mentioned around multi-step, multi-model inference workloads, because more and more we're seeing that the market is moving towards those more complex workflows.”
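To make pillar one concrete, here is a minimal sketch of draft-model speculative decoding, purely illustrative rather than SGLang's actual implementation: a cheap draft model proposes a few tokens, and the expensive target model only verifies them. `draft_next` and `target_argmax` are hypothetical callables standing in for the two models.

```python
# Minimal greedy speculative decoding sketch (illustrative only).
# Real frameworks (SGLang, vLLM, TensorRT-LLM) verify all drafted tokens in
# one batched forward pass and use rejection sampling to exactly preserve
# the target model's output distribution; this greedy loop only shows the shape.
from typing import Callable

def speculative_step(
    prefix: list[int],
    draft_next: Callable[[list[int]], int],     # hypothetical: cheap model's next token
    target_argmax: Callable[[list[int]], int],  # hypothetical: big model's greedy next token
    k: int = 4,                                 # tokens drafted per step
) -> list[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # 1) Propose k tokens with the fast draft model.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) Verify with the target model.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_argmax(ctx)
        if tok == expected:
            accepted.append(tok)        # draft guessed right: the token is nearly free
            ctx.append(tok)
        else:
            accepted.append(expected)   # first mismatch: take the target's token, stop
            break
    return accepted
```

When the draft model's acceptance rate is high, each verification pass emits several tokens instead of one, which is where the throughput upside Amir describes comes from.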
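And for pillars two and three, a toy sketch of the control-plane logic the quote describes: derive a replica count from live traffic, then spill replicas across regions when the first region runs out of GPU capacity. The region names, capacities, and concurrency figure are all invented for illustration; Baseten's real scheduler is far more involved.

```python
# Toy autoscaling + multi-region placement (illustrative numbers only).
import math

def desired_replicas(inflight_requests: int, concurrency_per_replica: int) -> int:
    """Pillar two: enough replicas to absorb current traffic."""
    return math.ceil(inflight_requests / concurrency_per_replica)

def place_replicas(desired: int, capacity: dict[str, int]) -> dict[str, int]:
    """Pillar three: greedily fill regions until the replica count is met."""
    placement: dict[str, int] = {}
    remaining = desired
    for region, free_slots in capacity.items():
        take = min(remaining, free_slots)
        if take > 0:
            placement[region] = take
            remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        raise RuntimeError(f"short {remaining} replicas: all regions exhausted")
    return placement

# 12,800 concurrent requests at 64 per replica -> 200 replicas, more than
# any single region below can host, so the placement spills across clouds:
n = desired_replicas(12_800, 64)
print(place_replicas(n, {"gcp-us-east": 50, "aws-us-west": 80, "oci-london": 90}))
# {'gcp-us-east': 50, 'aws-us-west': 80, 'oci-london': 70}
```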

Show Notes

Full YouTube Episode

Please like and subscribe!

Timestamps

Baseten’s previous TensorRT-LLM workshop:
