Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

cs.AI updates on arXiv.org, 5 hours ago

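This paper introduces Block, a distributed scheduling framework that uses contextual information from incoming requests to optimize load balancing and auto-provisioning in large language model serving frameworks. Using predictive scheduling, Block outperforms conventional schedulers in tests on a 12-GPU cluster, raising serving capacity and lowering latency.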

arXiv:2508.03611v1 Announce Type: cross Abstract: This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests. Unlike popular model serving systems that rely on monolithic and heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduling system to achieve low overhead, reliability, and scalability. It leverages the deterministic and predictable characteristics of LLM inference, such as host configurations, response lengths, and hardware performance, to make scheduling decisions based on accurately predicted metrics. Evaluation on a 12-GPU cluster shows that Block significantly outperforms heuristic schedulers, boosting serving capacity by up to 16.7% and reducing P99 tail latency by up to 49.5%. These performance gains remain consistent across diverse models, workloads, and configurations. Code and data are open-sourced.
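To make the idea of scheduling on predicted metrics concrete, here is a minimal Python sketch. It is not Block's implementation: the `Instance` and `Request` types, the `predicted_response_tokens` field (standing in for the output of some response-length predictor), and the simple tokens-divided-by-throughput latency estimate are all assumptions made for illustration. The only point it shows is the contrast with a count-based heuristic: the request goes to the instance with the lowest predicted completion time, not the one with the fewest queued requests.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical view of one serving instance."""
    name: str
    pending_tokens: int        # tokens still to be processed for queued work
    tokens_per_second: float   # assumed measured throughput of this host


@dataclass
class Request:
    """Hypothetical incoming request with a predicted response length."""
    prompt_tokens: int
    predicted_response_tokens: int  # assumed output of a length predictor


def request_tokens(req: Request) -> int:
    """Total tokens this request is expected to cost."""
    return req.prompt_tokens + req.predicted_response_tokens


def predicted_latency(inst: Instance, req: Request) -> float:
    """Estimate seconds until the request would finish on this instance.

    Illustrative model only: queued work plus this request's work,
    divided by the instance's throughput.
    """
    return (inst.pending_tokens + request_tokens(req)) / inst.tokens_per_second


def pick_instance(instances: list[Instance], req: Request) -> Instance:
    """Choose the instance with the lowest predicted completion time."""
    return min(instances, key=lambda inst: predicted_latency(inst, req))


def dispatch(instances: list[Instance], req: Request) -> Instance:
    """Pick an instance and account for the new work in its queue."""
    chosen = pick_instance(instances, req)
    chosen.pending_tokens += request_tokens(req)
    return chosen


if __name__ == "__main__":
    cluster = [
        Instance("slow-host", pending_tokens=1000, tokens_per_second=100.0),
        Instance("fast-host", pending_tokens=3000, tokens_per_second=500.0),
    ]
    req = Request(prompt_tokens=100, predicted_response_tokens=400)
    print(dispatch(cluster, req).name)
```

In this toy cluster the faster host is chosen even though it has more queued tokens, because its predicted completion time is lower. A scheduler that only compared queue depth would send the request to the slower host.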


Related tags: distributed scheduling, load balancing, large language models
