MarkTechPost@AI 04月04日 04:10
UB-Mesh: A Cost-Efficient, Scalable Network Architecture for Large-Scale LLM Training

The article introduces UB-Mesh, an efficient, scalable, and cost-effective network architecture designed for large-scale LLM training. As LLMs grow, demand for compute and bandwidth surges, straining traditional AI data center architectures. UB-Mesh adopts an nD-FullMesh topology that favors short-range interconnects, reducing dependence on expensive switches and optical modules. Through innovative hardware design and an All-Path Routing (APR) mechanism, UB-Mesh lowers cost while delivering performance comparable to traditional Clos networks and improving availability. The results show that UB-Mesh offers significant advantages for building efficient, resilient AI training infrastructure.

🌐 LLM scaling creates enormous demand for compute and bandwidth; traditional AI data center architectures are under strain, calling for a rethink of network design.

💡 UB-Mesh adopts an nD-FullMesh topology that favors short-range interconnects, reducing reliance on switches and optical modules and lowering hardware costs.

⚙️ UB-Mesh strengthens traffic management through its All-Path Routing (APR) mechanism and includes a 64+1 backup system, improving fault tolerance.

💰 Compared with Clos networks, UB-Mesh achieves 2.04× cost efficiency in LLM training while maintaining near-equivalent performance.

As LLMs scale, their computational and bandwidth demands increase significantly, posing challenges for AI training infrastructure. Following scaling laws, LLMs improve comprehension, reasoning, and generation by expanding parameters and datasets, necessitating robust computing systems. Large-scale AI clusters now require tens of thousands of GPUs or NPUs, as seen in LLAMA-3’s 16K GPU training setup, which took 54 days. With AI data centers deploying over 100K GPUs, scalable infrastructure is essential. Additionally, interconnect bandwidth requirements surpass 3.2 Tbps per node, far exceeding traditional CPU-based systems. The rising costs of symmetrical Clos network architectures make cost-effective solutions critical, alongside optimizing operational expenses such as energy and maintenance. Moreover, high availability is a key concern, as massive training clusters experience frequent hardware failures, demanding fault-tolerant network designs.

Addressing these challenges requires rethinking AI data center architecture. First, network topologies should align with LLM training’s structured traffic patterns, which differ from traditional workloads. Tensor parallelism, responsible for most data transfers, operates within small clusters, while data parallelism involves minimal but long-range communication. Second, computing and networking systems must be co-optimized, ensuring effective parallelism strategies and resource distribution to avoid congestion and underutilization. Lastly, AI clusters must feature self-healing mechanisms for fault tolerance, automatically rerouting traffic or activating backup NPUs when failures occur. These principles—localized network architectures, topology-aware computation, and self-healing systems—are essential for building efficient, resilient AI training infrastructures.

Huawei researchers introduced UB-Mesh, an AI data center network architecture designed for scalability, efficiency, and reliability. Unlike traditional symmetrical networks, UB-Mesh employs a hierarchically localized nD-FullMesh topology, optimizing short-range interconnects to minimize switch dependency. Based on a 4D-FullMesh design, its UB-Mesh-Pod integrates specialized hardware and a Unified Bus (UB) technique for flexible bandwidth allocation. The All-Path Routing (APR) mechanism enhances data traffic management, while a 64+1 backup system ensures fault tolerance. Compared to Clos networks, UB-Mesh reduces switch usage by 98% and optical module reliance by 93%, achieving 2.04× cost efficiency with minimal performance trade-offs in LLM training.

UB-Mesh is a high-dimensional full-mesh interconnect architecture designed to enhance efficiency in large-scale AI training. It employs an nD-FullMesh topology, minimizing reliance on costly switches and optical modules by maximizing direct electrical connections. The system is built on modular hardware components linked through a UB interconnect, streamlining communication across CPUs, NPUs, and switches. A 2D full-mesh structure connects 64 NPUs within a rack, extending to a 4D full-mesh at the Pod level. For scalability, a SuperPod structure integrates multiple Pods using a hybrid Clos topology, balancing performance, flexibility, and cost-efficiency in AI data centers.
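To illustrate why a full-mesh topology trades switch ports for direct links, the link count of an nD-FullMesh can be sketched as below. The 8×8 split of a 64-NPU rack is an assumption for illustration; the paper's exact dimension sizes may differ.

```python
def fullmesh_links(dims):
    """Direct links in an nD-FullMesh: along each dimension, nodes that
    agree on every other coordinate are fully connected pairwise."""
    nodes = 1
    for d in dims:
        nodes *= d
    peers_per_node = sum(d - 1 for d in dims)  # (d-1) neighbors per dimension
    return nodes * peers_per_node // 2

# A 64-NPU rack modeled as an assumed 8x8 grid (2D full-mesh):
print(fullmesh_links([8, 8]))  # → 448 direct links, no switch in the path
```

Because these links span short physical distances within a rack or Pod, they can be cheap electrical connections rather than switched optical ones, which is the source of the reported switch and optical-module savings.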

To enhance the efficiency of UB-Mesh in large-scale AI training, the researchers employ topology-aware strategies for optimizing collective communication and parallelization. For AllReduce, a Multi-Ring algorithm minimizes congestion by efficiently mapping paths and utilizing idle links to enhance bandwidth. In all-to-all communication, a multi-path approach boosts data transmission rates, while hierarchical methods optimize bandwidth for broadcast and reduce operations. Additionally, the study refines parallelization through a systematic search, prioritizing high-bandwidth configurations. Comparisons with the Clos architecture reveal that UB-Mesh maintains competitive performance while significantly reducing hardware costs, making it a cost-effective alternative for large-scale model training.
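The paper's Multi-Ring AllReduce is topology-aware and multi-path; the single-ring algorithm it builds on can be sketched as follows. This is an illustrative simulation, not the authors' implementation, and chunks are scalars for brevity.

```python
def ring_allreduce(vectors):
    """Illustrative ring AllReduce: n nodes, each vector split into n
    chunks (scalars here). Returns per-node results, all equal to the sum."""
    n = len(vectors)
    data = [list(v) for v in vectors]          # data[i][c] = chunk c at node i
    # Reduce-scatter: n-1 steps; at step s node i forwards chunk (i - s) % n
    # to its ring successor, which accumulates it.
    for s in range(n - 1):
        sends = [(i, (i - s) % n) for i in range(n)]
        payload = [data[i][c] for i, c in sends]   # snapshot before mutation
        for (i, c), val in zip(sends, payload):
            data[(i + 1) % n][c] += val
    # All-gather: n-1 steps; the fully reduced chunk (i + 1) % n held by
    # node i circulates around the ring, overwriting stale copies.
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n) for i in range(n)]
        payload = [data[i][c] for i, c in sends]
        for (i, c), val in zip(sends, payload):
            data[(i + 1) % n][c] = val
    return data

# Three nodes reducing [1,2,3], [4,5,6], [7,8,9]: each ends with [12, 15, 18].
print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

A Multi-Ring variant runs several such rings over disjoint link sets concurrently, which is how otherwise-idle full-mesh links can be recruited to raise effective bandwidth.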

The UB IO controller incorporates a specialized co-processor, the Collective Communication Unit (CCU), to optimize collective communication tasks. The CCU manages data transfers, inter-NPU transmissions, and in-line data reduction using an on-chip SRAM buffer, minimizing redundant memory copies and reducing HBM bandwidth consumption. It also improves compute-communication overlap. Additionally, UB-Mesh efficiently supports massive-expert MoE models by leveraging hierarchical all-to-all optimization and load/store-based data transfer. In conclusion, the study introduces UB-Mesh, an nD-FullMesh network architecture for LLM training, offering cost-efficient, high-performance networking with 95%+ linearity, 7.2% improved availability, and 2.04× better cost efficiency than Clos networks.
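The hierarchical all-to-all idea — bundling messages bound for a remote group inside the source group before they cross the group boundary — can be sketched as follows. The grouping and bundling scheme here is a simplified assumption for illustration, not UB-Mesh's actual protocol.

```python
def hierarchical_all_to_all(msgs, group_size):
    """Illustrative hierarchical all-to-all: msgs[src][dst] is the message
    from node src to node dst. Messages bound for a remote group are bundled
    inside the source group first, so each pair of groups exchanges one
    bundled transfer instead of many fine-grained ones."""
    n = len(msgs)
    inbox = [[None] * n for _ in range(n)]     # inbox[dst][src]
    bundles = {}                               # (src_group, dst_group) -> msgs
    for src in range(n):
        for dst in range(n):
            sg, dg = src // group_size, dst // group_size
            if sg == dg:
                inbox[dst][src] = msgs[src][dst]   # intra-group: local links
            else:
                bundles.setdefault((sg, dg), []).append((src, dst, msgs[src][dst]))
    # One bundled transfer per (src_group, dst_group) pair crosses the
    # boundary; the destination group then scatters the bundle internally.
    for bundle in bundles.values():
        for src, dst, m in bundle:
            inbox[dst][src] = m
    return inbox
```

The payoff is that fine-grained cross-group traffic — the expensive long-range links in a localized topology — collapses to one larger transfer per group pair, with the fan-out handled over cheap intra-group links.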


Check out the Paper. All credit for this research goes to the researchers of this project.


