EnterpriseAI 2024年10月01日
Bridging the Infrastructure Gaps To Accommodate Skyrocketing AI Growth

 

The article examines AI's impact on infrastructure, stressing the need to raise operational standards to avoid outages. It covers adding AI projects to multi-tenant data centers, improving Kubernetes Ingress, building effective multi-tenancy, and optimizing GPU performance, outlining both the problems and their solutions.

🎯 Adding AI to existing infrastructure can lead to otherwise unavoidable outages; with many AI projects landing simultaneously in multi-tenant data centers, the risks are numerous, and the related issues must be audited, anticipated, and monitored.

🚀 Kubernetes is already embedded in HPC AI cluster operations; F5 BIG-IP provides a Kubernetes Ingress Controller for secured ingress to AI and HPC clustered applications, and BIG-IP Next SPK improves network visibility and efficiency while reducing TCO.

💻 Optimizing infrastructure for GPU scale is a key concern; offloading CPU network traffic to DPUs/IPUs brings several benefits, and deploying SPK simplifies the deployment of AI service middleware.

The general consensus that AI would “change everything” is proving correct when it comes to infrastructure impacts. The new imperative is to raise and maintain operational standards to avoid the outages that otherwise follow from adding AI to existing infrastructure. Already, key elements are straining to keep pace, and it is hard to tell which new AI workload might break them altogether.

That is especially true when you consider there are so many AI projects being added at the same time across multi-tenant data centers. With so much at stake and so little under your control, what can be done to ensure your AI workloads sail through at scale?

High risks exist when HPC clusters are too quickly deployed in multi-tenant data centers without proper network-level tenancy that aligns with existing systems. This can lead to delays, increased costs, new vulnerabilities, and even service disruptions. It's crucial to audit, anticipate, and monitor these issues to plan and act accordingly.

Improving Kubernetes Ingress

Kubernetes is already embedded within HPC AI cluster operations. By design, Kubernetes networking allows containerized processes to communicate directly within the cluster, and a flat network structure is maintained across each cluster to accomplish this. But when services need to interact with external applications or other segregated processes within the cluster, a Kubernetes Service resource must handle ingress communication. Managing ingress traffic also requires an additional infrastructure component, a Kubernetes Ingress Controller, which is not a standard network component.
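As a minimal sketch of the pattern described above, a Service fronting an in-cluster workload plus an Ingress resource routing external traffic to it might look like the following. All names (`ai-inference`, `ai.example.com`, the `f5` ingress class) are illustrative assumptions, not taken from the article; the ingress class must match whatever Ingress Controller is actually installed in the cluster.

```yaml
# Service: gives in-cluster pods labeled app=ai-inference a stable virtual IP.
apiVersion: v1
kind: Service
metadata:
  name: ai-inference
spec:
  selector:
    app: ai-inference
  ports:
    - port: 80          # port exposed inside the cluster
      targetPort: 8080  # port the container actually listens on
---
# Ingress: asks the installed Ingress Controller to route external HTTP
# traffic for ai.example.com to the Service above.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-inference
spec:
  ingressClassName: f5  # hypothetical class; depends on the controller deployed
  rules:
    - host: ai.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ai-inference
                port:
                  number: 80
```

Note that the Ingress resource itself is only a declaration: without a controller such as the one F5 provides running in the cluster, no traffic is actually routed.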

F5, renowned for decades of network load balancing, most commonly through its BIG-IP product suite, has expanded to offer higher-volume software and hardware designed to adapt networking infrastructure to accelerating AI growth. In short, F5 BIG-IP offers a Kubernetes Ingress Controller that provides secured ingress for AI and HPC clustered applications. Adding to its appeal, BIG-IP is a well-understood, trusted, and accepted component in enterprise data centers for both NetOps and SecOps teams.

Building for Effective Multi-tenancy

Kubernetes nodes use NodeIPs to manage inter-host routing within the cluster, which at first seems like a solid, distributed network design. And it mostly is if the cluster is dedicated to a single tenant. However, traffic from different security tenants within the cluster is sourced from the same NodeIP, making it difficult for traditional monitoring and security tools to differentiate between tenants. This lack of visibility complicates network security, particularly in HPC AI clusters.

Wide adoption of 5G by global network providers (telecoms) extended the scope of the problem, because adopting 5G also meant adopting multi-tenant Kubernetes clusters.

This need drove the evolution of F5 BIG-IP Next, a modular version of the network stack, out of which BIG-IP Next Service Proxy for Kubernetes (SPK) was born. Beyond ingress, SPK also manages egress, now an even more critical function. BIG-IP Next SPK lives both inside the Kubernetes clusters and inside the data center network fabric. SPK provides a distributed implementation of BIG-IP, controlled as a Kubernetes resource, which understands both Kubernetes namespace-based tenancy and the network-segregation tenancy required by the data center networking fabric. SPK provides a critical central point of control for networking and security ingress and egress for Kubernetes clusters, improving visibility and efficiency and reducing TCO.

Optimizing GPU performance with CPU Offload

A third major area of concern is optimizing the infrastructure for GPU scale. HPC AI clusters that include GPUs have inter-service (East-West) networking requirements rivaling the mobile-traffic bandwidth of entire continents. None of the related issues are new to the HPC community, but they will be “new” news to most enterprise or network service operator teams.

Existing data centers are designed around CPUs for serial processing, so the addition of GPUs changes the dynamics of data traffic: the network is not automatically optimized to best utilize GPU parallel processing. To facilitate connectivity to these highly engineered HPC AI clusters, a new generation of network interface hardware, the SuperNIC DPU/IPU (data processing unit/infrastructure processing unit), is being integrated into Kubernetes cluster nodes. Offloading CPU network traffic to these DPUs/IPUs offers several benefits: it frees up CPU cycles, accelerates data movement, and reclaims GPU capacity.

In this example of an offload scenario with SPK deployed, the AI cluster DPUs contain not only the new versions of F5’s data plane but the full BIG-IP network stack. That opens the door to simplified AI service middleware deployments using BIG-IP for both the reverse-proxy ingress services and the forward-proxy egress services.

Conclusion and CTA

Fortifying infrastructure and bridging its gaps to deliver the agility, resilience, manageability, and security that AI workloads demand at massive scale, within the specific requirements of AI clusters being deployed at skyrocketing rates, is no easy undertaking. But it must be done to prevent the outages, lags, and breakage that can significantly slow or imperil AI application rollouts and adoption.

For a deeper technical dive, please read our free technical article on DevCentral, “Preparing Network Infrastructures for HPC Clusters for AI.”
