EnterpriseAI, October 30, 2024
Nvidia Debuts Enterprise Reference Architectures to Build AI Factories

To help companies scale their accelerated computing investments in a predictable manner, GPU giant Nvidia and its server partners have borrowed a page from high performance computing (HPC) and introduced Enterprise Reference Architectures (ERA). ERA provides blueprints that help customers scale their HPC compute infrastructure in a predictable and repeatable way, minimizing risk and improving efficiency. The ERA program will accelerate time to market for server makers while improving performance, scalability, and manageability. Nvidia says the approach also strengthens security while reducing complexity. So far, Nvidia has ERA agreements in place with Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, and more server makers are expected to join the program.

👨‍💻 **Introducing Nvidia's Enterprise Reference Architectures (ERA)**: ERA is designed to give enterprises a predictable and repeatable way to scale their HPC compute infrastructure, minimizing risk and maximizing efficiency. It provides a blueprint for customers building AI factories and accelerates time to market for server makers, while improving performance, scalability, and manageability.

🏗️ **Key benefits of ERA**: ERA aims to give enterprises a reliable blueprint for building AI factories, with the following advantages:
- **Predictability and repeatability**: ERA provides pre-validated server configurations, so enterprises can scale their compute infrastructure smoothly and avoid surprises during deployment.
- **Performance and scalability**: ERA draws on technologies such as Nvidia's Spectrum-X AI Ethernet platform and BlueField-3 DPUs to deliver the performance and scalability that enterprise AI computing demands.
- **Security**: ERA builds on tested and validated hardware and software components and is secure by design, protecting enterprise data and applications.
- **Reduced complexity**: by offering pre-configured solutions, ERA simplifies deployment and lowers management and maintenance costs.

🤝 **Partnerships and use cases**: The ERA program has signed up server makers including Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro to deliver validated solutions. ERA suits organizations of all sizes, from small startups to large enterprises, and its use cases include:
- training and inference for large language models (LLMs)
- image recognition and processing
- natural language processing
- machine learning and deep learning workloads.

📡 **Technical details**: ERA uses certified server configurations of Nvidia GPUs, CPUs, and network interface cards (NICs) to ensure performance and scalability. It supports several design patterns, such as "2-4-3" and "2-8-5", to suit clusters of different sizes. The goal of ERA's design is to turn enterprise data centers into AI factories that can keep pace with ever-growing AI compute demand.

The advent of generative AI has supersized the appetite for GPUs and other forms of accelerated computing. To help companies scale up their accelerated compute investments in a predictable manner, GPU giant Nvidia and several server partners have borrowed a page from the world of high performance computing (HPC) and unveiled the Enterprise Reference Architectures (ERA).

Large language models (LLMs) and other foundation models have triggered a gold rush for GPUs, and Nvidia has arguably been the biggest beneficiary. In 2023, the company shipped 3.76 million data center GPU units, more than a million more than in 2022. That growth hasn't eased up in 2024, as companies continue to scramble for GPUs to power GenAI, demand that has driven Nvidia to become the most valuable company in the world, with a market capitalization of $3.75 trillion.

Nvidia launched its ERA program today against this backdrop of a mad scramble to scale up compute to build and serve GenAI applications. The company’s goal with ERA is to provide a blueprint to help customers scale up their HPC compute infrastructure in a predictable and repeatable manner that minimizes risk and maximizes results.

Nvidia says the ERA program will accelerate time to market for server makers while boosting performance, scalability, and manageability. This approach also bolsters security, Nvidia says, while reducing complexity. So far, Nvidia has ERA agreements in place with Dell Technologies, Hewlett Packard Enterprise, Lenovo, and Supermicro, with more server makers expected to join the program.

“By bringing the same technical components from the supercomputing world and packaging them with design recommendations based on decades of experience,” the company says in a white paper on ERA, “Nvidia’s goal is to eliminate the burden of building these systems from scratch with a streamlined approach for flexible and cost-effective configurations, taking the guesswork and risk out of deployment.”

One of the Nvidia ERA reference configurations (Source: Nvidia ERA Overview white paper)

The ERA approach leverages certified server configurations of GPUs, CPUs, and network interface cards (NICs) that Nvidia says are "tested and validated to deliver performance at scale." These include the Nvidia Spectrum-X AI Ethernet platform and Nvidia BlueField-3 DPUs, among others.

The ERA is tailored toward large-scale deployments ranging from four to 128 nodes and containing anywhere from 32 to 1,024 GPUs, according to Nvidia's white paper (the low end works out to four 8-GPU nodes, the high end to 128 of them). This is the sweet spot where Nvidia sees enterprises turning their data centers into "AI factories." It is also a bit smaller than the company's existing NCP Reference Architecture, which is designed for larger-scale foundation model training starting at a minimum of 128 nodes and scaling up to 100,000 GPUs.

ERA calls for several different design patterns, depending on the size of the cluster. For instance, there is Nvidia’s “2-4-3” approach, which includes a 2U compute node that contains up to four GPUs, up to three NICs, and two CPUs. Nvidia says this can work on clusters ranging from eight to 96 nodes. Alternatively, there is the 2-8-5 design pattern, which calls for 4U nodes equipped with up to eight GPUs, five NICs, and two CPUs. This pattern scales from four up to 64 nodes in a cluster, Nvidia says.
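
The arithmetic behind these patterns is easy to check. The sketch below is a minimal Python model, not Nvidia tooling: the pattern names, per-node counts, and node ranges come from the article, while the class, function, and field names are illustrative assumptions, and the "up to" component counts are treated as fixed maxima.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DesignPattern:
    """One ERA design pattern: per-node component maxima and its valid cluster range."""
    name: str
    form_factor: str      # chassis height, e.g. "2U"
    gpus_per_node: int    # "up to" figure from the article, treated as a maximum
    nics_per_node: int
    cpus_per_node: int
    min_nodes: int
    max_nodes: int

# Per-node counts and node ranges as quoted in the article.
PATTERNS = {
    "2-4-3": DesignPattern("2-4-3", "2U", gpus_per_node=4, nics_per_node=3,
                           cpus_per_node=2, min_nodes=8, max_nodes=96),
    "2-8-5": DesignPattern("2-8-5", "4U", gpus_per_node=8, nics_per_node=5,
                           cpus_per_node=2, min_nodes=4, max_nodes=64),
}

def cluster_totals(pattern_name: str, nodes: int) -> dict:
    """Return total component counts for a fully populated cluster of `nodes` nodes."""
    p = PATTERNS[pattern_name]
    if not p.min_nodes <= nodes <= p.max_nodes:
        raise ValueError(f"{p.name} clusters span {p.min_nodes}-{p.max_nodes} nodes, got {nodes}")
    return {
        "nodes": nodes,
        "gpus": nodes * p.gpus_per_node,
        "nics": nodes * p.nics_per_node,
        "cpus": nodes * p.cpus_per_node,
    }

if __name__ == "__main__":
    # Four fully populated 2-8-5 nodes land exactly on the 32-GPU floor of ERA's range.
    print(cluster_totals("2-8-5", 4))   # {'nodes': 4, 'gpus': 32, 'nics': 20, 'cpus': 8}
    print(cluster_totals("2-4-3", 96))  # {'nodes': 96, 'gpus': 384, 'nics': 288, 'cpus': 192}
```

As the first example shows, the smallest 2-8-5 cluster lines up with the 32-GPU floor of the range Nvidia quotes, and the pattern names themselves appear to encode the per-node CPU, GPU, and NIC counts.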

Partnering with server makers on proven architectures for accelerated compute helps to move customers toward their goal of building AI factories in a fast and secure manner, according to Nvidia.

“The transformation of traditional data centers into AI Factories is revolutionizing how enterprises process and analyze data by integrating advanced computing and networking technologies to meet the substantial computational demands of AI applications,” the company says in its white paper.


This article first appeared on sister site BigDATAwire.
