EnterpriseAI, October 16, 2024
Nvidia’s Blackwell Platform Powers AI Progress in Open Compute Project

Nvidia announced it is contributing key design elements of its Blackwell accelerated computing platform to the Open Compute Project (OCP). The company said that key portions of the design of its full-rack Blackwell system, called the GB200 NVL72, will be contributed to the OCP community. This design information includes the GB200 NVL72 system's rack architecture, compute and switch tray mechanicals, liquid cooling and thermal environment specifications, and NVLink cable cartridge volumetrics. Designed to train LLMs of up to 27 trillion parameters, the system consists of 36 GB200 Grace Blackwell Superchips, comprising 36 Grace CPUs and 72 Blackwell GPUs. Nvidia says the system delivers 720 petaflops of training performance and 1.4 exaflops of inference performance. It is liquid cooled, and its NVLink interconnect technology, with 1.8 TB/s of bandwidth, allows it to act as a single massive GPU. Nvidia also announced it will broaden NVIDIA Spectrum-X support for OCP standards.

🚀 **Key Blackwell Platform Design Contributions**: Nvidia contributed key design elements of its full-rack Blackwell system (GB200 NVL72), including the rack architecture, compute and switch tray mechanicals, liquid cooling and thermal environment specifications, and NVLink cable cartridge volumetrics. The move aims to advance open hardware for AI and HPC and to broaden adoption of Nvidia's technology. By sharing key designs from the Blackwell platform, Nvidia enables better interoperability with other open systems and improved data center efficiency. Nvidia also announced it will broaden NVIDIA Spectrum-X support for OCP standards, further advancing the adoption of open standards.

🤖 **GB200 NVL72 Performance and Features**: The GB200 NVL72 system is designed to train LLMs of up to 27 trillion parameters and consists of 36 GB200 Grace Blackwell Superchips, comprising 36 Grace CPUs and 72 Blackwell GPUs. The system delivers 720 petaflops of training performance and 1.4 exaflops of inference performance. It is liquid cooled, and its NVLink interconnect technology, with 1.8 TB/s of bandwidth, allows it to act as a single massive GPU. The Blackwell GPU is Nvidia's newest chip, containing 208 billion transistors and made with TSMC's 4nm process. A single Blackwell GPU can train a 1-trillion-parameter model and, according to Nvidia, is up to 30x faster than the preceding Hopper GPU (the H100), while consuming less energy than the H100.

🤝 **Partnering with OCP to Advance AI and HPC**: Nvidia has collaborated with OCP for more than a decade, working alongside industry leaders to shape specifications and designs that can be widely adopted across the entire data center. Nvidia says that by advancing open standards, it is helping organizations worldwide take advantage of the full potential of accelerated computing and create the AI factories of the future. Nvidia's contribution to OCP represents an important trend in advancing open hardware for AI and HPC. By sharing design elements of the Blackwell platform, Nvidia enables broader access to its technology and improved interoperability with other open systems.

💡 **The Blackwell Platform in Practice**: Nvidia recently announced that the Blackwell platform is in full production. In Taiwan, Nvidia and Taiwanese electronics manufacturer Foxconn are jointly building the island's largest supercomputing project, the Hon Hai Kaohsiung Super Computing Center. The project will be built around the Blackwell architecture and will include 64 GB200 NVL72 racks and 4,608 Tensor Core GPUs. According to Nvidia, Foxconn plans to use the supercomputer for cancer research, LLM development, and smart city innovation, with full deployment expected by 2026.

🚀 **Advancing AI and Democratizing Access**: As AI models grow in size and complexity, particularly with the advent of multi-trillion-parameter models, the need for more powerful and scalable computing infrastructure becomes critical. Democratizing AI is not just about making software more available; it also means ensuring that the hardware required to train and deploy these models is within reach of a wider range of organizations. Scientific AI, in particular, demands robust computing infrastructure capable of handling vast datasets and sophisticated models, pushing the limits of traditional architectures. Nvidia's contribution to OCP helps bridge this gap by fostering open, scalable solutions that make advanced hardware accessible to more institutions participating in AI for scientific discovery.

Nvidia announced it has contributed foundational elements of its Blackwell accelerated computing platform design to the Open Compute Project (OCP).

Shared at the OCP Global Summit in San Jose today, Nvidia said that key portions of the design of its full rack Blackwell system, called the GB200 NVL72, will be contributed to the OCP community. This design information includes the GB200 NVL72 system’s rack architecture, compute and switch tray mechanicals, liquid cooling and thermal environment specs, and the NVLink cable cartridge volumetrics.

Designed to train LLMs of up to 27 trillion parameters, Nvidia's GB200 NVL72 rack consists of 36 GB200 Grace Blackwell Superchips, comprising 36 Grace CPUs and 72 Blackwell GPUs; each Superchip pairs one Grace CPU with two Blackwell GPUs. Nvidia says it delivers 720 petaflops of training performance and 1.4 exaflops of inferencing performance. The system is liquid cooled, and its NVLink interconnect technology, with a bandwidth of 1.8 TB/s, allows it to act as a single massive GPU.
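As a rough sanity check, those rack-level figures imply per-GPU shares of about 10 petaflops for training and about 19 petaflops for inference. A minimal back-of-the-envelope sketch in Python, using only the numbers quoted above (the precision formats behind the petaflop figures are not stated here, so these are simple shares of the quoted totals, not datasheet specs):

```python
# Back-of-the-envelope arithmetic from the rack-level figures quoted above.
# Precision formats (e.g., FP8 vs. FP4) are not stated in the article, so the
# per-GPU numbers below are simple shares of the totals, not datasheet specs.

SUPERCHIPS_PER_RACK = 36    # GB200 Grace Blackwell Superchips
GPUS_PER_SUPERCHIP = 2      # each Superchip pairs 1 Grace CPU with 2 Blackwell GPUs
TRAINING_PFLOPS = 720       # quoted rack-level training performance
INFERENCE_PFLOPS = 1_400    # 1.4 exaflops, expressed in petaflops

gpus = SUPERCHIPS_PER_RACK * GPUS_PER_SUPERCHIP       # 72 GPUs per rack
print(f"GPUs per rack:     {gpus}")
print(f"Training per GPU:  {TRAINING_PFLOPS / gpus:.1f} petaflops")   # ~10.0
print(f"Inference per GPU: {INFERENCE_PFLOPS / gpus:.1f} petaflops")  # ~19.4
```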

The namesake of the Blackwell rack is the Blackwell GPU, Nvidia's newest chip, which contains 208 billion transistors and is made with TSMC's 4nm process. A single Blackwell GPU can train a 1-trillion-parameter model, according to Nvidia, and is up to 30x faster than the preceding Hopper GPU, the H100. The chips also require less energy than the H100, Nvidia claims, noting that training a 1.8-trillion-parameter model once would have taken 8,000 Hopper GPUs and 15 megawatts of power, whereas it would now take only 2,000 Blackwell GPUs at a power consumption of 4 megawatts.
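Those power figures imply a concrete efficiency gain for the same fixed training job. A quick sketch of the arithmetic, treating the quoted megawatt numbers as sustained draw (wall-clock time is not quoted, so only GPU-count and power ratios can be derived here):

```python
# Efficiency comparison implied by Nvidia's figures for training a
# 1.8-trillion-parameter model once. The megawatt numbers are treated as
# sustained power for the same fixed job; training time is not quoted,
# so only GPU-count and power ratios can be derived from the article.

hopper = {"gpus": 8_000, "power_mw": 15}
blackwell = {"gpus": 2_000, "power_mw": 4}

gpu_ratio = hopper["gpus"] / blackwell["gpus"]            # 4.0x fewer GPUs
power_ratio = hopper["power_mw"] / blackwell["power_mw"]  # 3.75x less power

print(f"GPU count reduction: {gpu_ratio:.2f}x")
print(f"Power reduction:     {power_ratio:.2f}x")
print(f"Per-GPU power, Hopper:    {hopper['power_mw'] * 1e6 / hopper['gpus']:.0f} W")
print(f"Per-GPU power, Blackwell: {blackwell['power_mw'] * 1e6 / blackwell['gpus']:.0f} W")
```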

The GB200 NVL72 rack. Source: Nvidia

Other notable Nvidia contributions to OCP include the NVIDIA HGX H100 baseboard, which has become the de facto baseboard standard for AI servers, and the NVIDIA ConnectX-7 adapter, which now serves as the foundation design of the OCP Network Interface Card (NIC) 3.0. The company also announced it would broaden NVIDIA Spectrum-X support for OCP standards.

“Building on a decade of collaboration with OCP, Nvidia is working alongside industry leaders to shape specifications and designs that can be widely adopted across the entire data center,” said Nvidia founder and CEO Jensen Huang. “By advancing open standards, we’re helping organizations worldwide take advantage of the full potential of accelerated computing and create the AI factories of the future.”

Nvidia's contribution to OCP represents an important trend in advancing open hardware for AI and HPC. By sharing elements of its Blackwell platform, Nvidia is enabling broader access to its technology, which could enhance interoperability with other open systems. This contribution may also help improve data center efficiency by making its energy-efficient, AI-optimized architecture available in open designs. The contribution also supports the ongoing growth of the AI and HPC ecosystem, offering developers and organizations more options to leverage advanced computing technologies for AI applications in scientific research and large-scale computing.

A compute tray in the GB200 NVL72 rack. Source: Nvidia

As AI models grow in size and complexity, particularly with the advent of multi-trillion parameter models, the need for more powerful and scalable computing infrastructure becomes critical. Democratizing access to AI is not just about making software more available but also about ensuring that the hardware required to train and deploy these models is within reach for a wider range of organizations. Scientific AI, in particular, demands robust computing infrastructure capable of handling vast datasets and sophisticated models, pushing the limits of traditional architectures. Contributions like Nvidia’s to OCP help address this gap by fostering open, scalable solutions that make advanced hardware more accessible to enable more institutions to participate in AI for scientific discovery.

At the OCP Global Summit, Nvidia also announced that Blackwell is now in full production. A recent example of the promising use of the GB200 NVL72 platform is in Taiwan, where Nvidia and Taiwanese electronics manufacturer Foxconn are building what they call the island's largest supercomputing project, the Hon Hai Kaohsiung Super Computing Center. The project will be built around the Blackwell architecture and will include 64 GB200 NVL72 racks and 4,608 Tensor Core GPUs.
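The quoted GPU count lines up with the rack composition described earlier; a one-line consistency check:

```python
# Consistency check: 64 GB200 NVL72 racks, each with 72 Blackwell GPUs
racks, gpus_per_rack = 64, 72
assert racks * gpus_per_rack == 4_608
print(racks * gpus_per_rack)  # 4608
```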

Foxconn plans to use the supercomputer, housed in Kaohsiung, Taiwan, to power breakthroughs in cancer research, LLM development, and smart city innovations, according to Nvidia, with full deployment expected by 2026.

To learn more about NVIDIA’s GB200 NVL72 OCP contribution, check out the OCP specification documentation here.
