AWS Machine Learning Blog
AWS AI infrastructure with NVIDIA Blackwell: Two powerful compute solutions for the next frontier of AI

AWS has launched P6e-GB200 UltraServers and P6-B200 instances built on the NVIDIA Blackwell architecture to meet growing AI compute demands. P6e-GB200 UltraServers are designed for training and deploying the largest, most complex AI models, while P6-B200 instances suit a broad range of AI and high-performance computing workloads. Both offerings are optimized for security, reliability, performance, and infrastructure efficiency, and they are available through multiple deployment paths, including Amazon SageMaker HyperPod, Amazon EKS, and NVIDIA DGX Cloud on AWS, to help users accelerate AI innovation.

🚀 P6e-GB200 UltraServers are AWS's most powerful GPU offering to date. Built on the NVIDIA GB200 NVL72 architecture, they deliver up to 360 petaflops of FP8 compute and 13.4 TB of GPU memory, suiting compute- and memory-intensive AI workloads such as training ultra-large-scale models.

💡 P6-B200 instances are an upgrade over P5en instances, targeting medium-to-large-scale training and inference workloads. Each instance provides 8 NVIDIA Blackwell GPUs with 1.4 TB of GPU memory and up to 3.2 Tbps of EFAv4 networking, and retains an x86 environment so users can easily migrate existing GPU workloads.

🛡️ AWS builds strong security and stability into the infrastructure layer: both P6e-GB200 and P6-B200 use the sixth-generation Nitro System to keep users' AI workloads and data secure. The Nitro System also allows firmware updates and optimizations without interrupting the system, improving production efficiency.

⚡️ Through third-generation EC2 UltraClusters and EFAv4 networking, AWS achieves performance and reliability at massive scale. EFAv4 delivers faster collective communications in distributed training, while the liquid cooling of P6e-GB200 UltraServers raises compute density and system performance.

🛠️ AWS offers multiple deployment options, such as Amazon SageMaker HyperPod and Amazon EKS, so users can choose based on their own needs. SageMaker HyperPod provides managed infrastructure that simplifies GPU cluster management; Amazon EKS supports large-scale AI workloads with automated provisioning and lifecycle management.

Imagine a system that can explore multiple approaches to complex problems, drawing on its understanding of vast amounts of data, from scientific datasets to source code to business documents, and reasoning through the possibilities in real time. This lightning-fast reasoning isn’t waiting on the horizon. It’s happening today in our customers’ AI production environments. The scale of the AI systems that our customers are building today—across drug discovery, enterprise search, software development, and more—is truly remarkable. And there’s much more ahead.

To accelerate innovation across emerging generative AI developments such as reasoning models and agentic AI systems, we’re excited to announce general availability of P6e-GB200 UltraServers, accelerated by NVIDIA Grace Blackwell Superchips. P6e-GB200 UltraServers are designed for training and deploying the largest, most sophisticated AI models. Earlier this year, we launched P6-B200 instances, accelerated by NVIDIA Blackwell GPUs, for diverse AI and high-performance computing workloads.

In this post, we share how these powerful compute solutions build on everything we’ve learned about delivering secure, reliable GPU infrastructure at a massive scale, so that customers can confidently push the boundaries of AI.

Meeting the expanding compute demands of AI workloads

P6e-GB200 UltraServers represent our most powerful GPU offering to date, featuring up to 72 NVIDIA Blackwell GPUs interconnected using fifth-generation NVIDIA NVLink—all functioning as a single compute unit. Each UltraServer delivers a massive 360 petaflops of dense FP8 compute and 13.4 TB of total high-bandwidth GPU memory (HBM3e)—over 20 times the compute and over 11 times the memory in a single NVLink domain compared to P5en instances. P6e-GB200 UltraServers support up to 28.8 Tbps aggregate bandwidth of fourth-generation Elastic Fabric Adapter (EFAv4) networking.

P6-B200 instances are a versatile option for a broad range of AI use cases. Each instance provides 8 NVIDIA Blackwell GPUs interconnected using NVLink with 1.4 TB of high-bandwidth GPU memory, up to 3.2 Tbps of EFAv4 networking, and fifth-generation Intel Xeon Scalable processors. P6-B200 instances offer up to 2.25 times the GPU TFLOPs, 1.27 times the GPU memory size, and 1.6 times the GPU memory bandwidth compared to P5en instances.
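As a rough sanity check on those NVLink-domain ratios, here is a short Python snippet. The P5en baseline figures are assumptions drawn from published NVIDIA H200 specifications (roughly 1.98 petaflops dense FP8 and 141 GB of HBM3e per GPU, 8 GPUs per instance), not from this post.

```python
# Sanity check of the stated P6e-GB200 vs. P5en per-NVLink-domain ratios.
# Baseline numbers below are assumed from public H200 specs.
p5en_fp8_pf = 8 * 1.979        # ~15.8 petaflops dense FP8 (8 GPUs)
p5en_mem_tb = 8 * 141 / 1000   # ~1.13 TB HBM3e (8 GPUs)

print(f"compute ratio: {360 / p5en_fp8_pf:.1f}x")   # ~22.7x -> "over 20 times"
print(f"memory ratio:  {13.4 / p5en_mem_tb:.1f}x")  # ~11.9x -> "over 11 times"
```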

How do you choose between P6e-GB200 and P6-B200? This choice comes down to your specific workload requirements and architectural needs:

- Choose P6e-GB200 UltraServers for the most compute- and memory-intensive AI workloads, such as training the largest, most sophisticated models, where a single large NVLink domain and maximum GPU memory matter most.
- Choose P6-B200 instances for medium-to-large-scale training and inference workloads, or when you want to migrate existing GPU workloads to a familiar x86 environment with minimal changes.

Innovation built on AWS core strengths

Bringing NVIDIA Blackwell to AWS isn’t about a single breakthrough—it’s about continuous innovation across multiple layers of infrastructure. By building on years of learning and innovation across compute, networking, operations, and managed services, we’ve brought NVIDIA Blackwell’s full capabilities with the reliability and performance customers expect from AWS.

Robust instance security and stability

When customers tell me why they choose to run their GPU workloads on AWS, one crucial point comes up consistently: they highly value our focus on instance security and stability in the cloud. The specialized hardware, software, and firmware of the AWS Nitro System are designed to enforce restrictions so that nobody, including anyone in AWS, can access your sensitive AI workloads and data. Beyond security, the Nitro System fundamentally changes how we maintain and optimize infrastructure. The Nitro System, which handles networking, storage, and other I/O functions, makes it possible to deploy firmware updates, bug fixes, and optimizations while it remains operational. This ability to update without system downtime, which we call live update, is crucial in today’s AI landscape, where any interruption significantly impacts production timelines. P6e-GB200 and P6-B200 both feature the sixth generation of the Nitro System, but these security and stability benefits aren’t new—our innovative Nitro architecture has been protecting and optimizing Amazon Elastic Compute Cloud (Amazon EC2) workloads since 2017.

Reliable performance at massive scale

In AI infrastructure, the challenge isn’t just reaching massive scale—it’s delivering consistent performance and reliability at that scale. We’ve deployed P6e-GB200 UltraServers in third-generation EC2 UltraClusters, which creates a single fabric that can encompass our largest data centers. Third-generation UltraClusters cut power consumption by up to 40% and reduce cabling requirements by more than 80%—not only improving efficiency, but also significantly reducing potential points of failure.

To deliver consistent performance at this massive scale, we use Elastic Fabric Adapter (EFA) with its Scalable Reliable Datagram protocol, which intelligently routes traffic across multiple network paths to maintain smooth operation even during congestion or failures. We’ve continuously improved EFA’s performance across four generations. P6e-GB200 and P6-B200 instances with EFAv4 show up to 18% faster collective communications in distributed training compared to P5en instances that use EFAv3.
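As a minimal sketch of what this looks like from a training job's point of view, the snippet below initializes PyTorch distributed training over EFA. It assumes the EFA driver and the AWS OFI NCCL plugin are installed on the instance (both ship with AWS Deep Learning AMIs); exact environment variables can vary by software stack version.

```python
import os
import torch.distributed as dist

# Route NCCL traffic over the EFA libfabric provider.
os.environ.setdefault("FI_PROVIDER", "efa")
# Enable GPUDirect RDMA where the platform supports it.
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")
# Log which transport NCCL actually selects, to confirm EFA is in use.
os.environ.setdefault("NCCL_DEBUG", "INFO")

# Rank and world size are supplied by the launcher (e.g., torchrun or Slurm).
dist.init_process_group(backend="nccl")
```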

Infrastructure efficiency

Whereas P6-B200 instances use our proven air-cooling infrastructure, P6e-GB200 UltraServers use liquid cooling, which enables higher compute density in large NVLink domain architectures, delivering higher system performance. P6e-GB200 UltraServers are liquid cooled with novel mechanical cooling solutions that provide configurable liquid-to-chip cooling in both new and existing data centers, so we can support liquid-cooled accelerators and air-cooled network and storage infrastructure in the same facility. With this flexible cooling design, we can deliver maximum performance and efficiency at the lowest cost.

Getting started with NVIDIA Blackwell on AWS

We’ve made it simple to get started with P6e-GB200 UltraServers and P6-B200 instances through multiple deployment paths, so you can quickly begin using Blackwell GPUs while maintaining the operational model that works best for your organization.

Amazon SageMaker HyperPod

If you’re accelerating your AI development and want to spend less time managing infrastructure and cluster operations, that’s exactly where Amazon SageMaker HyperPod excels. It provides managed, resilient infrastructure that automatically handles provisioning and management of large GPU clusters. We keep enhancing SageMaker HyperPod, adding innovations like flexible training plans to help you gain predictable training timelines and run training workloads within your budget requirements.

SageMaker HyperPod will support both P6e-GB200 UltraServers and P6-B200 instances, with optimizations to maximize performance by keeping workloads within the same NVLink domain. We’re also building in a comprehensive, multi-layered recovery system: SageMaker HyperPod will automatically replace faulty instances with preconfigured spares in the same NVLink domain. Built-in dashboards will give you visibility into everything from GPU utilization and memory usage to workload metrics and UltraServer health status.
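For orientation, here is a minimal sketch of creating a HyperPod cluster with the SageMaker CreateCluster API via boto3. The instance type string is an assumption for illustration (check the SageMaker documentation for the exact P6 type names once available), and the role ARN and S3 URI are placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Create a HyperPod cluster with one GPU instance group. The lifecycle
# script in S3 runs on each node at creation time.
sagemaker.create_cluster(
    ClusterName="blackwell-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p6-b200.48xlarge",  # assumed name, for illustration
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/hyperpod-lifecycle/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
        }
    ],
)
```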

Amazon EKS

For large-scale AI workloads, if you prefer to manage your infrastructure using Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS) is often the control plane of choice. We continue to drive innovations in Amazon EKS with capabilities like Amazon EKS Hybrid Nodes, which enable you to manage both on-premises and EC2 GPUs in a single cluster—delivering flexibility for AI workloads.

Amazon EKS will support both P6e-GB200 UltraServers and P6-B200 instances with automated provisioning and lifecycle management through managed node groups. For P6e-GB200 UltraServers, we’re building in topology awareness that understands the GB200 NVL72 architecture, automatically labeling nodes with their UltraServer ID and network topology information to enable optimal workload placement. You will be able to span node groups across multiple UltraServers or dedicate them to individual UltraServers, giving you flexibility in organizing your training infrastructure. Amazon EKS monitors GPU and accelerator errors and relays them to the Kubernetes control plane for optional remediation.
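To illustrate how that topology labeling could be consumed, here is a hypothetical sketch using the Kubernetes Python client to group nodes by UltraServer membership so a job can be pinned to a single NVLink domain. The label key below is an assumption for illustration; consult the Amazon EKS documentation for the actual topology labels once they ship.

```python
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
nodes = client.CoreV1Api().list_node().items

# Group nodes by their UltraServer ID label (key name is hypothetical).
by_ultraserver = defaultdict(list)
for node in nodes:
    us_id = node.metadata.labels.get("eks.amazonaws.com/ultraserver-id", "unlabeled")
    by_ultraserver[us_id].append(node.metadata.name)

for us_id, names in sorted(by_ultraserver.items()):
    print(f"{us_id}: {len(names)} nodes")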

NVIDIA DGX Cloud on AWS

P6e-GB200 UltraServers will also be available through NVIDIA DGX Cloud. DGX Cloud is a unified AI platform optimized at every layer with multi-node AI training and inference capabilities and NVIDIA’s complete AI software stack. You benefit from NVIDIA’s latest optimizations, benchmarking recipes, and technical expertise to improve efficiency and performance. It offers flexible term lengths along with comprehensive NVIDIA expert support and services to help you accelerate your AI initiatives.

This launch announcement is an important milestone, and it’s just the beginning. As AI capabilities evolve rapidly, you need infrastructure built not just for today’s demands but for all the possibilities that lie ahead. With innovations across compute, networking, operations, and managed services, P6e-GB200 UltraServers and P6-B200 instances are ready to enable these possibilities. We can’t wait to see what you will build with them.

About the author

David Brown is the Vice President of AWS Compute and Machine Learning (ML) Services. In this role he is responsible for building all AWS Compute and ML services, including Amazon EC2, Amazon Container Services, AWS Lambda, Amazon Bedrock and Amazon SageMaker. These services are used by all AWS customers but also underpin most of AWS’s internal Amazon applications. He also leads newer solutions, such as AWS Outposts, that bring AWS services into customers’ private data centers.

David joined AWS in 2007 as a Software Development Engineer based in Cape Town, South Africa, where he worked on the early development of Amazon EC2. In 2012, he relocated to Seattle and continued to work in the broader Amazon EC2 organization. Over the last 11 years, he has taken on larger leadership roles as more of the AWS compute and ML products have become part of his organization.

Prior to joining Amazon, David worked as a Software Developer at a financial industry startup. He holds a Computer Science & Economics degree from the Nelson Mandela University in Port Elizabeth, South Africa.
