NVIDIA Blog, March 20, 13:14
New NVIDIA Software for Blackwell Infrastructure Runs AI Factories at Light Speed

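NVIDIA has released NVIDIA Mission Control, a unified operations and orchestration software platform designed to automate the complex management of AI data centers and workloads and to speed up AI factory operations. It enhances every aspect of AI factory operations, from configuring deployments to validating infrastructure to running developer workloads, helping enterprises get frontier models up and running faster. Mission Control makes it easy to transition NVIDIA Blackwell-based systems from pretraining to post-training and enables fast, efficient test-time scaling. It also includes NVIDIA Run:ai technology, which streamlines operations and job orchestration for development, training and inference, boosting infrastructure utilization by up to 5x.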

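⚙️ NVIDIA Mission Control is a unified operations and orchestration software platform designed to automate the complex management of AI data centers and workloads, helping enterprises deploy and run AI models faster.

🚀 Mission Control makes it easy to transition NVIDIA Blackwell-based systems from pretraining to post-training and supports test-time scaling. It also lets enterprises switch easily between training and inference workloads on Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems, dynamically reallocating cluster resources to match shifting priorities.

💡 Mission Control integrates NVIDIA Run:ai technology, which streamlines operations and job orchestration for development, training and inference, boosting infrastructure utilization by up to 5x, improving AI training and inference efficiency and keeping AI applications running continuously.

🛡️ Mission Control's autonomous recovery capabilities, supported by rapid checkpointing and automated tiered restart, deliver up to 10x faster job recovery than traditional methods that rely on manual intervention, maximizing developer productivity and infrastructure resilience.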

The industrial age was fueled by steam. The digital age brought a shift through software. Now, the AI age is marked by the development of generative AI, agentic AI and AI reasoning, which enables models to process more data to learn and reason to solve complex problems.

Just as industrial factories transform raw materials into goods, modern businesses require AI factories to quickly transform data into insights that are scalable, accurate and reliable.

Orchestrating this new infrastructure is far more complex than it was to build steam-powered factories. State-of-the-art models demand supercomputing-scale resources. Any downtime risks derailing weeks of progress and reducing GPU utilization.

To enable enterprises and developers to manage and run AI factories at light speed, NVIDIA today announced NVIDIA Mission Control at the NVIDIA GTC global AI conference. Mission Control is the only unified operations and orchestration software platform that automates the complex management of AI data centers and workloads.

NVIDIA Mission Control enhances every aspect of AI factory operations. From configuring deployments to validating infrastructure to operating developer workloads, its capabilities help enterprises get frontier models up and running faster.

It is designed to easily transition NVIDIA Blackwell-based systems from pretraining to post-training — and now test-time scaling — with speed and efficiency. The software enables enterprises to easily pivot between training and inference workloads on their Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems, dynamically reallocating cluster resources to match shifting priorities.

In addition, Mission Control includes NVIDIA Run:ai technology to streamline operations and job orchestration for development, training and inference, boosting infrastructure utilization by up to 5x.

Mission Control’s autonomous recovery capabilities, supported by rapid checkpointing and automated tiered restart features, can deliver up to 10x faster job recovery compared with traditional methods that rely on manual intervention, boosting AI training and inference efficiency to keep AI applications in operation.

Built on decades of NVIDIA supercomputing expertise, Mission Control lets enterprises simply run models by minimizing time spent managing AI infrastructure. It automates the lifecycle of AI factory infrastructure for all NVIDIA Blackwell-based NVIDIA DGX systems and NVIDIA Grace Blackwell systems from Dell Technologies, Hewlett Packard Enterprise (HPE), Lenovo and Supermicro to make advanced AI infrastructure more accessible to the world’s industries.

Enterprises can further simplify and speed deployments of NVIDIA DGX GB300 and DGX B300 systems by using Mission Control with the NVIDIA Instant AI Factory service preconfigured in Equinix AI-ready data centers across 45 markets globally.

Advanced Software Provides Enterprises Uninterrupted Infrastructure Oversight  

Mission Control automates end-to-end infrastructure management — including provisioning, monitoring and error diagnosis — to deliver uninterrupted operations. Plus, it continuously monitors every layer of the application and infrastructure stack to predict and identify sources of downtime and inefficiency — saving time, energy and costs.

Additional NVIDIA Mission Control software benefits include:

Leading System Makers Bring NVIDIA Mission Control to Grace Blackwell Servers  

Leading system makers plan to offer NVIDIA GB200 NVL72 and GB300 NVL72 systems with NVIDIA Mission Control.

Dell plans to offer NVIDIA Mission Control software as part of the Dell AI Factory with NVIDIA.

“The AI industrial revolution demands efficient infrastructure that adapts as fast as business evolves, and the Dell AI Factory with NVIDIA delivers with comprehensive compute, networking, storage and support,” said Ihab Tarazi, chief technology officer and senior vice president at Dell Technologies. “Pairing NVIDIA Mission Control software and Dell PowerEdge XE9712 and XE9680 servers helps enterprises scale models effortlessly to meet the demands of both training and inference, turning data into actionable insights faster than ever before.”

HPE will offer the NVIDIA GB200 NVL72 by HPE and GB300 NVL72 by HPE systems with NVIDIA Mission Control software.

“We are helping service providers and cutting-edge enterprises to rapidly deploy, scale, and optimize complex AI clusters capable of training trillion parameter models,” said Trish Damkroger, senior vice president and general manager, HPC & AI Infrastructure Solutions at HPE. “As part of our collaboration with NVIDIA, we will deliver NVIDIA Grace Blackwell rack-scale systems and Mission Control software with HPE’s global services and direct liquid cooling expertise to power the new AI era.”

Lenovo plans to update its Lenovo Hybrid AI Advantage with NVIDIA systems to include NVIDIA Mission Control software.

“Bringing NVIDIA Mission Control software to Lenovo Hybrid AI Advantage with NVIDIA systems empowers enterprises to navigate the demands of generative and agentic AI workloads with unmatched agility,” said Brian Connors, worldwide vice president and general manager of enterprise and SMB segment and AI, infrastructure solutions group, at Lenovo. “By automating infrastructure orchestration and enabling seamless transitions between training and inference workloads, Lenovo and NVIDIA are helping customers scale AI innovation at the speed of business.”

Supermicro plans to incorporate NVIDIA Mission Control software into its SuperCluster systems.

“Supermicro is proud to team with NVIDIA on a Grace Blackwell NVL72 system that is fully supported by NVIDIA Mission Control software,” said Cenly Chen, chief growth officer at Supermicro. “Running on Supermicro’s AI SuperCluster systems with NVIDIA Grace Blackwell, NVIDIA Mission Control software provides customers with a seamless management software suite to maximize performance on both current NVIDIA GB200 NVL72 systems and future platforms such as NVIDIA GB300 NVL72.”

Base Command Manager Offers Free Kickstart for AI Cluster Management

To help enterprises with infrastructure management, NVIDIA Base Command Manager software is expected to soon be available for free for up to eight accelerators per system, for any cluster size, with the option to purchase NVIDIA Enterprise Support separately.

Availability

NVIDIA Mission Control for NVIDIA DGX GB200 and DGX B200 systems is available now. NVIDIA GB200 NVL72 systems with Mission Control are expected to soon be available from Dell, HPE, Lenovo and Supermicro.

NVIDIA Mission Control is expected to become available for the latest NVIDIA DGX GB300 and DGX B300 systems, as well as GB300 NVL72 systems from leading global providers, later this year.

See notice regarding software product information.
