AiThority, September 26, 2024
Volumez Dominates MLCommons MLPerf Storage Benchmark, Redefining AI/ML Training Performance

Volumez achieved breakthrough results in the MLCommons MLPerf Storage 1.0 AI/ML training benchmark, demonstrating its leadership in data infrastructure performance and GPU utilization.

🎯 Volumez, a pioneering leader in Data Infrastructure as a Service (DIaaS), delivered standout results in the latest MLCommons MLPerf Storage 1.0 AI/ML training benchmark. In the MLPerf Storage 1.0 Open Division benchmark on the storage-intensive 3D-UNet model, its solution achieved extraordinary linear scaling, reaching 1.079 TB/sec throughput, 92.21% GPU utilization, and 9.9 million IOPS, setting a new industry standard for AI/ML training performance.

💪 Volumez deployed 137 application nodes and 128 media nodes. Its DIaaS solution adds no extra layers to the Linux data path and instead uses cloud-aware intelligence to optimize the infrastructure, delivering unprecedented speed and efficiency for the 3D-UNet workload and transforming the economics and scalability of AI/ML training environments.

🌟 As an active member of the MLCommons community, Volumez submitted a second benchmark run. By adjusting the weight-exchange frequency to reduce network overhead, it achieved 1.140 TB/sec throughput and 97.82% GPU utilization, a significant improvement over its first submission.

🚀 Volumez has fundamentally redefined the standards for AI/ML training infrastructure. By eliminating traditional bottlenecks and delivering industry-leading performance, it enables organizations to accelerate their AI/ML initiatives and gain a competitive edge in a rapidly evolving market.

Setting new standards for data infrastructure performance and GPU utilization in AI/ML training workloads

Volumez, a pioneering leader in Data Infrastructure as a Service (DIaaS), announced groundbreaking performance results from the latest MLCommons MLPerf Storage 1.0 AI/ML Training Benchmark. These results highlight Volumez's commitment to delivering next-generation, cloud-aware data infrastructure for AI/ML workloads, pushing the boundaries of performance, scalability, and efficiency. Volumez also announced today its unique capabilities as the leading DIaaS company; see the accompanying press release and blog for details.


AI/ML training remains one of the most demanding workloads in modern data infrastructure. Maximizing throughput to drive optimal GPU utilization is critical for accelerating model training, improving accuracy, and reducing operational costs. In the MLPerf Storage 1.0 Open Division benchmark, which focused on the storage-intensive 3D-UNet model, Volumez DIaaS for AI/ML demonstrated extraordinary linear scaling. The solution achieved a remarkable 1.079 TB/sec throughput with 92.21% GPU utilization and 9.9M IOPS¹ on AWS, establishing a new industry standard for AI/ML training performance.
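
As a rough illustration (not part of the verified results), the quoted throughput and IOPS figures together imply an average I/O size of roughly 109 KB:

```python
# Back-of-the-envelope arithmetic from the quoted figures (illustrative only).
throughput_tb_per_s = 1.079   # verified throughput, first Open Division run
iops = 9.9e6                  # verified I/O operations per second

avg_io_bytes = throughput_tb_per_s * 1e12 / iops   # ~108,990 bytes per I/O
print(f"Average I/O size: ~{avg_io_bytes / 1e3:.0f} KB")   # -> ~109 KB
```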

Benchmark Overview and Industry Impact

Volumez deployed 137 application nodes (c5n.18xlarge), each simulating 3 H100 GPUs, streaming data from 128 media nodes (i3en.24xlarge) equipped with 60TB of storage per node. Unlike traditional architectures, Volumez's DIaaS solution introduces no additional layers to the Linux data path and leverages cloud-aware intelligence to optimize infrastructure for the 3D-UNet workload. This unique approach delivered a level of speed and efficiency previously unseen in the benchmark, transforming both the economics and scalability of AI/ML training environments.
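
To put the topology in perspective, the short sketch below (illustrative only, not part of the verified submission) derives the per-simulated-GPU bandwidth and raw media capacity implied by the node counts and throughput quoted above.

```python
# Derived figures from the deployment described above (illustrative only).
app_nodes, simulated_gpus_per_node = 137, 3
media_nodes, tb_per_media_node = 128, 60
throughput_tb_per_s = 1.079

simulated_gpus = app_nodes * simulated_gpus_per_node             # 411 simulated H100s
per_gpu_gb_per_s = throughput_tb_per_s * 1000 / simulated_gpus   # ~2.6 GB/s per simulated GPU
raw_capacity_pb = media_nodes * tb_per_media_node / 1000         # 7.68 PB of raw media storage

print(f"{simulated_gpus} simulated GPUs at ~{per_gpu_gb_per_s:.1f} GB/s each; "
      f"{raw_capacity_pb:.2f} PB raw media capacity")
```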

“These results mark a significant achievement for Volumez,” said John Blumenthal, chief product and business officer at Volumez. “The performance and scalability achieved during testing are unprecedented and highlight the critical role Volumez plays in the AI/ML ecosystem, providing solutions that meet the growing demands of AI/ML workloads on cloud infrastructure – to maximize the yield on our industry’s scarcest resource, GPUs.”

MLPerf Benchmark Achievements include:

- 1.079 TB/sec throughput, 92.21% GPU utilization, and 9.9M IOPS¹ on the storage-intensive 3D-UNet workload in the Open Division (first submission, on AWS)
- 1.140 TB/sec throughput and 97.82% GPU utilization¹ in a second Open Division submission with a reduced weight-exchange frequency
- Linear scaling across 137 application nodes, each simulating 3 H100 GPUs, fed by 128 media nodes

“We are excited to have Volumez participate in their first MLPerf Storage benchmark. The importance that storage plays in the AI technology stack and the innovations happening in this space are critical to the successful deployment of ML training systems. MLPerf benchmark results are important measures for storage consumers to analyze as they seek to procure and tune ML systems to maximize their utilization – and ultimately their return on investment,” said David Kanter, head of MLPerf at MLCommons.


Industry Perspective

According to Gartner, “From a feature and functionality perspective, storage for GenAI is not too different from storage for any other analytics applications. The exception is that the performance capabilities required to feed the compute farm become even more relevant for GenAI and can be amplified at a larger scale. The training stage of GenAI workflow can be very demanding from a performance point of view, depending on the model size. Not only must the storage layer support high throughput to feed the CPU or GPU farm, but it also must have the right performance to support model checkpoint and recovery fast enough to keep the computer farm running.”² Solutions like Volumez DIaaS are essential for enabling the next generation of AI infrastructure that balances performance, scalability, and cost.

Innovative Results in Real-World Environments

As an active member of the MLCommons community, Volumez took an additional step by submitting a second benchmark run in the Open Division. This submission focused on addressing real-world trade-offs faced by ML engineers and MLOps teams – optimizing throughput and utilization without sacrificing model accuracy. Specifically, we modified the benchmark’s weight exchange frequency, a common practice in high-scale environments. This adjustment reduces network overhead to achieve increased throughput and GPU utilization. We delivered an impressive 1.140 TB/sec throughput and 97.82% GPU utilization¹, a 5.43% improvement over our first submission.
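
The article does not describe how the weight-exchange frequency was adjusted inside the benchmark. As a loose illustration of the general data-parallel technique it alludes to (exchanging gradients/weights every few steps rather than every step to cut network overhead), here is a minimal, hypothetical PyTorch-style sketch using DistributedDataParallel's no_sync(); the SYNC_EVERY value and helper names are assumptions, not Volumez's implementation.

```python
# Hypothetical sketch of reducing weight/gradient exchange frequency in
# data-parallel training. Assumes torch.distributed is already initialized and
# `ddp_model` wraps the network in torch.nn.parallel.DistributedDataParallel.
import contextlib

SYNC_EVERY = 4  # hypothetical: all-reduce gradients every 4 steps instead of every step

def train_epoch(ddp_model, loader, optimizer, loss_fn):
    optimizer.zero_grad()
    for step, (x, y) in enumerate(loader):
        sync_now = (step + 1) % SYNC_EVERY == 0
        # DDP's no_sync() skips the gradient all-reduce, so gradients only
        # accumulate locally on the steps where we do not synchronize.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = loss_fn(ddp_model(x), y)
            loss.backward()
        if sync_now:
            optimizer.step()       # weights are exchanged/updated only here
            optimizer.zero_grad()
```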

For a deeper dive into the breakthrough architecture that powers Volumez's DIaaS for AI/ML, explore our white paper. The document provides comprehensive insights into how our cloud-aware control plane drives transformative results for AI/ML workloads at scale.

Revolutionizing AI/ML Infrastructure

Volumez has fundamentally redefined the standards for AI/ML training infrastructure. By eliminating traditional bottlenecks and delivering industry-leading performance, the Volumez DIaaS platform empowers organizations to accelerate their AI/ML initiatives and gain a competitive edge in a rapidly evolving market.

¹ Results verified by MLCommons Association. Retrieved from https://mlcommons.org/benchmarks/storage/. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

² Gartner, 2024 Strategic Roadmap for Storage, by Jeff Vogel, Julia Palmer, Michael Hoeck, Chandra Mukhyala, 23 February 2024. GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

