MarkTechPost@AI, September 3, 2024
Why GPU Utilization Falls Short: Understanding Streaming Multiprocessor (SM) Efficiency for Better LLM Performance

This article examines the limitations of GPU utilization as a metric for large language model (LLM) training efficiency and highlights the importance of streaming multiprocessor (SM) efficiency. The researchers found that the conventional GPU utilization metric does not accurately reflect real computational efficiency: a GPU can report 100% utilization while only reading and writing memory, performing no actual computation. To assess GPU performance more accurately, researchers adopted Model FLOPS Utilization (MFU) as an alternative metric; however, MFU is relatively complex to compute and depends on model parameters and the framework. Through a case study, the researchers found that some LLM training runs reporting 100% GPU utilization achieved only 20% MFU, far below the typical 35-45% range for most LLM training.

🤔 **Limitations of GPU utilization**: GPU utilization metrics (from nvidia-smi or integrated observability tools) do not accurately reflect real computational efficiency. A GPU can report 100% utilization while only reading and writing memory, with no actual computation taking place.

📈 **Model FLOPS Utilization (MFU)**: MFU measures the ratio of observed throughput to the theoretical maximum throughput of a system running at peak FLOPS, giving a more accurate picture of GPU performance. However, MFU is relatively complex to compute and depends on model parameters and the framework.

🚀 **Streaming Multiprocessor (SM) efficiency**: SM efficiency measures the percentage of an NVIDIA GPU's SMs that are active over a given time interval, reflecting how effectively CUDA kernels use the available SMs. The researchers found that even when GPU utilization is high, SM efficiency can be low, indicating hidden inefficiencies in model execution.

💡 **Optimization strategy**: The researchers recommend optimizing LLM training by fusing layers within the transformer block. This means replacing PyTorch-native layer definitions with GPU kernels implemented in CUDA or Triton that combine multiple layers into a single kernel. Fused kernels improve performance and reduce memory usage.

📊 **Performance gains**: By using fused kernels and fine-tuning model parallelism, the researchers achieved a 4x speedup in training time and raised Model FLOPS Utilization (MFU) from 20% to 38%.

Large Language Models (LLMs) have gained significant prominence in recent years, driving the need for efficient GPU utilization in machine learning tasks. However, researchers face a critical challenge in accurately assessing GPU performance. The commonly used metric, GPU Utilization, accessed through nvidia-smi or integrated observability tools, has proven to be an unreliable indicator of actual computational efficiency. Surprisingly, 100% GPU utilization can be achieved merely by reading and writing to memory without performing any computations. This revelation has sparked a reevaluation of performance metrics and methodologies in the field of machine learning, prompting researchers to seek more accurate ways to measure and optimize GPU performance for LLM training and inference tasks.

Researchers have attempted to address the limitations of GPU Utilization by introducing alternative metrics. One widely known approach is Model FLOPS (floating-point operations per second) utilization, or MFU, introduced in Google’s PaLM paper. MFU measures the ratio of observed throughput to the theoretical maximum throughput of a system operating at peak FLOPS, providing a more accurate representation of GPU performance. This metric offers insight into how efficiently a workload utilizes a GPU’s computational capabilities. However, MFU has a drawback: it is complex to calculate, because it depends on both the model’s parameters and the training framework. Despite this limitation, MFU has revealed significant discrepancies between GPU utilization and actual computational efficiency. For instance, some LLM training runs achieving 100% GPU utilization were found to have only 20% MFU, far below the typical 35-45% range for most LLM training, highlighting the need for a deeper understanding of GPU performance metrics.
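To see why MFU depends on model parameters, here is a minimal sketch of one common way to estimate it from token throughput, using the rough approximation of 6 FLOPs per parameter per training token (forward plus backward). The model size, throughput, and peak-FLOPS figures below are assumed example values, not numbers from the article.

```python
def model_flops_utilization(tokens_per_sec: float,
                            n_params: float,
                            peak_flops: float) -> float:
    """Estimate MFU: achieved FLOPS / theoretical peak FLOPS.

    Uses the common ~6 * n_params FLOPs-per-token approximation for
    training (forward + backward), so the result depends on the model's
    parameter count -- one reason MFU is awkward to track continuously.
    """
    achieved_flops = tokens_per_sec * 6 * n_params
    return achieved_flops / peak_flops

# Assumed example values: a 7B-parameter model, 3,000 tokens/s per GPU,
# and ~989 TFLOPS dense BF16 peak for an H100 SXM (check your spec sheet).
mfu = model_flops_utilization(tokens_per_sec=3_000,
                              n_params=7e9,
                              peak_flops=989e12)
print(f"MFU: {mfu:.1%}")  # ~12.7% for these made-up numbers
```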

Researchers at Trainy AI, a company specializing in GPU cluster management infrastructure, tackled the challenge of optimizing LLM training efficiency for a foundation model company. Their approach involved implementing a series of performance-tuning techniques commonly recommended for PyTorch. These optimizations included saturating the GPU by adjusting dataloader parameters, maximizing tensor core usage through mixed-precision training, employing fused optimizers from apex or DeepSpeed, and using instances and networking designed for training workloads. By applying these methods, Trainy achieved 100% GPU utilization and significant power draw, initially suggesting improved performance. However, to gain a more complete picture of actual computational efficiency, the team went a step further and calculated the Model FLOPS utilization (MFU) of the training workload, recognizing the limitations of relying solely on GPU utilization as a performance metric.
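The article does not include Trainy’s code; the following is a minimal sketch of the standard PyTorch knobs it lists: a dataloader tuned to keep the GPU fed, bf16 mixed precision to engage tensor cores, and a fused optimizer. The toy model and dataset are placeholders, not the actual workload.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model/data purely to make the sketch self-contained.
model = torch.nn.Linear(4096, 4096).cuda()
dataset = TensorDataset(torch.randn(10_000, 4096))

# Dataloader tuned to keep the GPU saturated.
loader = DataLoader(dataset, batch_size=256, num_workers=8,
                    pin_memory=True, prefetch_factor=4)

# Fused optimizer: one CUDA kernel per step instead of many small ones.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

for (x,) in loader:
    x = x.to("cuda", non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    # bf16 autocast keeps matmuls on tensor cores (no GradScaler needed).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(x).square().mean()
    loss.backward()
    optimizer.step()
```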

GPU architecture is key to understanding the limitations of GPU utilization as a performance metric. GPUs consist of cores grouped under multiprocessing managers (SMs on NVIDIA GPUs, CUs on AMD). The GH100 GPU, for example, has 144 SMs, each managing multiple CUDA cores. NVIDIA’s definition of GPU utilization is vague, while Datadog’s NVML documentation provides more clarity. Either way, the metric can be misleading because it only indicates that the GPU is active, not how efficiently it is computing. When a CUDA kernel is launched, the SMs distribute work across their cores, but the utilization percentage says nothing about the intensity or effectiveness of those computations.
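A quick way to see this for yourself (a minimal sketch, not from the article): run a loop of pure device-to-device copies and watch nvidia-smi in another terminal. Utilization climbs to roughly 100% even though essentially no floating-point math is happening.

```python
import torch

# ~1 GiB of fp32 on the GPU.
x = torch.empty(1 << 28, device="cuda")
y = torch.empty_like(x)

# Pure memory traffic: no arithmetic worth mentioning, yet while this runs
# `nvidia-smi` will typically report ~100% "GPU-Util", because that metric
# only records that *some* kernel was executing during the sampling window.
for _ in range(10_000):
    y.copy_(x)
torch.cuda.synchronize()
```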

To further investigate performance bottlenecks, the researchers profiled the model’s training loop with PyTorch Profiler. This analysis revealed a critical insight: the Softmax kernel was registering high GPU utilization but low SM (Streaming Multiprocessor) efficiency. This discrepancy raised concerns, as a naive Softmax implementation is a well-known bottleneck for Large Language Models. The low SM efficiency indicated potential inefficiencies in the model’s execution despite high GPU utilization, in line with the limitations of relying solely on GPU utilization as a performance metric. To address such memory-bound operations, various kernel-fusion techniques such as FlashAttention have been developed. The profiling results emphasized the need for a more nuanced approach to optimizing LLM training, one that improves SM efficiency alongside GPU utilization.
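A hedged sketch of how such a profile can be captured with PyTorch Profiler follows; the model, optimizer, and step count are stand-ins, not the researchers’ setup.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

# Stand-in model and training step to make the sketch runnable.
model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(x):
    opt.zero_grad(set_to_none=True)
    model(x).sum().backward()
    opt.step()

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=torch.profiler.tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
) as prof:
    for _ in range(8):
        train_step(torch.randn(64, 1024, device="cuda"))
        prof.step()

# Per-kernel timings; a naive softmax dominating this table is the kind of
# signal the researchers describe.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```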

SM efficiency, also known as SM activity, is a crucial metric for NVIDIA GPUs that measures the percentage of active SMs in a given time interval. For instance, an NVIDIA H100 GPU contains 132 SMs, each managing 128 cores, totaling 16,896 cores. This metric provides insights into how effectively CUDA kernels utilize available SMs. A CUDA kernel running continuously for 10 seconds but using only 1 SM on an H100 would show 100% GPU utilization, but merely 0.7% SM efficiency. This discrepancy highlights the importance of looking beyond GPU utilization. By monitoring SM efficiency layer by layer, researchers can identify potential optimization opportunities and low-hanging fruits in LLM training, enabling more targeted performance improvements and a more accurate assessment of computational efficiency.

To optimize LLM training, the researchers focused on fusing layers within the transformer block. This approach replaces PyTorch-native layer definitions with GPU kernels implemented in CUDA or Triton that combine multiple layers into a single kernel. The optimization targets included Softmax (via Flash Attention), the MLP, and the fused dropout + layer-norm + residual-add operation. Such fused kernels, often available in libraries like Flash Attention, offer improved performance and reduced memory usage.
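The article does not show the replacement code. As an illustrative stand-in, the sketch below contrasts a naive attention (separate matmul and softmax kernels that materialize the full score matrix in HBM) with PyTorch’s built-in scaled_dot_product_attention, which dispatches to a FlashAttention-style fused kernel on supported GPUs.

```python
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # Materializes the full (seq x seq) score matrix, then runs a separate
    # softmax kernel: lots of HBM traffic and low SM efficiency at long
    # sequence lengths.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def fused_attention(q, k, v):
    # Fused kernel: scores, softmax, and the value product stay on-chip.
    return F.scaled_dot_product_attention(q, k, v)

# (batch, heads, seq, head_dim) -- example sizes only.
q = k = v = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.bfloat16)
out_naive = naive_attention(q, k, v)
out_fused = fused_attention(q, k, v)
print("max abs difference:", (out_naive - out_fused).abs().max().item())
```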

Implementation challenges primarily involved identifying appropriate layers for replacement, as torch.compile’s automatic optimizations were incompatible with newer distributed strategies like FSDP. Manual implementation of fused kernels was necessary due to these limitations.

The optimization efforts yielded significant improvements: a 4x speedup in training time and an increase in Model FLOPS Utilization (MFU) from 20% to 38%. These gains came from the fused kernels and from fine-tuning model parallelism to make effective use of the available 3.2 Tbps InfiniBand interconnect.

In this study, the researchers recommend tracking both SM efficiency and GPU utilization on GPU clusters to measure performance accurately. While GPU utilization indicates whether the machine is idle, SM efficiency shows how effectively the GPU is being used. Calculating MFU is informative but too cumbersome for continuous monitoring. NVIDIA’s Data Center GPU Manager (DCGM) tracks SM activity by default. Other metrics, such as SM occupancy, provide more detailed insight into each SM’s workload but are harder to interpret. For a deeper understanding, refer to the PyTorch Profiler blog, the DCGM documentation, and Nsight’s profiling guides.
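A minimal monitoring sketch along these lines (not from the article): read GPU utilization through NVML, the same source nvidia-smi uses, and stream SM activity/occupancy through the dcgmi CLI. The profiling field IDs used here (1002 for SM activity, 1003 for SM occupancy) follow DCGM’s documented profiling metrics; verify them against your DCGM version.

```python
import subprocess
import pynvml  # pip install nvidia-ml-py

# GPU utilization as nvidia-smi reports it (NVML).
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization (NVML): {util.gpu}%")
pynvml.nvmlShutdown()

# Stream SM activity (1002) and SM occupancy (1003) once per second via
# DCGM; runs until interrupted with Ctrl-C.
subprocess.run(["dcgmi", "dmon", "-e", "1002,1003", "-d", "1000"], check=True)
```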


Check out the Paper. All credit for this research goes to the researchers of this project.
