NVIDIA Developer · February 16
OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

NVIDIA has partnered with OpenAI to bring NVIDIA Blackwell architecture support to the Triton compiler, letting developers easily tap Blackwell's latest features. Blackwell delivers major gains in both raw compute and architectural innovation, and Triton has been optimized in two areas: matrix multiplication and new precision formats. By extending its MMA pipeline, Triton automatically exploits Blackwell's new Tensor Cores, achieving excellent performance on FP8 and FP16 GEMM operations. Flash attention also sees significant speedups on Blackwell, with FP16 attention up to 1.5x faster. In addition, Triton unlocks Blackwell's block-scaled floating point formats, offering LLM inference projects higher performance and precision.

🚀 **Matrix multiplication performance:** By extending its MMA pipeline, the Triton compiler automatically exploits the new Tensor Cores of the NVIDIA Blackwell architecture, significantly boosting FP8 and FP16 GEMM performance and achieving near-optimal results, comparable to library implementations, across several critical use cases.

💡 **Flash attention speedup:** Flash attention, a key building block of modern Transformer architectures, sees significant acceleration on NVIDIA Blackwell through the Triton compiler: FP16 attention is up to 1.5x faster than on the NVIDIA Hopper GPU architecture, with no changes required to existing Triton flash attention code.

✨ **New precision format support:** The NVIDIA Blackwell architecture introduces revolutionary block-scaled floating point formats, including the Open Compute Project's microscaling formats, which Triton now unlocks for Blackwell-powered hardware acceleration. MXFP8 GEMMs on Triton perform similarly to FP8 GEMMs while natively allowing scaling in the Tensor Core. MXFP4 offers a new point in the precision-performance trade-off, with twice the hardware-accelerated throughput of FP8 and MXFP8 GEMMs.
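As a rough illustration of the block-scaled idea behind the MX formats, the sketch below is a simplified NumPy emulation, not the real MXFP8 encoding: it uses a shared power-of-two scale per 32-element block and coarse grid rounding in place of true E4M3 elements, just to show how a per-block scale bounds the quantization error.

```python
import numpy as np

BLOCK = 32  # MX formats share one scale per 32 elements

def block_quantize(x, levels=256):
    """Quantize each 32-element block with a shared power-of-two scale.

    Simplified emulation: real MXFP8 stores E4M3/E5M2 elements plus an
    E8M0 (power-of-two) scale per block; here we round to a fixed grid
    after scaling, to show why a per-block scale preserves accuracy.
    """
    x = x.reshape(-1, BLOCK)
    # Shared power-of-two scale so each block's max maps near 1.0
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30)))
    q = np.round(x / scale * (levels / 2)) / (levels / 2)  # coarse elements
    return q, scale

def block_dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
# Values spanning many orders of magnitude, where a single tensor-wide
# scale would lose small elements entirely
x = rng.standard_normal(4 * BLOCK) * 10.0 ** rng.integers(-3, 4, 4 * BLOCK)
q, scale = block_quantize(x)
x_hat = block_dequantize(q, scale)
# Per-element error is bounded by half a grid step times the block scale
err = np.abs(x_hat - x).reshape(-1, BLOCK)
assert np.all(err <= scale / 256 + 1e-12)
```

The per-block scale is what the Blackwell Tensor Core applies natively for MX formats; the grid resolution here (256 levels) is an arbitrary stand-in for the FP8 element encoding.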

Matrix multiplication and attention mechanisms are the computational backbone of modern AI workloads. While libraries like NVIDIA cuDNN provide highly optimized implementations, and frameworks such as CUTLASS offer deep customization, many developers and researchers need a middle ground that combines performance with programmability. The open-source Triton compiler on the NVIDIA Blackwell architecture addresses this need by exposing Blackwell's advanced features through an intuitive programming model.

As a result of NVIDIA's ongoing collaboration with OpenAI, the Triton compiler now supports the NVIDIA Blackwell architecture. This ensures that developers and researchers can use the latest and greatest features of the Blackwell architecture easily, from the comfort of a Python-based compiler such as Triton.

Performance advances on NVIDIA Blackwell

The NVIDIA Blackwell architecture introduces substantial improvements in both raw computing power and architectural innovations. NVIDIA's collaboration with OpenAI has focused on leveraging these capabilities transparently through Triton's compiler infrastructure, particularly in two key areas:

- Matrix multiplications, including flash attention
- New precision formats

Matrix multiplications

The NVIDIA Blackwell architecture adds a brand-new Tensor Core designed from the ground up for improved throughput and energy efficiency. By extending Triton's Matrix Multiply-Accumulate (MMA) pipelining machinery, we've enabled automatic exploitation of NVIDIA Blackwell's new Tensor Cores. This required careful analysis of memory access patterns and sophisticated compiler transformations to ensure correct and efficient overlap of compute and data movement. The result is exceptional out-of-the-box performance for both FP8 and FP16 GEMM operations, with these optimizations automatically applying to any kernel that uses Triton's tl.dot primitive.
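To make the accumulation pattern concrete, the NumPy loop below is an illustrative sketch (not Triton code; the tile size is an arbitrary choice) of what a GEMM kernel does per K tile with tl.dot: load FP16 tiles and accumulate into an FP32 accumulator.

```python
import numpy as np

def tiled_gemm_fp16(a, b, block_k=32):
    """Illustrative tiled GEMM: FP16 inputs, FP32 accumulation.

    Mirrors, in plain NumPy, the per-tile accumulation a Triton kernel
    performs with tl.dot inside a K loop; block_k is an arbitrary tile
    size chosen for illustration.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n), dtype=np.float32)  # FP32 accumulator
    for k0 in range(0, k, block_k):
        a_tile = a[:, k0:k0 + block_k].astype(np.float32)
        b_tile = b[k0:k0 + block_k, :].astype(np.float32)
        acc += a_tile @ b_tile  # one accumulation step per K tile
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 96)).astype(np.float16)
b = rng.standard_normal((96, 48)).astype(np.float16)
c = tiled_gemm_fp16(a, b)
ref = a.astype(np.float32) @ b.astype(np.float32)
assert np.allclose(c, ref, atol=1e-3)
```

On the GPU, the compiler's MMA pipelining overlaps the loads of the next tiles with the Tensor Core computation of the current one; the sequential loop here shows only the numerics.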
Overall, Triton manages to achieve near-optimal performance, comparable to library implementations, across several critical use cases.

Figure 1. Performance improvements with Triton on NVIDIA Blackwell

Figure 1 shows that Triton optimizations on the NVIDIA Blackwell architecture bring hardware performance improvements to users in both FP16 and FP8 in this K-sweep analysis for a GEMM kernel of a typical generative AI size, as provided in the Triton tutorials.

Flash attention

Flash attention, a crucial primitive in modern transformer architectures, sees significant speedups on NVIDIA Blackwell through Triton, with up to 1.5x for FP16 attention over the NVIDIA Hopper GPU architecture. While we continue to optimize absolute performance through ongoing compiler enhancements for FP8 and other precisions, the current work helps customers readily transition to NVIDIA Blackwell on day 0 for existing products. Another important point is that this performance gain comes "for free" with existing Triton flash attention implementations, requiring no code changes.

Figure 2. Large performance gains for more complex workloads

Figure 2 shows that more complex workloads, such as the flash attention kernel provided in the Triton tutorials, again demonstrate the large performance gains of the NVIDIA Blackwell architecture when unlocked with Triton compiler improvements. Some improvements from this work have also raised NVIDIA Hopper attention performance, which does not show up in this data.

New precision formats

NVIDIA Blackwell introduces revolutionary block-scaled floating point formats, including the Open Compute Project's microscaling formats, which Triton now unlocks for NVIDIA Blackwell-powered hardware acceleration.
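The core trick behind the flash attention kernel can be sketched in NumPy: stream over K/V in blocks while maintaining a running row maximum and softmax normalizer, so the full score matrix is never materialized. This is a simplified single-head sketch; the block size and shapes are illustrative, not what the Triton tutorial kernel uses.

```python
import numpy as np

def flash_attention(q, k, v, block=16):
    """Simplified single-head flash attention in NumPy.

    Streams over K/V blocks with an online softmax: m is the running
    row max, l the running normalizer, and previous partial results
    are rescaled whenever the max grows.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(n_q, -np.inf)  # running row max
    l = np.zeros(n_q)          # running softmax normalizer
    for j0 in range(0, k.shape[0], block):
        kj = k[j0:j0 + block]
        vj = v[j0:j0 + block]
        s = (q @ kj.T) * scale                # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        alpha = np.exp(m - m_new)             # rescale old partial sums
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ vj
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
# Reference: naive attention with the full score matrix
ref_s = (q @ k.T) / np.sqrt(8)
ref_p = np.exp(ref_s - ref_s.max(axis=1, keepdims=True))
ref = (ref_p / ref_p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention(q, k, v), ref)
```

The blocked structure is what lets Triton map the kernel onto fast on-chip memory; the same algorithm runs unchanged on Blackwell, which is why existing Triton flash attention code picks up the speedup without modification.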
These formats provide higher average precision at higher performance than the non-native block-scaling techniques frequently emulated in LLM inference projects today.

For OCP format support, MXFP8 GEMMs on Triton showcase exceptional performance, similar to the accelerated FP8 GEMM performance shown earlier in this post, while natively allowing for scaling in the Tensor Core. Similarly, MXFP4 provides a new operating point in the precision-performance trade-off space, while offering double the hardware-accelerated performance of FP8 and MXFP8 GEMMs.

To learn more about the new block-scaled floating point support, take a look at the new Triton tutorial dedicated to this functionality.

Areas of improvement going forward

The layout and packing of sub-byte datatype formats like MXFP4 still require care by the end user. We look forward to working with the community to improve the ergonomics for kernel authors and enable seamless framework integrations. The matrix multiplication kernels referenced earlier still achieve relatively low utilization, across all data types, when GEMM_K is small. This can be mitigated through manual sub-tiling in the kernel itself, as implemented in the GEMM tutorials as an example, and will eventually be addressed transparently in the compiler through automatic warp specialization.

More information

Philippe Tillet, the creator of Triton, and NVIDIA will be diving into the details of this NVIDIA Blackwell work and the resulting performance at the NVIDIA GTC conference on March 17. Register to attend GTC 2025 virtually or attend live.

This release establishes a powerful foundation for NVIDIA Blackwell support in Triton, but it's just the beginning. Here's how you can help shape what's next: start building with Triton on NVIDIA Blackwell today and unlock the full potential of NVIDIA's latest architecture while maintaining complete control over your development. Have ideas, or encountered issues?
Contact our NVIDIA product manager Matthew Nicely by tagging him on GitHub.
