NVIDIA Developer · February 16
OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability

NVIDIA has partnered with OpenAI to bring NVIDIA Blackwell architecture support to the Triton compiler, letting developers easily tap Blackwell's latest features. Blackwell delivers major gains in both raw compute and architectural innovation, and Triton has been optimized in two areas: matrix multiplication and new precision formats. By extending its MMA pipeline, Triton automatically exploits Blackwell's new Tensor Cores, achieving excellent performance on FP8 and FP16 GEMM operations. Flash attention also sees significant speedups on Blackwell, with FP16 attention up to 1.5x faster. In addition, Triton unlocks Blackwell's block-scaled floating point formats, offering LLM inference projects higher performance and precision.

🚀 **Matrix multiplication performance:** By extending its MMA pipeline, the Triton compiler automatically exploits the new Tensor Cores of the NVIDIA Blackwell architecture, significantly boosting FP8 and FP16 GEMM performance and achieving near-optimal results, comparable to library implementations, across several critical use cases.

💡 **Flash attention speedup:** Flash attention, a key building block of modern Transformer architectures, sees significant acceleration on NVIDIA Blackwell through the Triton compiler: FP16 attention is up to 1.5x faster than on the NVIDIA Hopper GPU architecture, with no changes required to existing Triton flash attention code.

✨ **New precision format support:** The NVIDIA Blackwell architecture introduces revolutionary block-scaled floating point formats, including the Open Compute Project's microscaling formats, which Triton now unlocks for Blackwell-powered hardware acceleration. MXFP8 GEMMs on Triton perform similarly to FP8 GEMMs while natively allowing scaling in the Tensor Core. MXFP4 offers a new point in the precision-performance trade-off, with twice the hardware-accelerated throughput of FP8 and MXFP8 GEMMs.
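As a rough illustration of the block-scaled idea behind the MX formats, the sketch below is a simplified NumPy emulation, not the real MXFP8 encoding: it uses a shared power-of-two scale per 32-element block and coarse grid rounding in place of true E4M3 elements, just to show how a per-block scale bounds the quantization error.

```python
import numpy as np

BLOCK = 32  # MX formats share one scale per 32 elements

def block_quantize(x, levels=256):
    """Quantize each 32-element block with a shared power-of-two scale.

    Simplified emulation: real MXFP8 stores E4M3/E5M2 elements plus an
    E8M0 (power-of-two) scale per block; here we round to a fixed grid
    after scaling, to show why a per-block scale preserves accuracy.
    """
    x = x.reshape(-1, BLOCK)
    # Shared power-of-two scale so each block's max maps near 1.0
    amax = np.abs(x).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30)))
    q = np.round(x / scale * (levels / 2)) / (levels / 2)  # coarse elements
    return q, scale

def block_dequantize(q, scale):
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
# Values spanning many orders of magnitude, where a single tensor-wide
# scale would lose small elements entirely
x = rng.standard_normal(4 * BLOCK) * 10.0 ** rng.integers(-3, 4, 4 * BLOCK)
q, scale = block_quantize(x)
x_hat = block_dequantize(q, scale)
# Per-element error is bounded by half a grid step times the block scale
err = np.abs(x_hat - x).reshape(-1, BLOCK)
assert np.all(err <= scale / 256 + 1e-12)
```

The per-block scale is what the Blackwell Tensor Core applies natively for MX formats; the grid resolution here (256 levels) is an arbitrary stand-in for the FP8 element encoding.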

Matrix multiplication and attention mechanisms are the computational backbone of modern AI workloads. While libraries like NVIDIA cuDNN provide highly optimized implementations, and frameworks such as CUTLASS offer deep customization, many developers and researchers need a middle ground that combines performance with programmability. The open-source Triton compiler on the NVIDIA Blackwell architecture addresses this need by exposing Blackwell's advanced features through an intuitive programming model.

As a result of NVIDIA's ongoing collaboration with OpenAI, the Triton compiler now supports the NVIDIA Blackwell architecture. This ensures that developers and researchers can use the latest and greatest features of the Blackwell architecture easily, from the comfort of a Python-based compiler such as Triton.

Performance advances on NVIDIA Blackwell

The NVIDIA Blackwell architecture introduces substantial improvements in both raw computing power and architectural innovations. NVIDIA's collaboration with OpenAI has focused on leveraging these capabilities transparently through Triton's compiler infrastructure, particularly in two key areas:

- Matrix multiplications, including flash attention
- New precision formats

Matrix multiplications

The NVIDIA Blackwell architecture adds a brand-new Tensor Core designed from the ground up for improved throughput and energy efficiency. By extending Triton's Matrix Multiply-Accumulate (MMA) pipelining machinery, we've enabled automatic exploitation of NVIDIA Blackwell's new Tensor Cores. This required careful analysis of memory access patterns and sophisticated compiler transformations to ensure correct and efficient overlap of compute and data movement. The result is exceptional out-of-the-box performance for both FP8 and FP16 GEMM operations, with these optimizations automatically applying to any kernel that uses Triton's tl.dot primitive.
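To make the accumulation pattern concrete, the NumPy loop below is an illustrative sketch (not Triton code; the tile size is an arbitrary choice) of what a GEMM kernel does per K tile with tl.dot: load FP16 tiles and accumulate into an FP32 accumulator.

```python
import numpy as np

def tiled_gemm_fp16(a, b, block_k=32):
    """Illustrative tiled GEMM: FP16 inputs, FP32 accumulation.

    Mirrors, in plain NumPy, the per-tile accumulation a Triton kernel
    performs with tl.dot inside a K loop; block_k is an arbitrary tile
    size chosen for illustration.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    acc = np.zeros((m, n), dtype=np.float32)  # FP32 accumulator
    for k0 in range(0, k, block_k):
        a_tile = a[:, k0:k0 + block_k].astype(np.float32)
        b_tile = b[k0:k0 + block_k, :].astype(np.float32)
        acc += a_tile @ b_tile  # one accumulation step per K tile
    return acc

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 96)).astype(np.float16)
b = rng.standard_normal((96, 48)).astype(np.float16)
c = tiled_gemm_fp16(a, b)
ref = a.astype(np.float32) @ b.astype(np.float32)
assert np.allclose(c, ref, atol=1e-3)
```

On the GPU, the compiler's MMA pipelining overlaps the loads of the next tiles with the Tensor Core computation of the current one; the sequential loop here shows only the numerics.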
Overall, Triton manages to achieve near-optimal performance, comparable to library implementations, across several critical use cases.

Figure 1. Performance improvements with Triton on NVIDIA Blackwell

Figure 1 shows that Triton optimizations on the NVIDIA Blackwell architecture bring hardware performance improvements to users in both FP16 and FP8 in this K-sweep analysis for a GEMM kernel of a typical generative AI size, as provided in the Triton tutorials.

Flash attention

Flash attention, a crucial primitive in modern transformer architectures, sees significant speedups on NVIDIA Blackwell through Triton, with up to 1.5x for FP16 attention over the NVIDIA Hopper GPU architecture. While we continue to optimize absolute performance through ongoing compiler enhancements for FP8 and other precisions, the current work helps customers readily transition to NVIDIA Blackwell on day 0 for existing products. Another important point is that this performance gain comes "for free" with existing Triton flash attention implementations, requiring no code changes.

Figure 2. Large performance gains for more complex workloads

Figure 2 shows that more complex workloads, such as the flash attention kernel provided in the Triton tutorials, again demonstrate the large performance gains of the NVIDIA Blackwell architecture when unlocked with Triton compiler improvements. Some improvements from this work have also raised NVIDIA Hopper attention performance, which does not show up in this data.

New precision formats

NVIDIA Blackwell introduces revolutionary block-scaled floating point formats, including the Open Compute Project's microscaling formats, which Triton now unlocks for NVIDIA Blackwell-powered hardware acceleration.
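The core trick behind the flash attention kernel can be sketched in NumPy: stream over K/V in blocks while maintaining a running row maximum and softmax normalizer, so the full score matrix is never materialized. This is a simplified single-head sketch; the block size and shapes are illustrative, not what the Triton tutorial kernel uses.

```python
import numpy as np

def flash_attention(q, k, v, block=16):
    """Simplified single-head flash attention in NumPy.

    Streams over K/V blocks with an online softmax: m is the running
    row max, l the running normalizer, and previous partial results
    are rescaled whenever the max grows.
    """
    n_q, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    m = np.full(n_q, -np.inf)  # running row max
    l = np.zeros(n_q)          # running softmax normalizer
    for j0 in range(0, k.shape[0], block):
        kj = k[j0:j0 + block]
        vj = v[j0:j0 + block]
        s = (q @ kj.T) * scale                # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        alpha = np.exp(m - m_new)             # rescale old partial sums
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ vj
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((32, 8)) for _ in range(3))
# Reference: naive attention with the full score matrix
ref_s = (q @ k.T) / np.sqrt(8)
ref_p = np.exp(ref_s - ref_s.max(axis=1, keepdims=True))
ref = (ref_p / ref_p.sum(axis=1, keepdims=True)) @ v
assert np.allclose(flash_attention(q, k, v), ref)
```

The blocked structure is what lets Triton map the kernel onto fast on-chip memory; the same algorithm runs unchanged on Blackwell, which is why existing Triton flash attention code picks up the speedup without modification.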
These formats provide higher average precision at higher performance than the non-native block-scaling techniques frequently emulated in LLM inference projects today.

For OCP format support, MXFP8 GEMMs on Triton showcase exceptional performance, similar to the accelerated FP8 GEMM performance shown earlier in this post, while natively allowing for scaling in the Tensor Core. Similarly, MXFP4 provides a new operating point in the precision-performance trade-off space, while offering double the hardware-accelerated performance of FP8 and MXFP8 GEMMs.

To learn more about the new block-scaled floating point support, take a look at the new Triton tutorial dedicated to this functionality.

Areas of improvement going forward

The layout and packing of sub-byte datatype formats like MXFP4 still require care by the end user. We look forward to working with the community to improve the ergonomics for kernel authors and enable seamless framework integrations. The matrix multiplication kernels referenced earlier still achieve relatively low utilization, across all data types, when GEMM_K is small. This can be mitigated through manual sub-tiling in the kernel itself, as implemented in the GEMM tutorials as an example, and will eventually be addressed transparently in the compiler through automatic warp specialization.

More information

Philippe Tillet, the creator of Triton, and NVIDIA will be diving into the details of this NVIDIA Blackwell work and the resulting performance at the NVIDIA GTC conference on March 17. Register to attend GTC 2025 virtually or attend live.

This release establishes a powerful foundation for NVIDIA Blackwell support in Triton, but it's just the beginning. Here's how you can help shape what's next: start building with Triton on NVIDIA Blackwell today and unlock the full potential of NVIDIA's latest architecture while maintaining complete control over your development. Have ideas, or encountered issues?
Contact our NVIDIA product manager Matthew Nicely by tagging him on GitHub.
