MarkTechPost@AI, October 4, 2024
Mirage: A Multi-Level Tensor Algebra Super-Optimizer that Automates GPU Kernel Generation for PyTorch Applications

Mirage is a tool developed at Carnegie Mellon University to automatically generate high-performance GPU kernels. It simplifies engineers' work: the kernels it generates run faster than the best human-written code and reduce overall latency. Mirage works directly on PyTorch tensors and requires far fewer lines of code than traditional methods. The article also covers relevant GPU background and four categories of GPU optimization.

🎯 Mirage is an innovative tool that automatically searches for and generates high-performance GPU kernels. These kernels can be used directly on PyTorch tensors and called from PyTorch programs, and users only need to write a few lines of code.

💪 The kernels Mirage generates are fast, running 1.2x-2.5x faster than the best human-written code, and integrating Mirage into PyTorch reduces overall latency by 15-20%.

📚 GPU computation is organized around kernels and follows a specific memory hierarchy of register files, shared memory, and device memory. Mirage represents the architecture with a uGraph spanning multiple levels: kernels, thread blocks, and threads.

🌟 The article describes four categories of GPU optimization, such as Normalization + Linear, LoRA + Linear, Gated MLP, and Attention variants, most of which are missing from today's ML systems.

With the rapid growth of artificial intelligence, driven by the introduction of large language models (LLMs) and generative AI, there has been growing demand for more efficient graphics processing units (GPUs). GPUs are specialized hardware used extensively for high-compute tasks, capable of executing many computations in parallel. Writing proper GPU kernels is essential to use GPUs to their full potential, but the task is time-consuming and complex, requiring deep expertise in GPU architecture and in programming languages such as C++ and CUDA.

ML compilers like TVM, Triton, and Mojo provide some automation but still require manual tuning of GPU kernels to obtain optimal results. To achieve optimal results without that manual effort, researchers at Carnegie Mellon University have developed Mirage, an innovative tool that automates the generation of high-performance GPU kernels by searching for and generating them. The kernels Mirage generates can be used directly on PyTorch tensors and called from PyTorch programs, and users write only a few lines of code compared to the many lines of a traditional script.

Mirage can be seen as a game changer, bringing higher productivity, better performance, and stronger correctness to AI applications. Writing kernels by hand requires substantial engineering expertise because of the complexity of GPU architecture; Mirage simplifies the process by generating kernels automatically, easing the work of engineers.

Manually written GPU kernels can also contain errors that make the required results hard to achieve, whereas research on Mirage has shown that its generated kernels are 1.2x-2.5x faster than the best human-written code. Moreover, integrating Mirage into PyTorch reduces overall latency by 15-20%.

    # Use Mirage to generate GPU kernels for attention
    import mirage as mi

    graph = mi.new_kernel_graph()
    Q = graph.new_input(dims=(64, 1, 128), dtype=mi.float16)
    K = graph.new_input(dims=(64, 128, 4096), dtype=mi.float16)
    V = graph.new_input(dims=(64, 4096, 128), dtype=mi.float16)
    A = graph.matmul(Q, K)
    S = graph.softmax(A)
    O = graph.matmul(S, V)
    optimized_graph = graph.superoptimize()

Code in Mirage takes only a few lines, compared to the many lines of a traditional implementation.

All GPU computation is centered around kernels: functions that run in parallel across multiple streaming multiprocessors (SMs) in a single-program-multiple-data (SPMD) fashion. A kernel organizes its computation as a grid of thread blocks, with each thread block running on a single SM. Each block in turn contains multiple threads that perform calculations on individual data elements.
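To make the SPMD model concrete, here is a minimal vector-add kernel written in Triton (one of the compilers mentioned above; this sketch is illustrative and unrelated to Mirage's generated code). Each program instance, analogous to a thread block, loads, processes, and stores one tile of the input:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        # Each program instance handles one BLOCK_SIZE-wide tile of the data
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements  # guard the final, partial tile
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    # Launch a 1D grid with one program per tile
    x = torch.randn(4096, device="cuda")
    y = torch.randn(4096, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)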

A GPU follows a particular memory hierarchy: per-thread register files (the fastest and smallest tier), on-chip shared memory visible to all threads in a thread block, and off-chip device memory accessible to the entire kernel.

Mirage represents this architecture with a uGraph, which contains graphs at multiple levels: the kernel level, the thread-block level, and the thread level. The kernel level encapsulates computation over the entire GPU, the thread-block level captures computation on an individual streaming multiprocessor (SM), and the thread level captures computation on CUDA and tensor cores. The uGraph thus provides a structured way to represent GPU computations.

Four Categories of GPU Optimization:

1. Normalization + Linear

LLMs generally use LayerNorm, RMSNorm, GroupNorm, and BatchNorm, techniques that ML compilers often treat as separate kernels because normalization requires both reduction and broadcast operations. These normalization layers can instead be fused with the matrix multiplication of a linear layer, as in the sketch below.
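For reference, here is a minimal unfused PyTorch baseline of the RMSNorm + Linear pattern (function and parameter names are illustrative assumptions, not Mirage's API). A typical compiler emits one kernel for the reduction and another for the matmul; fusion removes that boundary:

    import torch

    def rmsnorm_linear(x, gamma, w_proj, eps=1e-6):
        # RMSNorm: a reduction over the last dim, then a broadcast rescale
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
        normed = x * rms * gamma
        # Linear layer: the matmul that a fused kernel can absorb
        return normed @ w_proj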

2. LoRA + Linear

This optimization fuses low-rank adaptation (LoRA), a technique for adapting pre-trained models to new tasks or datasets while reducing computational requirements, with the linear layers it augments. The fused kernel is 1.6x faster than existing systems.
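In unfused form, LoRA adds a low-rank update alongside the base projection; a minimal PyTorch sketch (names and the scaling parameter are illustrative assumptions):

    import torch

    def lora_linear(x, W, A, B, scaling=1.0):
        # Base projection plus low-rank update: y = xW + scaling * (xA)B.
        # A is (d_in, r) and B is (r, d_out) with small rank r, so the
        # update is much cheaper than a full d_in x d_out matmul.
        return x @ W + scaling * ((x @ A) @ B)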

3. Gated MLP

This optimization combines two MatMuls, a SiLU activation, and an element-wise multiplication in a single kernel. The fused Gated MLP reduces kernel launch overhead and device memory access, running 1.3x faster than the best baseline.
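The unfused pattern looks like this in PyTorch (an illustrative sketch, not Mirage's generated kernel):

    import torch
    import torch.nn.functional as F

    def gated_mlp(x, w_gate, w_up):
        # Two projections of the same input; SiLU gates one against the other
        return F.silu(x @ w_gate) * (x @ w_up)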

4. Attention variants

a. Query-Key Normalization 

Chameleon, ViT-22B, and a recent Google paper introduced query-key normalization, which applies LayerNorm to the queries and keys inside the attention kernel. Mirage's custom kernel fuses this normalization into attention while also applying existing GPU optimizations tailored for attention, yielding a 1.7x-2.5x performance improvement.
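Before fusion, query-key normalization amounts to normalizing Q and K just ahead of the standard attention computation; a minimal PyTorch sketch (illustrative, not Mirage's kernel):

    import math
    import torch
    import torch.nn.functional as F

    def qk_norm_attention(q, k, v):
        # Normalize queries and keys over the head dimension before attention
        q = F.layer_norm(q, q.shape[-1:])
        k = F.layer_norm(k, k.shape[-1:])
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v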

Four categories of GPU optimization that are mostly missing in today’s ML systems

b. Multi-Head Latent Attention 

Multi-head latent attention optimizes memory usage by compressing attention's traditional key-value cache into a more compact latent vector. This change introduces two linear layers before attention. Mirage generates a custom kernel that integrates those linear layers with the attention mechanism in a single kernel, which avoids storing intermediate key-value vectors in GPU device memory.
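A rough sketch of the unfused pattern (all names and shapes here are illustrative assumptions, not the exact published formulation): a down-projection compresses the input into a small latent, and up-projections recover keys and values just before attention:

    import math
    import torch

    def latent_kv_attention(q, x, w_down, w_uk, w_uv):
        # Compress the input into a compact latent; only this needs caching
        latent = x @ w_down                 # (seq, d_latent), d_latent small
        k = latent @ w_uk                   # up-project latent to keys
        v = latent @ w_uv                   # up-project latent to values
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
        return torch.softmax(scores, dim=-1) @ v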

In conclusion, Mirage addresses the critical challenge of writing high-performance GPU kernels for advanced artificial intelligence workloads. It removes the need for a large time investment and deep coding expertise, and avoids the errors of hand-written kernels, by producing optimized GPU kernels that work in a PyTorch-based environment. It also captures optimizations that manual implementation might miss, accelerating the deployment of LLMs and other AI technologies across real-world applications.


Check out the GitHub page and Details. All credit for this research goes to the researchers of this project.

