MarkTechPost@AI · 2 days ago, 15:45
Meta Introduces KernelLLM: An 8B LLM that Translates PyTorch Modules into Efficient Triton GPU Kernels

Meta has released KernelLLM, an 8-billion-parameter language model fine-tuned from Llama 3.1 Instruct that automatically translates PyTorch modules into efficient Triton GPU kernels. By streamlining kernel development, the model lowers the barrier to GPU programming. KernelLLM was trained on KernelBook, a dataset of roughly 25,000 paired examples of PyTorch modules and their corresponding Triton kernel implementations. On the KernelBench-Triton benchmark it achieves a Pass@1 score of 20.2, outperforming larger models such as GPT-4o and DeepSeek V3, indicating strong performance at generating correct kernels.

🚀 Meta has released KernelLLM, an 8-billion-parameter language model fine-tuned from Llama 3.1 Instruct that specializes in automatically translating PyTorch modules into efficient Triton GPU kernels, with the aim of simplifying GPU programming.

📚 KernelLLM's training dataset, KernelBook, contains roughly 25,000 PyTorch modules paired with corresponding Triton kernel implementations, drawn from filtered code in The Stack and from samples synthesized with techniques such as torch.compile().

📊 On the KernelBench-Triton benchmark, KernelLLM scores 20.2 Pass@1, beating GPT-4o (~200B parameters) and DeepSeek V3 (671B parameters), which score 15 and 16 respectively. With multiple inference passes, its Pass@10 and Pass@20 scores reach 51.8 and 57.1, demonstrating strong kernel-generation ability.

💡 By generating Triton kernels automatically, KernelLLM stands to streamline the development of GPU-accelerated applications, letting developers optimize performance without mastering manual kernel programming and making GPU resources easier to use efficiently, with potential impact on areas such as deep learning model training and inference.

Meta has introduced KernelLLM, an 8-billion-parameter language model fine-tuned from Llama 3.1 Instruct, aimed at automating the translation of PyTorch modules into efficient Triton GPU kernels. The initiative seeks to lower the barrier to GPU programming by simplifying the kernel development process.
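For orientation, here is a hypothetical usage sketch. It assumes the model is published on Hugging Face under a repo id like facebook/KernelLLM and exposes the standard transformers causal-LM interface; neither detail is confirmed by this post.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo id "facebook/KernelLLM" is an assumption for illustration only.
tok = AutoTokenizer.from_pretrained("facebook/KernelLLM")
model = AutoModelForCausalLM.from_pretrained("facebook/KernelLLM", device_map="auto")

# Source of the PyTorch module we want translated into a Triton kernel.
pytorch_source = (
    "class Model(torch.nn.Module):\n"
    "    def forward(self, x):\n"
    "        return torch.relu(x)\n"
)
inputs = tok(pytorch_source, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```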

Technical Overview

KernelLLM is trained on approximately 25,000 paired examples of PyTorch modules and their corresponding Triton kernel implementations. The dataset, known as KernelBook, comprises filtered code from The Stack and synthetically generated samples using torch.compile() and other prompting techniques.
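The post does not spell out the synthesis pipeline, but the general idea can be sketched with stock PyTorch: compiling a module with torch.compile() makes TorchInductor emit Triton kernels, which can be captured as the ground-truth side of a (PyTorch, Triton) pair. A minimal sketch of that idea, not the actual KernelBook tooling:

```python
# Run as: TORCH_LOGS="output_code" python harvest.py
# With that logging flag set, TorchInductor prints the Triton code it
# generates for the compiled graph, which can be captured and paired
# with the original PyTorch source.
import torch

class ToyModule(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x) * 2.0

compiled = torch.compile(ToyModule())
x = torch.randn(4096, device="cuda")
compiled(x)  # first call triggers compilation and kernel generation
```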

The model employs a supervised instruction tuning approach, utilizing prompt templates that include format examples during both training and evaluation. Training was conducted over 10 epochs with a batch size of 32, using 16 GPUs over approximately 12 hours (192 GPU hours).
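Meta's exact template is not published in this post, but a hypothetical instruction prompt in that spirit might look like the following, with {pytorch_source} filled in per training example:

```python
# Hypothetical prompt template; Meta's actual template is not shown here.
PROMPT_TEMPLATE = """Write an efficient Triton kernel, plus a launch wrapper,
that implements the following PyTorch module.

### PyTorch module:
{pytorch_source}

### Triton implementation:
"""

prompt = PROMPT_TEMPLATE.format(pytorch_source="class Model(nn.Module): ...")
```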

Performance Evaluation

KernelLLM’s performance was assessed using KernelBench-Triton, a benchmark designed to evaluate the generation of Triton kernels from PyTorch modules. The model achieved a Pass@1 score of 20.2, outperforming larger models such as GPT-4o (~200B parameters) and DeepSeek V3 (671B parameters), which scored 15 and 16 respectively. With multiple inferences, KernelLLM’s Pass@10 and Pass@20 scores reached 51.8 and 57.1, indicating robust performance in generating correct kernels.
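Pass@k here follows the standard unbiased estimator from the HumanEval work (Chen et al., 2021): generate n candidates per problem, count the c that pass the tests, and estimate the probability that at least one of k drawn samples is correct. In code:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn from n generations
    (of which c are correct) passes the tests (Chen et al., 2021)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# e.g. 4 correct generations out of 20 gives Pass@1 = 0.20, matching the
# scale of KernelLLM's reported 20.2 (expressed as a percentage):
print(pass_at_k(20, 4, 1))  # 0.2
```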

Implications for GPU Programming

By automating the generation of Triton kernels from PyTorch modules, KernelLLM has the potential to streamline the development of GPU-accelerated applications. This could be particularly beneficial for developers seeking to optimize performance without delving into the complexities of manual kernel programming.
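To make "manual kernel programming" concrete, below is the canonical hand-written Triton kernel for elementwise vector addition (essentially the Triton tutorial example). Explicit offsets, masks, and launch grids like these are what KernelLLM is meant to write on the developer's behalf:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                    # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                    # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                 # one program per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```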

The model’s ability to produce efficient kernels may also contribute to more accessible and efficient utilization of GPU resources, potentially impacting areas such as deep learning model training and inference.


Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.
