Nvidia Developer, February 16
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

NVIDIA engineers explored a new approach that uses the DeepSeek-R1 AI model with extra compute allocated at inference time to automatically generate optimized GPU attention kernels. In a closed loop, the model keeps improving the generated kernel code based on feedback from a verifier. Experiments show that on problems from the KernelBench benchmark, this approach can in some cases even outperform kernels developed by hand by experienced engineers. By mimicking the way humans work through complex problems, the technique lets AI solve problems more strategically, opening new possibilities for AI reasoning and code optimization.

💡 By allocating additional compute during inference, an AI model can evaluate multiple candidate solutions and select the best one, improving performance. This technique is also known as AI reasoning or long thinking.

⚙️ Using the open-source DeepSeek-R1 model together with extra compute at inference time, NVIDIA engineers automatically generated GPU attention kernels that were numerically correct and optimized for different attention variants, without any explicit programming.

⏱️ To get the best results, NVIDIA engineers built a new workflow that pairs the DeepSeek-R1 model with a dedicated verifier during inference and runs in a closed loop for a predetermined duration. Experiments showed that letting the loop run for 15 minutes produced improved attention kernels.

📊 On Stanford's KernelBench benchmark, the workflow solved 100% of Level-1 problems and 96% of Level-2 problems, showing that a larger inference-time budget significantly improves its ability to generate numerically correct code.

As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long thinking, this technique improves model performance by allocating additional computational resources during inference to evaluate multiple possible outcomes and then selecting the best one. It enables AI to strategize and systematically work through complex problems, much as humans break a hard problem into parts and solve them one by one to arrive at a final answer.

In this post, we describe an experiment by NVIDIA engineers who used one of the newest open-source models, DeepSeek-R1, together with additional computing power during inference to solve a complex problem: automatically generating GPU attention kernels that are numerically correct and optimized for different flavors of attention, without any explicit programming. In some cases, the results turned out to be better than optimized kernels developed by skilled engineers.

The need for optimized attention kernels and associated challenges

Attention is a key concept that revolutionized the development of the large language model (LLM). It is a powerful mechanism that enables AI models to focus selectively on the most relevant parts of the input when performing tasks. By concentrating on the important information, the attention operation helps models make better predictions and find hidden patterns in the data.

The computational complexity of the attention operation grows quadratically with the input sequence length. This motivates the need for an optimized lower-level implementation (that is, a GPU kernel), both to prevent runtime errors that arise from simple implementations (for example, out-of-memory errors) and for computational efficiency.

There are multiple variants of attention (causal, relative positional embeddings, alibi, and so on), and engineers often must combine several of these variants for a given task. Multi-modal models (for example, vision transformers) introduce an additional layer of challenges, as they require specialized attention mechanisms (Spatial Neighborhood Attention) for maintaining the spatio-temporal information often encountered in computer vision, video generation models, and so on.

Figure 1. Neighborhood attention on 2D inputs

Creating an optimized GPU kernel for attention takes a lot of skill and time, even for experienced software engineers. Recent LLMs like DeepSeek-R1 have shown a lot of promise in code generation tasks, but they still face challenges creating optimized code on the first try. This makes it necessary to use other strategies at inference time to generate optimized code.

The following prompt is sample user input for a relative positional embeddings attention kernel.

Please write a GPU attention kernel to support relative position encodings. Implement the relative positional encoding on the fly within the kernel. The complete code should be returned, including the necessary modifications.

Use the following function to compute the relative positional encoding:

    def relative_positional(score, b, h, q_idx, kv_idx):
        return score + (q_idx - kv_idx)

When implementing the kernel, keep in mind that a constant scaling factor 1.44269504 should be applied to the relative positional encoding due to qk_scale = sm_scale * 1.44269504. The PyTorch reference does not need to scale the relative positional encoding, but in the GPU kernel, use:

    qk = qk * qk_scale + rel_pos * 1.44269504

Please provide the complete updated kernel code that incorporates these changes, ensuring that the relative positional encoding is applied efficiently within the kernel operations.
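The score function in this prompt follows the score-modification (score_mod) style used by PyTorch FlexAttention, the baseline shown later in Figure 3: it takes a raw attention score plus batch, head, query, and key indices and returns an adjusted score. As a rough illustration of what the prompt asks the generated kernel to reproduce, the following minimal PyTorch reference applies that modification inside naive scaled dot-product attention. The tensor layout and the quadratic formulation are assumptions made here for clarity; the constant 1.44269504 is log2(e), which only matters inside kernels that evaluate the softmax with exp2 rather than exp.

```python
# Minimal PyTorch reference (illustrative sketch, not the post's actual code):
# naive scaled dot-product attention whose scores are shifted by the
# relative-position term (q_idx - kv_idx) before the softmax.
import torch

def reference_relative_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim] (assumed layout)
    sm_scale = 1.0 / (q.shape[-1] ** 0.5)

    # Raw attention scores: [batch, heads, q_len, kv_len]
    scores = torch.matmul(q, k.transpose(-2, -1)) * sm_scale

    # Apply the prompt's score modification: score + (q_idx - kv_idx)
    q_idx = torch.arange(q.shape[-2], device=q.device).view(-1, 1)
    kv_idx = torch.arange(k.shape[-2], device=k.device).view(1, -1)
    scores = scores + (q_idx - kv_idx)

    # The PyTorch reference uses a plain softmax; the log2(e) factor is only
    # needed in kernels that compute the softmax with exp2 instead of exp.
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 4, 128, 64)
print(reference_relative_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 64])
```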
LLMs can occasionally produce hallucinated code or mix syntax from different languages or frameworks, causing immediate code errors or inefficiencies. Computing an optimal GPU thread mapping is also non-trivial, often requiring iterative refinement to arrive at a correct and efficient kernel.

Inference-time scaling for generating optimized GPU kernels

To get the best results with optimized attention kernels, NVIDIA engineers created a new workflow that includes a special verifier along with the DeepSeek-R1 model during inference, run in a closed-loop fashion for a predetermined duration.

Figure 2. Inference-time scaling with DeepSeek-R1 on the NVIDIA Hopper platform

The workflow is first initialized by a manual prompt, and the DeepSeek-R1 model generates the GPU code (that is, the kernel) in the first pass. The verifier runs on an NVIDIA H100 GPU; it analyzes the generated kernel and creates new prompts that are provided as input to the DeepSeek-R1 model. This closed-loop approach improves the code generation process by guiding it in a different direction each time (a schematic sketch of the loop appears at the end of this post). The team found that letting this process continue for 15 minutes resulted in an improved attention kernel.

Figure 3. Performance of automatically generated optimized attention kernels with flex attention

This workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as measured by Stanford's KernelBench benchmark. The Level-1 solving rate in KernelBench refers to the numerical-correctness metric used to evaluate the ability of LLMs to generate efficient GPU kernels for specific computational tasks; it is part of a series of challenges testing the latest LLMs' abilities in GPU programming.

Figure 4 shows how the inference-time budget affects the agent's solving rate. Allocating more than 10 minutes per problem in the Level-1 category enables the workflow to produce numerically correct code for most of the 100 problems.

Figure 4. Inference-time scaling results in optimized GPU kernels

Optimized GPU kernels on DeepSeek-R1

These results show how you can use the latest DeepSeek-R1 model to produce better GPU kernels by spending more computing power at inference time. This is still a new research area, with early results on a promising approach that automatically generates effective attention kernels. While we are off to a good start, more work is needed to generate better results consistently for a wider variety of problems. We're excited about the recent developments in DeepSeek-R1 and its potential. For more information or to get started, see the DeepSeek-R1 NIM microservice, now available on build.nvidia.com.
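To make the closed-loop workflow described above more concrete, here is a minimal Python sketch of a generate-verify-refine loop under a fixed time budget. The function names, prompt format, and feedback shape are hypothetical placeholders (the post does not publish the actual implementation); they only illustrate the structure of the loop.

```python
# Schematic sketch of a generate-verify-refine loop with a fixed time budget.
# All names below (generate_kernel, verify_kernel) are hypothetical placeholders,
# not NVIDIA's actual implementation.
import time

TIME_BUDGET_S = 15 * 60  # the post reports improved kernels after about 15 minutes


def generate_kernel(prompt: str) -> str:
    """Ask a DeepSeek-R1 endpoint for kernel code (placeholder)."""
    raise NotImplementedError  # e.g., an HTTP call to a hosted R1 endpoint


def verify_kernel(kernel_code: str) -> tuple[bool, str]:
    """Build and run the kernel against a PyTorch reference on the GPU,
    returning (numerically_correct, feedback_text) (placeholder)."""
    raise NotImplementedError


def closed_loop(initial_prompt: str) -> str | None:
    prompt, best_kernel = initial_prompt, None
    deadline = time.monotonic() + TIME_BUDGET_S
    while time.monotonic() < deadline:
        kernel = generate_kernel(prompt)           # R1 drafts a candidate kernel
        correct, feedback = verify_kernel(kernel)  # verifier checks it on the GPU
        if correct:
            best_kernel = kernel                   # keep the latest correct kernel
        # Fold the verifier's analysis into the next prompt.
        prompt = (f"{initial_prompt}\n\nPrevious attempt:\n{kernel}\n\n"
                  f"Verifier feedback:\n{feedback}")
    return best_kernel
```

The key design choice is that the verifier, not the model, decides when a candidate is acceptable, so additional inference time translates directly into more verified attempts.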
