Nvidia Developer, February 16
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling

NVIDIA engineers explored a new approach that uses the DeepSeek-R1 AI model with extra compute allocated at inference time to automatically generate optimized GPU attention kernels. In a closed loop, the model keeps improving the generated kernel code based on feedback from a verifier. Experiments show that on problems from the KernelBench benchmark, this approach can in some cases even outperform kernels developed by hand by experienced engineers. By mimicking the way humans work through complex problems, the technique lets AI solve problems more strategically, opening new possibilities for AI reasoning and code optimization.

💡 By allocating additional compute during inference, an AI model can evaluate multiple candidate solutions and select the best one, improving performance. This technique is also known as AI reasoning or long thinking.

⚙️ Using the open-source DeepSeek-R1 model together with extra compute at inference time, NVIDIA engineers automatically generated GPU attention kernels that were numerically correct and optimized for different attention variants, without any explicit programming.

⏱️ To get the best results, NVIDIA engineers built a new workflow that pairs the DeepSeek-R1 model with a dedicated verifier during inference and runs in a closed loop for a predetermined duration. Experiments showed that letting the loop run for 15 minutes produced improved attention kernels.

📊 On Stanford's KernelBench benchmark, the workflow solved 100% of Level-1 problems and 96% of Level-2 problems, showing that a larger inference-time budget significantly improves its ability to generate numerically correct code.

As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long thinking, this technique improves model performance by allocating additional computational resources during inference to evaluate multiple possible outcomes and then selecting the best one. It enables AI to strategize and systematically work through complex problems, much as humans break a hard problem into parts and solve them one by one to arrive at a final answer.

In this post, we describe an experiment by NVIDIA engineers who used one of the newest open-source models, DeepSeek-R1, together with additional computing power during inference to solve a complex problem: automatically generating GPU attention kernels that are numerically correct and optimized for different flavors of attention, without any explicit programming. In some cases, the results turned out to be better than optimized kernels developed by skilled engineers.

The need for optimized attention kernels and associated challenges

Attention is a key concept that revolutionized the development of the large language model (LLM). It is a powerful mechanism that enables AI models to focus selectively on the most relevant parts of the input when performing tasks. By concentrating on the important information, the attention operation helps models make better predictions and find hidden patterns in the data.

The computational complexity of the attention operation grows quadratically with the input sequence length. This motivates the need for an optimized lower-level implementation (that is, a GPU kernel), both to prevent runtime errors that arise from simple implementations (for example, out-of-memory errors) and for computational efficiency.

There are multiple variants of attention (causal, relative positional embeddings, alibi, and so on), and engineers often must combine several of these variants for a given task. Multi-modal models (for example, vision transformers) introduce an additional layer of challenges, as they require specialized attention mechanisms (Spatial Neighborhood Attention) for maintaining the spatio-temporal information often encountered in computer vision, video generation models, and so on.

Figure 1. Neighborhood attention on 2D inputs

Creating an optimized GPU kernel for attention takes a lot of skill and time, even for experienced software engineers. Recent LLMs like DeepSeek-R1 have shown a lot of promise in code generation tasks, but they still face challenges creating optimized code on the first try. This makes it necessary to use other strategies at inference time to generate optimized code.

The following prompt is sample user input for a relative positional embeddings attention kernel.

Please write a GPU attention kernel to support relative position encodings. Implement the relative positional encoding on the fly within the kernel. The complete code should be returned, including the necessary modifications.

Use the following function to compute the relative positional encoding:

    def relative_positional(score, b, h, q_idx, kv_idx):
        return score + (q_idx - kv_idx)

When implementing the kernel, keep in mind that a constant scaling factor 1.44269504 should be applied to the relative positional encoding due to qk_scale = sm_scale * 1.44269504. The PyTorch reference does not need to scale the relative positional encoding, but in the GPU kernel, use:

    qk = qk * qk_scale + rel_pos * 1.44269504

Please provide the complete updated kernel code that incorporates these changes, ensuring that the relative positional encoding is applied efficiently within the kernel operations.
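The score function in this prompt follows the score-modification (score_mod) style used by PyTorch FlexAttention, the baseline shown later in Figure 3: it takes a raw attention score plus batch, head, query, and key indices and returns an adjusted score. As a rough illustration of what the prompt asks the generated kernel to reproduce, the following minimal PyTorch reference applies that modification inside naive scaled dot-product attention. The tensor layout and the quadratic formulation are assumptions made here for clarity; the constant 1.44269504 is log2(e), which only matters inside kernels that evaluate the softmax with exp2 rather than exp.

```python
# Minimal PyTorch reference (illustrative sketch, not the post's actual code):
# naive scaled dot-product attention whose scores are shifted by the
# relative-position term (q_idx - kv_idx) before the softmax.
import torch

def reference_relative_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim] (assumed layout)
    sm_scale = 1.0 / (q.shape[-1] ** 0.5)

    # Raw attention scores: [batch, heads, q_len, kv_len]
    scores = torch.matmul(q, k.transpose(-2, -1)) * sm_scale

    # Apply the prompt's score modification: score + (q_idx - kv_idx)
    q_idx = torch.arange(q.shape[-2], device=q.device).view(-1, 1)
    kv_idx = torch.arange(k.shape[-2], device=k.device).view(1, -1)
    scores = scores + (q_idx - kv_idx)

    # The PyTorch reference uses a plain softmax; the log2(e) factor is only
    # needed in kernels that compute the softmax with exp2 instead of exp.
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 4, 128, 64)
print(reference_relative_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 64])
```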
LLMs can occasionally produce hallucinated code or mix syntax from different languages or frameworks, causing immediate code errors or inefficiencies. Computing an optimal GPU thread mapping is also non-trivial, often requiring iterative refinement to arrive at a correct and efficient kernel.

Inference-time scaling for generating optimized GPU kernels

To get the best results with optimized attention kernels, NVIDIA engineers created a new workflow that includes a special verifier along with the DeepSeek-R1 model during inference, run in a closed-loop fashion for a predetermined duration.

Figure 2. Inference-time scaling with DeepSeek-R1 on the NVIDIA Hopper platform

The workflow is first initialized by a manual prompt, and the DeepSeek-R1 model generates the GPU code (that is, the kernel) in the first pass. The verifier runs on an NVIDIA H100 GPU; it analyzes the generated kernel and creates new prompts that are provided as input to the DeepSeek-R1 model. This closed-loop approach improves the code generation process by guiding it in a different direction each time (a schematic sketch of the loop appears at the end of this post). The team found that letting this process continue for 15 minutes resulted in an improved attention kernel.

Figure 3. Performance of automatically generated optimized attention kernels with flex attention

This workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as measured by Stanford's KernelBench benchmark. The Level-1 solving rate in KernelBench refers to the numerical-correctness metric used to evaluate the ability of LLMs to generate efficient GPU kernels for specific computational tasks; it is part of a series of challenges testing the latest LLMs' abilities in GPU programming.

Figure 4 shows how the inference-time budget affects the agent's solving rate. Allocating more than 10 minutes per problem in the Level-1 category enables the workflow to produce numerically correct code for most of the 100 problems.

Figure 4. Inference-time scaling results in optimized GPU kernels

Optimized GPU kernels on DeepSeek-R1

These results show how you can use the latest DeepSeek-R1 model to produce better GPU kernels by spending more computing power at inference time. This is still a new research area, with early results on a promising approach that automatically generates effective attention kernels. While we are off to a good start, more work is needed to generate better results consistently for a wider variety of problems. We're excited about the recent developments in DeepSeek-R1 and its potential. For more information or to get started, see the DeepSeek-R1 NIM microservice, now available on build.nvidia.com.
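To make the closed-loop workflow described above more concrete, here is a minimal Python sketch of a generate-verify-refine loop under a fixed time budget. The function names, prompt format, and feedback shape are hypothetical placeholders (the post does not publish the actual implementation); they only illustrate the structure of the loop.

```python
# Schematic sketch of a generate-verify-refine loop with a fixed time budget.
# All names below (generate_kernel, verify_kernel) are hypothetical placeholders,
# not NVIDIA's actual implementation.
import time

TIME_BUDGET_S = 15 * 60  # the post reports improved kernels after about 15 minutes


def generate_kernel(prompt: str) -> str:
    """Ask a DeepSeek-R1 endpoint for kernel code (placeholder)."""
    raise NotImplementedError  # e.g., an HTTP call to a hosted R1 endpoint


def verify_kernel(kernel_code: str) -> tuple[bool, str]:
    """Build and run the kernel against a PyTorch reference on the GPU,
    returning (numerically_correct, feedback_text) (placeholder)."""
    raise NotImplementedError


def closed_loop(initial_prompt: str) -> str | None:
    prompt, best_kernel = initial_prompt, None
    deadline = time.monotonic() + TIME_BUDGET_S
    while time.monotonic() < deadline:
        kernel = generate_kernel(prompt)           # R1 drafts a candidate kernel
        correct, feedback = verify_kernel(kernel)  # verifier checks it on the GPU
        if correct:
            best_kernel = kernel                   # keep the latest correct kernel
        # Fold the verifier's analysis into the next prompt.
        prompt = (f"{initial_prompt}\n\nPrevious attempt:\n{kernel}\n\n"
                  f"Verifier feedback:\n{feedback}")
    return best_kernel
```

The key design choice is that the verifier, not the model, decides when a candidate is acceptable, so additional inference time translates directly into more verified attempts.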
