Example of GPU-accelerated scientific computing with PyTorch

This article explores how to use GPUs to accelerate the quantum mechanics calculations in the author's physics research. The author ports CPU code to the GPU with PyTorch, taking advantage of the GPU's parallel computing power. The article explains how the CUDA and Metal GPU APIs work and how PyTorch achieves GPU acceleration through them. Although CUDA's feature support is more complete than Metal's, the author got matrix determinant calculations working on an Apple GPU via JIT compilation. The article also describes how the CPU and GPU cooperate during GPU-accelerated computation and why using Python for GPU computing does not slow things down significantly. Finally, the author shares practical experience converting serial calculations into parallel ones, along with some details of PyTorch's MPS backend implementation.

💻 PyTorch on the GPU: The author ports the linear algebra calculations from his physics research to the GPU, using its parallel computing power to speed them up.

🧮 CUDA vs. Metal: CUDA is NVIDIA's API and Metal is Apple's; they differ in the GPU instructions and features they support, which is why PyTorch's feature coverage differs between GPU backends.

🔄 CPU-GPU cooperation: The CPU sends compute tasks to the GPU and collects the results, while the GPU performs the bulk of the parallel computation. Because data transfer is slow compared to computation, the overall balance of transfer and computation matters.

🛠️ The MPS backend: PyTorch's MPS backend targets Apple M-series GPUs but supports fewer operations than CUDA. Using JIT compilation, the author implemented the matrix determinant, which the MPS backend did not natively support.

🚀 Parallelization: Breaking a serial calculation into many parallel tasks and sending them to the GPU can dramatically speed it up. Python code is not necessarily the bottleneck in GPU computing, because the CPU only has to schedule the work.

Published on January 1, 2025 11:01 PM GMT

Here's a fun little post I made because a friend asked me how PyTorch could have things which were supported in the CUDA backend but not the MPS backend. I was once the sort of person who was on LessWrong, would have found the subject interesting, and did not already know everything in the post, so I'm posting it here to see if there's enough interest to post stuff like this here in the future. I'm worried that the average person reading LessWrong will find this post too elementary, but I'm trying to post more so I can practice writing before I have something interesting to say. Feedback appreciated.

As part of my physics research, I've been using some C++ code which performs quantum mechanics calculations. I need to do the same set of linear algebra calculations millions of times with independently generated random inputs, and it occurred to me that this sounded like the sort of thing GPUs would be good at. GPUs are designed to perform a lot of simple calculations at the same time. My old code is made to run on CPUs, which are good at doing a lot of complicated calculations one after another. For those who are unfamiliar, the recent AI revolution is largely the product of researchers turning a prediction algorithm based on linear algebra into many (relatively) smaller calculations that may all be run at the same time by a GPU. If you know how to multiply matrices and want some flavor for how things like ChatGPT are just linear algebra, this page seemed correct to me as of a quick read on Jan 1, 2025. Anyway, the important thing is that I knew that a bunch of engineers had put a lot of effort into making many large linear algebra calculations run quickly at the same time on GPUs, and it seemed a shame not to use that for physics. I've been itching to rewrite my research code to take advantage of modern machine learning tools for months, and I finally started doing it a few weeks ago. I decided to use PyTorch, because I have access to a lot of tools which make rapidly testing Python code easier than testing C++ code, and the research computers at my university already have Torch installed. Python is much, much slower than C++, but that shouldn't matter for reasons I explain below.
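
To make that concrete, here's a toy sketch of the difference between the serial pattern and the batched, GPU-friendly pattern. The sizes and the determinant call are placeholders for illustration, not my actual physics calculation.

```python
import torch

n_runs, dim = 10_000, 8   # made-up sizes, just to show the shape of the idea

# Serial, CPU-style: generate and solve one small independent problem at a time.
serial_results = []
for _ in range(n_runs):
    A = torch.randn(dim, dim)
    serial_results.append(torch.linalg.det(A))

# Batched, GPU-style: generate every random input up front, then let a single
# batched call work on all of them at once.
A_batch = torch.randn(n_runs, dim, dim)
batched_results = torch.linalg.det(A_batch)   # shape (n_runs,)
```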

So far it seems like it's working! I think I've figured out how to turn my calculation into operations that PyTorch knows how to batch together, but I need to fix the mathematical details of the code. The CUDA backend has support for parallelizing the discrete Fourier transforms, multiplications, and matrix determinants that make up the mathematical content of my calculation. My parallel code is not actually faster than the old linear code unless you feed it a calculation which in physics terms has many particles or many spatial points to keep track of, but that was the whole point for me. I wanted to be able to get large calculations back faster for testing purposes, even if it makes more sense to use the robust linear code when I perform the calculations that I intend to publish. PyTorch has support for the GPU in my laptop, so I was excited to throw the calculation at my laptop after I showed that it was incredibly fast on the research GPU I tested it on. It didn't work at first. If I turned off GPU acceleration and ran it on the CPU, it worked fine, but PyTorch told me that the function I wanted (matrix determinants!) was not supported on my GPU. My friend was confused when I complained about this, and he asked me why it was possible for PyTorch to do some simple calculations on one backend but not another. The short answer is that CUDA (NVIDIA) is an entirely different API from Metal (Apple), and the Torch team has had longer to rewrite their functions for CUDA than they have for Metal.
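
For flavor, the failure mode looks roughly like the sketch below. Exactly which operations are missing depends on your PyTorch version; on my setup the missing determinant surfaced as a NotImplementedError (your version may differ), and falling back to the CPU for just that one operation is the obvious workaround. The helper function here is made up for illustration.

```python
import torch

def batched_det(A: torch.Tensor) -> torch.Tensor:
    """Try the determinant on whatever device A lives on, falling back to the CPU."""
    try:
        return torch.linalg.det(A)
    except NotImplementedError:
        # The op isn't implemented for this backend (e.g. MPS in some versions),
        # so compute it on the CPU and move the result back to A's device.
        return torch.linalg.det(A.cpu()).to(A.device)
```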

CUDA isn’t necessarily machine code; it can be a set of API calls

Machine code is a set of ones and zeros which you can put directly on a processor to make it do useful stuff for you. An API is an interface for two machines to talk to each other. At the level of reality, machine code and API calls are both sequences of voltage shifts that you send on a wire to something. The primary difference is that machine code goes directly onto the registers of a processor and causes the processor to do stuff in order to give you an output. API calls are interpreted by a processor into machine code which then runs on some processor that gives you an output. It turns out that while the CUDA compiler can create machine code for NVIDIA GPUs, the code would be specific to a single type of GPU, so CUDA also includes an API for making generalized code which talks to any of its GPUs via drivers. Metal is the API Apple provides that allows you to write CPU code which can talk to any of its GPUs. When you write code using CUDA or Metal in their API forms, you run the code through a compiler which generates CPU machine code which makes API calls to GPU drivers which send machine code to the GPU. The CPU machine code also needs to interpret the output of the GPU into the answers to the calculations that you wanted. Moving data back and forth is much slower than calculation, so in practice, the output will often actually be the machine equivalent of "ok I did that and I'm holding onto the answer, so what should I do with it now?" and then the CPU and GPU go back and forth a few times, and the answer only goes back to the CPU, where it can be viewed, once the entire calculation is complete. I assume that the PyTorch team doesn't want to maintain many different versions of PyTorch, each compiled for every possible combination of popular CPU and GPU, so they use GPU APIs. This means that you install the version of PyTorch that works with your CPU (in practice, just install PyTorch with pip, and it will use its knowledge of what CPU it runs on to grab the right version of PyTorch), and that program will be able to talk to any NVIDIA or Apple GPU, including ones which come out after the version of PyTorch you're using was created.
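
In PyTorch, all of that API indirection shows up as a simple runtime check: the same installed package asks which GPU API it can reach on the current machine and places tensors accordingly. A minimal sketch:

```python
import torch

# Ask which GPU API this PyTorch build can reach on this machine.
if torch.cuda.is_available():            # an NVIDIA GPU reachable through CUDA
    device = torch.device("cuda")
elif torch.backends.mps.is_available():  # an Apple GPU reachable through Metal
    device = torch.device("mps")
else:
    device = torch.device("cpu")         # no GPU API found; use the CPU backend

x = torch.randn(3, 3, device=device)     # the data lives on whatever we found
print(x.device)
```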

The PyTorch library works within Python, which is a program that reads lines of text and executes them on the processor one line at a time (this program is an example of an "interpreter"). This is not a compiler, which takes a whole lot of lines of text and finds an efficient way to combine them all into machine code at once. Python code tends to run much slower than compiled code, because interpreting lines of code on the fly is slower than running machine code that was all compiled ahead of time. The CPU backend for PyTorch has some algorithms precompiled to run quickly on processors given a single Python command, and the compiler they used can turn source code into instructions for any supported CPU. That’s what compilers are for, so the CPU backend just works everywhere that PyTorch can run at all. CUDA is a set of instructions which allows CPUs to tell NVIDIA GPUs what to do, and Torch wraps up a bunch of GPU instructions which do certain tasks into functions which Python can use, but they specifically use the API version of CUDA, so you can’t send them to arbitrary GPUs. Metal is the API available for Apple M-series GPUs, and it’s probably possible to rewrite everything for them that works in CUDA, but it’s not like there are drop-in replacements between CUDA and Metal, so each function in PyTorch which knows how to make CUDA calls has to be rewritten to make Metal calls instead. This is implemented as PyTorch's MPS backend, which either performs tasks on M-series GPUs or apologetically tells you that the task isn't supported yet.
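
The practical upshot is that the same Python line dispatches to different precompiled kernels depending on where the tensor lives, which is how one slow interpreter can drive several fast backends. A tiny illustration (the second half assumes a machine with an Apple GPU):

```python
import torch

a = torch.randn(512, 512)
print((a @ a).device)              # cpu: Python dispatches to a precompiled CPU kernel

if torch.backends.mps.is_available():
    a_gpu = a.to("mps")
    print((a_gpu @ a_gpu).device)  # mps: the same Python line, but Metal calls underneath
```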

Talking about me again

That whole thing about how most of the time a CPU is sitting around waiting for the GPU to say it's ready for the next step is why I think I can use Python code for my enormous calculations. If everything I send to the GPU is bundled up into such a large task that it takes a second to run, then it doesn't matter whether I use C++ code which can run a million lines of code per second or Python code which can only run a hundred lines of code per second. I only need the CPU to talk to the GPU once per second. I found this page helpful when I was thinking about how to effectively accelerate calculations with GPUs. Based on my testing so far, I think I can take 2000 sets of 4 subtasks, which used to take a minute when I ran them all linearly on a CPU, and wrap them up into four batches of 2000 tasks, where each batch of 2000 is sent to the GPU at once and performed in parallel in a few seconds. Then I can run a calculation which used to take a minute in ten seconds or whatever.
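
Here's a sketch of that batching pattern with made-up sizes and a placeholder matrix multiplication standing in for my subtasks; the point is that the Python loop only issues four calls, and the GPU does everything else. GPU calls return before the work actually finishes, so you have to synchronize before trusting the clock.

```python
import time
import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

# Four batches of 2000 independent subtasks each, all generated up front.
batches = [torch.randn(2000, 32, 32, device=device) for _ in range(4)]

start = time.perf_counter()
results = [batch @ batch.transpose(-2, -1) for batch in batches]  # four dispatches total
if device.type == "mps":
    torch.mps.synchronize()   # wait for the GPU to actually finish before timing
print(f"{time.perf_counter() - start:.3f} s for all four batches")
```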

Unfortunately, PyTorch has, as far as I can tell, implemented the MPS backend based on what people actually use (or rather, which things people have asked them to make available for MPS) rather than based on which functions would be easy to implement given the ones they had already implemented. As a funny example, they support the function which does an LU decomposition and returns it in one array that looks like a matrix, but not the function which does an LU decomposition and returns literally the exact same numbers split up between two separate matrices with zeros in all of the extra slots. I doubt there’s any difference between those algorithms mathematically or on a processor, but I assume that formatting the arrays is nonzero effort and writing in all of the optional flags available to each function takes more effort. It took me literally three minutes to turn this LU function that was supported on my Mac GPU into a JIT-compiled determinant function which worked on my GPU, even though the native determinant function wasn't supported in the MPS backend. I won't actually use that function because it doesn't support complex numbers, and I think I can accelerate enough of the rest of my calculation that running the matrix determinants on the CPU won't slow me down much. I can even write my code so that I can get the next GPU task going while my CPU chews on matrix determinants.
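
The idea is simple: a packed LU factorization already contains the determinant, since det(A) is the product of U's diagonal times a sign set by the number of row swaps recorded in the pivots. Here is a minimal sketch of that kind of function (real dtypes only, and not my exact code), assuming torch.linalg.lu_factor is the packed-LU routine your backend supports:

```python
import torch

@torch.jit.script
def det_from_lu(A: torch.Tensor) -> torch.Tensor:
    # Packed LU factorization: L and U share one matrix, plus 1-indexed pivots.
    LU, pivots = torch.linalg.lu_factor(A)
    # det(U) is the product of the packed matrix's diagonal (L's diagonal is all ones).
    det_u = torch.diagonal(LU, dim1=-2, dim2=-1).prod(-1)
    # Every pivot that differs from its own row index records one row swap,
    # and each swap flips the sign of the determinant.
    rows = torch.arange(1, A.shape[-1] + 1, device=A.device)
    swaps = (pivots != rows).sum(-1)
    sign = 1.0 - 2.0 * (swaps % 2).to(A.dtype)
    return sign * det_u

# Works on a whole batch at once, e.g. det_from_lu(torch.randn(1000, 5, 5)).
```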

I thought other people might be interested in some of the details of how modern machine learning tools could be used for scientific research outside of the machine learning regime.


