Artificial Fintelligence · October 22, 2024

How is LLaMa.cpp possible?

Note: Substack doesn’t have great support for LaTeX, so you might want to read this article on my blog instead.

Recently, a project rewrote the LLaMa inference code in raw C++. With some optimizations and by quantizing the weights, the project allows running LLaMa locally on a wild variety of hardware:

    On a Pixel 5, you can run the 7B parameter model at 1 token/s.

    On an M2 Macbook Pro, you can get ~16 tokens/s with the 7B parameter model.

    On a 4GB RAM Raspberry Pi, you can run the 7B model, albeit at 0.1 tokens/s.

If you are like me, you saw this and thought: What? How is this possible? Don’t large models require expensive GPUs? I took my confusion and dove into the math surrounding inference requirements to understand the constraints we’re dealing with.


Let’s start with GPUs. GPUs have two main benefits for deep learning:

    They have a large amount of memory bandwidth (A100: 1935 GB/s, 4090: 1008 GB/s)

    They have a large amount of compute (A100: 312 TFLOPS of FP16, 4090: 82.6 TFLOPS of FP16)

When we talk about memory bandwidth, we’re talking about how long it takes to move things from the HBM memory (i.e. the RAM) into the on-chip memory. To actually do math with the GPU, we need to move the matrices in question into the on-chip memory, which is quite small (40MB on an A100, compared to 40-80GB of RAM). The memory bandwidth is ~2 orders of magnitude smaller than the compute performance— this will matter later, as the memory bandwidth tends to be the bottleneck for inference.
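To make that gap concrete, here is a quick back-of-envelope sketch in Python, using only the A100 numbers quoted above:

```python
# Back-of-envelope: how many FLOPs can an A100 do in the time it takes
# to stream one byte from HBM? Numbers quoted above: 1935 GB/s, 312 TFLOPS fp16.
mem_bandwidth = 1.935e12  # bytes/s
flops = 3.12e14           # FLOPs/s (fp16)

flops_per_byte = flops / mem_bandwidth
print(f"{flops_per_byte:.0f} FLOPs per byte moved")  # ~161

# Unless we do ~161 FLOPs of useful work per byte loaded from HBM,
# the compute units sit idle waiting on memory.
```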

What does this mean in the context of serving LLaMa? Let’s start with some inference arithmetic. We can do some rough calculations on the inference performance of an LLM using Kipply’s article.1 First, some notation on the dimensions of the model:

    d_model, the hidden dimension of the model

    n_layers, the number of transformer layers

    P, the total number of parameters

This gives us a handy equation for the number of parameters in a GPT-style model:2

P ≈ 12 * n_layers * d_model^2
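As a sanity check (in the spirit of footnote 2), we can plug the published LLaMa configurations into this formula; the (n_layers, d_model) pairs below are the ones reported for the LLaMa family, used here as assumptions:

```python
# Sanity check of P ~= 12 * n_layers * d_model^2 against the published
# LLaMa configurations (n_layers, d_model).
llama_configs = {
    "7B":  (32, 4096),
    "13B": (40, 5120),
    "33B": (60, 6656),
    "65B": (80, 8192),
}

for name, (n_layers, d_model) in llama_configs.items():
    approx_params = 12 * n_layers * d_model ** 2
    print(f"{name}: ~{approx_params / 1e9:.1f}B parameters")
# 7B: ~6.4B, 13B: ~12.6B, 33B: ~31.9B, 65B: ~64.4B -- close to the nominal sizes.
```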

For the duration of the post, I’m going to focus on the case where we’re running a ChatGPT style service locally, which is what LLaMa.cpp does, letting me assume a batch size of 1.

For efficient inference, the KV cache has to be stored in memory; the KV cache requires storing the KV values for every layer, which is equal to storing:

n_bytes * 2 * d_model

per token, for each layer.

I use n_bytes here to indicate the number of bytes per param; for float32s, this is 4, for float16s, this is 2, etc. The 2 in the middle is because we have to store one set of cached values for the Ks and one for the Vs.

Given a model with n_layers layers, the total memory for the KV cache is:

n_bytes * 2 * n_layers * d_model

bytes for every token of context.
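As a worked example, here is a rough sketch of the KV cache size for the 65B model (n_layers = 80, d_model = 8192), assuming the cache is kept at the same precision as the weights and LLaMa's full 2048-token context:

```python
# Rough KV cache sizing for LLaMa-65B (n_layers=80, d_model=8192), assuming the
# cache is stored at the same precision as the weights and a 2048-token context.
n_layers, d_model, n_ctx = 80, 8192, 2048

for label, n_bytes in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    per_token = n_bytes * 2 * n_layers * d_model  # bytes per token of context
    total = per_token * n_ctx                     # bytes for the full context
    print(f"{label}: {per_token / 2**20:.2f} MiB/token, {total / 1e9:.2f} GB total")
# int4 comes out to ~0.62 MiB/token and ~1.34 GB for a full context,
# roughly the ~1.33GB figure used further down.
```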

In addition to storing the KV cache in memory, we also need to store the weights themselves in memory; this requires n_bytes * P bytes.

This is one of the key advantages of quantization. By using less precision, we can radically decrease the amount of memory needed to store our models in memory. Note that, with int4 precision, all of the LLaMa models fit into memory on an A100 (which is the standard datacenter GPU right now), and all of them, except for the biggest model, fit into memory on high-end consumer GPUs (3090s/4090s, which have 24GB of RAM).
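To make that concrete, here is a minimal sketch that computes n_bytes * P for each LLaMa model at each precision (the parameter counts are the published approximate ones):

```python
# Memory needed just for the weights (n_bytes * P) for each LLaMa model
# at each precision. Parameter counts are the published approximate ones.
param_counts = {"7B": 6.7e9, "13B": 13.0e9, "33B": 32.5e9, "65B": 65.2e9}
precisions = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

for name, p in param_counts.items():
    row = ", ".join(f"{prec}: {p * n_bytes / 1e9:5.1f}GB"
                    for prec, n_bytes in precisions.items())
    print(f"{name:>3}: {row}")
# At int4, even the 65B model needs only ~33GB of weights, which fits on an
# 80GB A100; the 7B/13B/33B models fit in the 24GB of a 3090/4090.
```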

Now, when it comes to actually running inference, it takes approximately 2P FLOPS per token, because we are doing a bunch of matmuls with a total of P parameters, and multiplying a matrix of size (m, n) with a vector of size (n,) has a cost of 2mn.3
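Here is a small sketch of that accounting for a single GPT-style layer (LLaMa's MLP is shaped slightly differently, and this ignores the attention-score matmuls, biases, and embeddings, but the point is that every parameter is touched once per token at a cost of 2 FLOPs):

```python
# Why ~2P FLOPs per token: each weight matrix of shape (m, n) is used in one
# matrix-vector product per token, costing 2*m*n FLOPs, so FLOPs ~= 2 * params.
d_model = 4096  # e.g. the 7B model

layer_matrices = [
    (d_model, d_model),      # Q projection
    (d_model, d_model),      # K projection
    (d_model, d_model),      # V projection
    (d_model, d_model),      # attention output projection
    (d_model, 4 * d_model),  # MLP up-projection
    (4 * d_model, d_model),  # MLP down-projection
]

params = sum(m * n for m, n in layer_matrices)     # 12 * d_model^2
flops = sum(2 * m * n for m, n in layer_matrices)  # 24 * d_model^2
print(params, flops, flops / params)               # the ratio is exactly 2.0
```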

With all that math out of the way, let’s calculate the requirements for running inference with LLaMa. The main requirements when it comes to sampling are:

    Keep the KV cache in memory, in addition to all the parameters.

    Read all the weights from HBM into the on-chip memory. Because we sample auto-regressively, we have to repeat this for each token we sample.

    Do the actual matmuls to calculate the output of our network.

The latency is the maximum of either the compute or the memory latency, as reading parameters into on-chip memory happens asynchronously in all modern tensor programming libraries. As a result, we write:

latency_model = max(latency_compute, latency_memory)

latency_compute = B * 2 * P / n_flops

latency_memory = n_bytes * P / n_memory_bandwidth

where B is the batch size, n_flops is the accelerator’s compute, and n_memory_bandwidth is its memory bandwidth in bytes/s. As the A100’s memory bandwidth is ~1.935e12 bytes/s and its fp16 compute is ~3.12e14 FLOPS, with fp16 weights (n_bytes = 2) the model is memory-bound as long as the batch size is less than 161.

With a batch size of 1, the condition is unchanged at lower precision: on most hardware (e.g. Nvidia GPUs), there is a linear speedup as you decrease the precision (you get twice the FLOPS when using fp16 vs fp32, which doubles again as you go to int8, and doubles once more as you go to int4s), so the drop in n_bytes is matched by the rise in FLOPS, the model stays memory-bound, and the per-token latency shrinks linearly with n_bytes.

As LLaMa.cpp uses int4s, the RAM requirements are reduced to 1.33GB of memory for the KV cache, and 16.25GB of VRAM for the model parameters. That’s pretty good!

As the memory bandwidth is almost always4 much smaller than the number of FLOPS, memory bandwidth is the binding constraint.

Note that the FLOPs per token and the bytes of memory traffic per token are both proportional to P, as we have to 1) load all of the parameters into on-chip memory and then 2) use every parameter in a matmul to compute the results. These happen simultaneously, as all modern tensor programming frameworks are able to handle the “loading into memory” bit asynchronously, so the total time required is max(compute time, memory time).
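Putting this together, here is a minimal sketch of the latency model described above (the function and variable names are mine, not LLaMa.cpp's); with the A100 fp16 numbers it reproduces the batch-size-161 crossover:

```python
def token_latency(p, n_bytes, batch_size, n_flops, mem_bandwidth):
    """Per-token latency in seconds under the model above: the max of compute
    time and the time to stream all the weights from HBM."""
    latency_compute = batch_size * 2 * p / n_flops
    latency_memory = n_bytes * p / mem_bandwidth
    return max(latency_compute, latency_memory)

# A100 numbers from above: 1935 GB/s of bandwidth, 312 TFLOPS of fp16.
n_flops, mem_bandwidth = 3.12e14, 1.935e12

# Critical batch size where compute latency catches up with memory latency:
# B * 2P / n_flops == n_bytes * P / bw  =>  B = n_flops * n_bytes / (2 * bw)
n_bytes = 2  # fp16
print(n_flops * n_bytes / (2 * mem_bandwidth))  # ~161, matching the number above

# Example: the 7B model (P ~= 6.7e9) at fp16, batch size 1 -- heavily memory-bound.
lat = token_latency(6.7e9, n_bytes, 1, n_flops, mem_bandwidth)
print(f"{lat * 1e3:.1f} ms/token -> ~{1 / lat:.0f} tokens/s upper bound")
```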

Running LLaMa on an A100

On an A100 (80GB PCIe), the memory bandwidth is 1935GB/s. The int4 compute is 1248 TOPS. As such, the model is (heavily) memory-bound. We should expect to see roughly 30 tokens/s with the 65B model, and 277 tokens/s with the 7B model.

Running LLaMa on a Macbook

The M1 GPU has a memory bandwidth of 68.25 GB/s and can do up to 5.5 TFLOPS of fp16 compute. As such, we should expect a ceiling of ~1 token/s for sampling from the 65B model with int4s, and 10 tokens/s with the 7B model.

As the M2 Pro has 200 GB/s of bandwidth, and the M2 Max has 400 GB/s, we should expect massive improvements with them, going up to 6 tokens/s for the 65B model on the M2 Max. That’s pretty darn good for a laptop.
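Since we are memory-bound everywhere here, tokens/s should scale roughly linearly with memory bandwidth; a small sketch that scales the A100 estimates from the previous section by bandwidth ratio recovers these ceilings:

```python
# In the memory-bound regime, tokens/s scales roughly linearly with memory
# bandwidth, so we can scale the A100 estimates from the previous section
# by each chip's bandwidth ratio.
a100_bandwidth = 1935  # GB/s
a100_tokens_per_s = {"65B": 30, "7B": 277}  # from the A100 section above

for chip, bandwidth in [("M1", 68.25), ("M2 Pro", 200), ("M2 Max", 400)]:
    scaled = {model: round(t * bandwidth / a100_bandwidth, 1)
              for model, t in a100_tokens_per_s.items()}
    print(f"{chip}: {scaled}")
# M1: ~1 token/s (65B) and ~10 tokens/s (7B); M2 Max: ~6 tokens/s (65B) --
# matching the ceilings quoted above.
```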

Running LLaMa on a Raspberry Pi 4

A Raspberry Pi 4 has 13.5 GFLOPS of compute, and ~4GB/s of memory bandwidth. Given this, we’d expect to see ~2 tokens/s with the 7B model if it were memory-bound. Given that we’re currently seeing ~0.1 tokens/s, I suspect we’re actually compute-bound (although this is a stab in the dark; I can’t find enough information about the Raspberry Pi’s specs to determine this with any precision).


Summary

Memory bandwidth is the limiting factor in almost everything to do with sampling from transformers. Anything that reduces the memory requirements for these models makes them much easier to serve— like quantization! This is yet another reason why distillation, or just training smaller models for longer, is really important.

Note: I’m not an expert in CUDA, so I probably have errors in my math. If so, please let me know— I’ll update the post and credit you.

Resources on transformer inference performance:

Thank you to Kaushik Patnaik, immortal_333, and Arthur Allshire for reading & commenting on early drafts of this, and Salim Fakhohuri + Shuming Hu for pointing out errors in my math.

Errors that have been corrected from earlier versions:

    I was missing the batch term in the latency_compute equation.

    I had an extra factor of 2 in the latency_memory equation.

1

I learned almost all of the math surrounding transformer performance from their article; they deserve full credit.

2

Although we obviously don’t need to calculate the number of parameters for the LLaMa models, as we know them. The equation is useful as a sanity check.

3

For a more detailed discussion showing that this is the case, check out kipply’s article.

4

I hedge with “almost” here, but I’m not aware of any counterexamples.
