EnterpriseAI, October 18, 2024
AI Exascale: In This Club, You Must ‘Earn the Exa’

There have been some recent press releases and headlines with the phrase “AI Exascale” in them. Other than flaunting the word exascale or even zettascale, these stories do not provide enough information to justify using the term exascale. Those in the HPC community understand words like exascale, which means the computer in question has achieved an exaFLOPS of sustained performance. For the newbies, the prefix exa is shorthand for 10^18, and FLOPS is Floating Point Operations Per Second (things like addition and multiplication). An exaFLOP is one quintillion floating point operations per second.

More specifically, per Wikipedia, “Exascale computing refers to computing systems capable of calculating at least 10^18 IEEE 754 Double Precision (64-bit) operations (multiplications and/or additions) per second (exaFLOPS).”
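To get a sense of that scale, a quick back-of-the-envelope comparison helps. The short sketch below is an illustration only; the assumed 100 gigaFLOPS of sustained FP64 for a laptop is a round-number guess, not a measurement.

```python
# Back-of-the-envelope: one second of exascale work versus a laptop.
# The laptop rate below is an assumed round number, not a measured value.
EXASCALE_FLOPS = 1e18       # FP64 operations per second for an exascale system
LAPTOP_FLOPS = 1e11         # assumed ~100 gigaFLOPS sustained FP64 on a laptop

seconds = EXASCALE_FLOPS / LAPTOP_FLOPS      # 1e7 seconds
days = seconds / (60 * 60 * 24)

print(f"One exascale-second of work would keep the laptop busy for ~{days:.0f} days")
```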

Measuring the FLOPS rate of a system requires running the open-source High-Performance LINPACK benchmark program. There are other programs that measure the FLOPS rate, but HPLinpack has a historical record dating back to 1993. As a matter of fact, there is a list, The TOP500, updated twice a year, that reports the performance of this benchmark using double-precision FLOPS. Why double precision? Well, that is what gives the best answers for many of the numerical problems that these huge systems solve.

Currently, there are two systems worldwide that are in the exascale club according to the TOP500 list:

1. Frontier from DOE/SC/Oak Ridge National Laboratory reaching 1.206 ExaFLOPS with a theoretical peak of 1.715 ExaFLOPS

2. Aurora from DOE/SC/Argonne National Laboratory reaching 1.012 ExaFLOPS with a theoretical peak of 1.980 ExaFLOPS

A few things to note about these numbers. First, for each machine, the theoretical peak performance is the sum of the maximum performance rate of each component of the system; that is, if every component were running full-tilt with no regard for any underlying application (in the TOP500, this is reported as Rpeak). In practice, theoretical rates are never achieved because, in technical terms, “there is other stuff going on” in real applications (a short calculation after these notes shows how large the gap can be).

Second, the maximally achieved performance (Rmax as reported by the TOP500 list) is measured using the HPLinpack benchmark. Other benchmarks or applications may squeeze more FLOPS out of a machine, but HPLinpack is used because it has a long historical record and can be used as a standard yardstick.

Finally, other large machines may have chosen not to run the benchmark or not to submit their results to the TOP500. Other exascale-class machines are under construction, so the club will expand.
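To make the Rpeak/Rmax gap concrete, the HPLinpack efficiency of the two current exascale systems can be computed directly from the TOP500 numbers quoted above (a simple sketch using only the figures listed earlier):

```python
# HPLinpack efficiency (Rmax / Rpeak) for the two current exascale systems,
# using the TOP500 figures quoted above, in exaFLOPS.
systems = {
    "Frontier": {"rmax": 1.206, "rpeak": 1.715},
    "Aurora":   {"rmax": 1.012, "rpeak": 1.980},
}

for name, perf in systems.items():
    efficiency = perf["rmax"] / perf["rpeak"]
    print(f"{name}: Rmax {perf['rmax']} EFLOPS, "
          f"Rpeak {perf['rpeak']} EFLOPS, efficiency {efficiency:.0%}")
```

Frontier reaches roughly 70 percent of its theoretical peak on HPLinpack and Aurora roughly half, which is exactly the “other stuff going on” gap described above.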

In addition, the HPC community also recognizes the emerging convergence of high-performance computing (HPC) and artificial intelligence (AI) workloads. While traditional “TOP500 HPC” machines have focused on computing for modeling phenomena in physics, chemistry, and biology, the mathematical models that drive these computations require, for the most part, 64-bit accuracy. On the other hand, the machine learning methods used in AI achieve the desired results using 32-bit and even lower floating-point precision formats. There is a mixed-format benchmark, HPL-MxP, that is being used to evaluate new mixed-mode (HPC & AI) systems.
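The idea behind HPL-MxP is to do the heavy factorization work in lower precision and then recover an FP64-quality answer with iterative refinement. The sketch below is a minimal NumPy illustration of that refinement loop under simplifying assumptions (a toy dense solve that re-solves instead of reusing a factorization); it is not the HPL-MxP benchmark itself.

```python
import numpy as np

# Toy illustration of mixed-precision iterative refinement (the idea behind
# HPL-MxP), not the benchmark itself: solve in FP32, correct with FP64 residuals.
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix
b = rng.standard_normal(n)

# Low-precision solve (stand-in for the FP32 factorization)
x = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)

# Refinement loop: residual computed in FP64, correction solved in FP32
for _ in range(3):
    r = b - A @ x
    dx = np.linalg.solve(A.astype(np.float32), r.astype(np.float32))
    x += dx.astype(np.float64)

print("FP64 residual norm after refinement:", np.linalg.norm(b - A @ x))
```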

Running and managing these systems is not trivial. These systems are the pinnacle of high-performance computation. They are designed, built, and tested with the best technology available.

Nip This in the Bud

Given the current understanding and consensus on the meaning of exascale, one can certainly understand the surprise when recent announcements touted “exascale” and even “zettascale” (10^21 FLOPS) systems based on the Nvidia Blackwell GPU. Sure, the Blackwell GPU is a powerhouse of SIMD computation for both HPC and AI applications, but tagging it with unmeasured and contrived performance metrics is disingenuous, to say the least.

One must ask, how do these “snort your coffee” numbers arise from unbuilt systems? The process of beating the world’s fastest machine, cough, on paper, cough, is actually very simple; first, however, we need to take a short detour and talk about floating point numbers.

A Few Bits About Floating Point Format

Representing numbers in computers is a tricky task. Computers are finite and thus cannot represent all possible numbers. In scientific computing, applications are programmed using Floating Point or FP for short.

Two basic types of FP numbers are used in scientific and technical computing. These numbers are measured by the number of bits (ones and zeros) used to represent a number. The 32-bit single-precision (FP32) type represents magnitudes from about 1.18E-38 up to about 3.40E+38, with roughly seven decimal digits of precision; the 64-bit double-precision (FP64) type represents magnitudes from about 2.23E-308 up to about 1.80E+308, with roughly fifteen decimal digits of precision.

Values too large or too small for these ranges will cause an error.
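These limits are easy to inspect; NumPy's finfo reports them for each IEEE format, and the short sketch below also shows an FP32 value overflowing to infinity.

```python
import numpy as np

# IEEE 754 limits for single (FP32) and double (FP64) precision
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{dtype.__name__}: max={info.max}, smallest normal={info.tiny}, eps={info.eps}")

# A value too large for FP32 overflows to infinity
too_big = np.float32(3.0e38) * np.float32(2.0)
print("FP32 overflow:", too_big)   # inf (NumPy may also emit an overflow warning)
```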

The representations of a 32-bit single-precision and a 64-bit double-precision floating point number are shown in the figures below:

FP32 Floating point format for the number 0.15625 (Source: Wikipedia)

FP64 Floating point format (Source: Wikipedia)

Almost all HPC calculations use FP64 (or a combination of FP32 and FP64) because answers are more useful when they have more precision. Better precision makes for better results, but the computation takes more time because double-precision math requires twiddling two 64-bit numbers to get a third 64-bit number. There are tested and optimized libraries for CPUs that use single and double precision to do complex math. GPU vendors also provide single- and double-precision math libraries. For HPC systems, top performance is always measured in double precision.
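A small, deterministic example of the precision difference (a toy illustration, not an HPC kernel): FP32 carries roughly seven good decimal digits, so it cannot even see an increment that FP64 handles easily.

```python
import numpy as np

# FP32 keeps about seven significant decimal digits, FP64 about fifteen to sixteen.
print(f"FP32 1/3: {np.float32(1.0) / np.float32(3.0):.20f}")   # correct to ~7 digits
print(f"FP64 1/3: {np.float64(1.0) / np.float64(3.0):.20f}")   # correct to ~16 digits

# An increment that FP64 can resolve but FP32 silently loses
tiny = 1e-8
print("FP32:", np.float32(1.0) + np.float32(tiny) - np.float32(1.0))   # 0.0
print("FP64:", np.float64(1.0) + np.float64(tiny) - np.float64(1.0))   # ~1e-8
```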

Enter GenAI. The point of GenAI and LLMs is to create (train) models and determine “weights” using massive amounts of data. Once a model is trained, these weights are used to steer the model when it is queried (inference). These weights are trained (calculated) with high precision and require large amounts of memory and computation to traverse. One trick used with LLMs is called quantization, where the precision of the weights is reduced. In many cases, the model will behave the same with lower precision weights, thus reducing the computational requirements to run the model (the models that can be downloaded from Hugging Face and run on your laptop have been quantized).
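As a minimal sketch of what quantization does, the toy example below maps a handful of FP32 weights onto sixteen signed integer levels (a simple symmetric scheme chosen for illustration; real quantizers used for the downloadable models are more elaborate):

```python
import numpy as np

# Toy symmetric 4-bit quantization: map FP32 weights to 16 integer levels
# (-8..7), then reconstruct them with a single scale factor.
rng = np.random.default_rng(42)
weights = rng.standard_normal(8).astype(np.float32)

scale = np.abs(weights).max() / 7.0
codes = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
dequantized = codes.astype(np.float32) * scale

print("original:     ", np.round(weights, 3))
print("4-bit codes:  ", codes)
print("reconstructed:", np.round(dequantized, 3))
print("max abs error:", float(np.abs(weights - dequantized).max()))
```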

In the quantization game, less is often better. For this reason, many new lower-precision formats have been introduced for Generative AI (FP16, BFLOAT16, and FP8). The most recent and possibly the smallest is the FP4 format. That is correct: 4 bits to represent a floating point number.

The format of these low-precision numbers is far from settled. A recent post on X/Twitter by user @fclc@mast.hpc.social commented on yet another FP8 format: “Unless I’ve missed one, *excluding* block floats, this puts us at 18 total FP8 formats.”

Returning to the FP4 format. For those who did not pay attention in computer science class, 4-bits only provide sixteen possible numbers or levels of difference for a weight. FP4 is the smallest possible float size that follows all IEEE principles, including normalized numbers, subnormal numbers, signed zero, signed infinity, and multiple NaN values. It is a 4-bit float with 1-bit sign, 2-bit exponent, and 1-bit mantissa. All the possible numbers ranging from -3 to 3 are shown in the table below. The columns have different values for the sign and mantissa bits, and the rows have different values for the exponent bits.

FP4 Format with all possible numbers. (Source: Wikipedia)
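A dozen lines of code can enumerate that table. The sketch below decodes all sixteen encodings of the 1-bit sign, 2-bit exponent, 1-bit mantissa format described above, assuming an exponent bias of 1 (this follows the Wikipedia-style IEEE minifloat; the FP4 variants shipped by GPU vendors may differ in details such as dropping infinity and NaN).

```python
BIAS = 1  # exponent bias for the 2-bit exponent (assumed, per the IEEE-style minifloat)

def decode_fp4(bits: int) -> float:
    """Decode one 4-bit minifloat encoding: 1 sign bit, 2 exponent bits, 1 mantissa bit."""
    sign = -1.0 if (bits >> 3) & 1 else 1.0
    exponent = (bits >> 1) & 0b11
    mantissa = bits & 0b1
    if exponent == 0b11:                      # all-ones exponent: infinity or NaN
        return sign * float("inf") if mantissa == 0 else float("nan")
    if exponent == 0:                         # subnormal numbers (and signed zero)
        return sign * (mantissa / 2) * 2.0 ** (1 - BIAS)
    return sign * (1 + mantissa / 2) * 2.0 ** (exponent - BIAS)

# Print all 16 encodings; the finite values range from -3 to 3.
for bits in range(16):
    print(f"{bits:04b} -> {decode_fp4(bits)}")
```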

Spec Sheet Supercomputing

FP4 is a good optimization for Generative AI and can speed up inference with quantized models. The Nvidia Blackwell Architecture Technical Brief lists the Tensor core FP4 rate at a theoretical 20 petaFLOPS for dense matrices. And here is the tip-off for the recent exascale and even zettascale announcements.

Recently, there was an announcement that a 90 exaFLOPS machine was being built using 4,608 Nvidia Blackwell GPUs. Simple math: 20 petaFLOPS x 4,608 GPUs equals 92,160 petaFLOPS, or about 92 exaFLOPS. And BAM! We have an exaFLOPS machine. The “AI exaFLOPS” moniker does not matter because no AI was run to obtain this number.

Similarly, a zettascale machine was announced that used 131,072 Nvidia Blackwell GPUs. Again, 20 petaFLOPS x 131,072 GPUs equals 2,621,440 petaFLOPS, or roughly 2.6 zettaFLOPS. And BAM! We have a zettaFLOPS machine. Again, calling it “AI zettaFLOPS” is silly because no AI was run on this unfinished machine.

More impressive than these Blackwell machines is the worldwide smartphone supercomputer. Assuming all phones are connected by a phone number over a worldwide network, their computation, in theory, could be combined. As of June 2024, there are approximately 7.2 billion (7.2 x 10^9) smartphones in the world, and the average cell phone processor can run at approximately ten teraFLOPS (10 x 10^12) of single precision. Using the “add the FLOPS” method creates a smartphone supercomputer well past 10^21 FLOPS, or zettaFLOPS territory. BAM! zettaFLOPS for everybody.
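For completeness, the “add the FLOPS” arithmetic from the last three paragraphs fits in a few lines (every input is a spec-sheet peak or a rough population estimate quoted above, not a measured value):

```python
# "Add the FLOPS" spec-sheet arithmetic from the paragraphs above.
# Every number here is a peak rate or a rough estimate, not a measurement.
FP4_FLOPS_PER_BLACKWELL = 20e15          # 20 petaFLOPS dense FP4 (spec sheet)

print("4,608 GPUs:  ", FP4_FLOPS_PER_BLACKWELL * 4_608 / 1e18, "exaFLOPS")      # ~92
print("131,072 GPUs:", FP4_FLOPS_PER_BLACKWELL * 131_072 / 1e21, "zettaFLOPS")  # ~2.6

# The worldwide smartphone "supercomputer"
phones = 7.2e9                           # smartphones in use, June 2024 (approximate)
flops_per_phone = 10e12                  # ~10 teraFLOPS FP32 per phone (rough guess)
print("All phones:  ", phones * flops_per_phone / 1e21, "zettaFLOPS")            # well past 1
```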

Of course, some minor details need to be worked out before we land on the TOP500 list. By the way, if your phone starts getting hot for no reason, it might be an HPLinpack run or, more likely, some crypto mining running as part of the fun new app you just downloaded.

The Requisite Car Analogy

Every good argument needs a car analogy. In the case of FP4 computing, it goes something like this. The average double-precision FP64 car weighs about 4,000 pounds (1,814 kg). It is great at navigating terrain, holds four people comfortably, and gets 30 MPG. Now, consider the FP4 car, which has been stripped down to 250 pounds (113 kg) and gets an astounding 480 MPG.

Great news. You have the best gas mileage ever! Except, you don’t mention a few features of your fantastic FP4 car.

First, the car has been stripped of everything except a small engine and maybe a seat. What’s more, the wheels are 16-sided (2^4) and provide a bumpy ride compared to the smooth FP64 sedan ride on wheels with somewhere around 2^64 sides. There may be places where your FP4 car works just fine, like cruising down Inference Lane, but it will not do well heading down the FP64 HPC highway. Different strokes for different folks.

Going Forward

The spec sheet exascale numbers are often reported as “AI exaFLOPS,” which does not justify the use of the term exascale. To get into the exascale club, you need to supply the bouncers with the following information:

1. The hardware and application that were used to measure the FLOPS rate
2. The precision of the floating point format used in the measurement (e.g., FP64, FP32, FP6, FP4)

It is good form to refer to non-computed numbers (spec sheet summations) as “theoretical peak” for a specific precision, but this will not get you in the club. Fuzzing things up with “AI FLOPS” will not help either. The Nvidia Blackwell is a blazingly fast GPU, and providing actual measured numbers with the simple details mentioned will easily get you into the club.
