少点错误 2024年08月10日
Fermi Estimating How Long an Algorithm Takes
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章探讨通过估算算法执行时间来优化程序,以一个素数测试算法为例,分析其执行过程及优化方法。

文章首先介绍了一个简单的素数测试算法modified_sieve,通过分析指出大部分时间会花费在计算模上,估算在现代CPU上运行该算法的时间。实际运行结果显示,实际程序运行时间比估算的长,因为存在未考虑的开销。

接着提出算法改进,只需检查到√i的素数,改进后的算法modified_sieve2理论上应快100倍,但实际只快了3.5倍。分析原因是每次循环添加的过滤器增加了工作量,且使CPU更难预测需进行的计算。

最后再次改进算法modified_sieve3,通过单独跟踪需测试的素数并按需更新,使模计算更具可预测性,最终获得了显著的速度提升。文章强调通过估算发现潜在优化空间的重要性。

Published on August 10, 2024 1:34 AM GMT

This is a "live" Fermi estimate, where I expect to make mistakes and haven't done much editing.

If you're working on a simple mathematical program, you can attempt to estimate the number of floating point or integer operations, compare that to statistics on CPU latencies/throughput, and use that to get a rough estimate.

Let's try estimating how long it would take to run a simple primality test, a modified Sieve of Eratosthenes:

function modified_sieve(n)  known_primes = [2]  for i in 3:2:n    mods = [i % p for p in known_primes]    if all(mods .> 0)      push!(i, known_primes)    end  end  return known_primesend

In this algorithm, we store a list of known prime numbers. We iterate through odd numbers, and check whether any of them are divisible by a known prime, by calculating the modulus against every known prime in our list. If it's not divisible by any of them, we add it to our list of primes.

There's a lot we could do to come up with a better algorithm, but what I'm interested in is estimating how fast an ideal execution of this algorithm could be. How long could we expect this to run, on a modern CPU?

I expect, looking at this, that the vast majority of the time is going to be spent on mods = [i % p for p in knownprimes]. Calculating a modulus is something like 30-50 cycles according to the few google searches I found, and everything else should be significantly less than that. 

The number of mods we need to calculate for any i is the number of known primes, which is roughly . Let's say we're getting all the primes until 100,000. Then we should need to calculate about <span class="mjx-math" aria-label="\Sigma{i=3}^{99,999}\frac{i}{log(i)}">, which is about 200 million mods. Times 30-50 cycles is about .6-1 billion cycles. On a 3gHz processor, that should take about a 200-333 milliseconds.

What do we actually see?

using BenchmarkTools@btime modified_sieve(100_000)>    352.089 ms (240255 allocations: 2.12 GiB)

That...shouldn't actually happen. Actual programs usually take 3x-10x as long as my estimates, because there's a lot of overhead I'm not accounting for. This suggests that there's some optimization happening that I didn't expect. And looking at the generated code, I do see several references to simdloop. Ok then. That just happened to cancel out the overhead.

Let's try an algorithmic improvement. We don't actually need to check against all primes – we only have to go up to  each time. That brings us down to an estimated 2 million mods, which in theory should be 100 times faster. 

function modified_sieve2(n)  known_primes = [2]  for i in 3:2:n    mods = [i % p for p in known_primes if p^2 < i]    if all(mods .> 0)      push!(known_primes, i)    end  end  return known_primesend@btime modified_sieve2(100_000)>   104.270 ms (336536 allocations: 83.04 MiB)

Hmm. It's faster, but nowhere near as much as expected – 3.5x vs. 100x. Adding the filter each loop probably messes us up in a couple ways: it adds a fair bit of work, and it makes it harder for the CPU to predict which calculations need to happen. Let's see if we can get a speed up by making the primes to be checked more predictable:

function modified_sieve3(n)  known_primes = [2]  small_primes = Int[]  next_small_prime_index = 1  for i in 3:2:n    if known_primes[next_small_prime_index]^2 <= i            push!(small_primes, known_primes[next_small_prime_index])      next_small_prime_index += 1          end    mods = [i % p for p in small_primes]    if all(mods .> 0)      push!(known_primes, i)    end  end  return known_primesend

Now we added a bunch of extra code to make the mod calculations more predictable: we track the primes that need to be tested separately, and only update them when needed. And it pays off:

@btime modified_sieve3(100_000)>    9.291 ms (150010 allocations: 25.87 MiB)

This is a major benefit of making Fermi estimates. We predicted a possible 100x speedup, and got only 3.5x.  That suggested that there was more on the table, and were able to get a further 10x. If I hadn't made the estimate, I might have thought the filter version was the best possible, and stopped there.

And I find this to translate pretty well to real projects – the functions where I estimate the greatest potential improvements are the ones that I actually end up being able to speed up a lot, and this helps me figure out whether it's worth spending time optimizing a piece of code.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

算法估算 程序优化 素数测试
相关文章