Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks

This post examines the value of AI models on long-horizon tasks. While gains in a model's single-step accuracy are only log-linear in data and compute, the resulting improvement on tasks requiring multi-step reasoning compounds with task length. The author argues that as AI is applied to long-horizon tasks such as automated coding, jobs, and research, inference compute will come to outweigh pretraining compute, and the single-step accuracy gains bought with pretraining compute will translate into near-linear growth in k-step success rate. Through a short mathematical derivation, the post shows that the relative gain in k-step success rate grows exponentially with the task horizon length, which helps explain AI's improving ability on long-horizon tasks.

💡 Single-step accuracy vs. k-step success rate: the post argues that log-linear gains in single-step accuracy translate into exponentially compounding gains in success rate on k-step tasks, so a model's performance on multi-step tasks benefits disproportionately from single-step accuracy improvements.

📈 Mathematical derivation: assuming each step succeeds independently and all steps must be correct for a k-step task to succeed, the author derives that the gain in k-step success rate grows exponentially with the horizon length k, underscoring the value of accuracy improvements on long-horizon tasks.

💻 Applications and economics: as AI is applied to long-horizon tasks such as automated coding, jobs, and research, inference compute will come to matter more than pretraining compute; the single-step accuracy gains from pretraining compute will then yield near-linear improvements in k-step success rate, and with them economic value.

Published on April 7, 2025 9:50 PM GMT

This post makes a simple point, so it will be short. I am happy to discuss more in the comments and, based on that, to write a longer post later. Much prior work (e.g., [1]) has shown that exponentially more data and compute are required for each unit improvement in accuracy. A popular argument this leads to:

Scaling compute and data is not economically viable beyond a threshold

This argument has a key issue. Log-linear gains in accuracy (or loss) are shown on "one-step" benchmarks such as direct QA. However, economic benefits primarily arise from tasks that require long thinking, output, and action horizons. For these, I propose tracking the k-step success rate, i.e., the probability that a k-step solution achieves its goal. Most benchmarks previously analyzed for log-linear scaling have at most a few (a small constant number of) steps. For simplicity, let us assume this constant is 1; this does not affect the rest of the argument (which rests on asymptotic calculations in k) beyond a constant factor.

We can make two simplifying assumptions to connect log-linear scaling in 1-step accuracy to the k-step success rate. (1) Each step succeeds independently, with a probability related to the 1-step accuracy by a constant. (2) All steps need to be correct for the full k-step solution to succeed. This holds for many useful tasks, such as solving math problems or an agent executing a task on the web. While recovering from failures is possible, the recovery steps might violate Assumption (1), so it is best to set that aside for now.
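
Under these two assumptions, the k-step success rate is just the 1-step success probability raised to the k-th power. A minimal Monte Carlo sketch in Python (the numbers are my own, chosen for illustration) confirming the implied model:

```python
import random

def k_step_success_rate(p: float, k: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of the k-step success rate, assuming each of the
    k steps succeeds independently with probability p (Assumption 1) and the
    task succeeds only if every step does (Assumption 2)."""
    successes = sum(
        all(random.random() < p for _ in range(k)) for _ in range(trials)
    )
    return successes / trials

# Illustrative values: 99% 1-step accuracy over a 100-step task.
p, k = 0.99, 100
print(k_step_success_rate(p, k))  # ~0.366 (Monte Carlo estimate)
print(p ** k)                     # 0.3660..., the closed form under the assumptions
```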

Now the main mathematical argument is as follows. Let us define the 1-step accuracy as $x \in [0, 1]$, so that under the assumptions above the k-step success rate is $x^k$. Every unit increase in 1-step accuracy from $x$ to $x + \delta$ (e.g., $\delta = 0.01$ for one percentage point) leads to an improvement in the k-step success rate of $(x+\delta)^k - x^k$. The absolute improvement is $\approx k x^{k-1} \delta$ by taking the derivative of $x^k$, i.e., the improvement compounds with the horizon length. The relative improvement is $(1 + \delta/x)^k \approx e^{k\delta/x}$ by the binomial approximation. Thus the gain in k-step success rate grows exponentially in the task horizon length.
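
As a numeric sanity check (with illustrative values of my own, not from the post), this snippet compares the exact gain from a one-percentage-point improvement against the two approximations above:

```python
import math

x, k = 0.95, 20      # illustrative 1-step accuracy and horizon length
delta = 0.01         # one percentage point of 1-step accuracy

exact_abs = (x + delta) ** k - x ** k
approx_abs = k * x ** (k - 1) * delta        # first-order derivative of x**k
exact_rel = ((x + delta) / x) ** k
approx_rel = math.exp(k * delta / x)         # (1 + delta/x)**k ≈ e^(k*delta/x)

print(exact_abs, approx_abs)  # absolute gain compounds with the horizon k
print(exact_rel, approx_rel)  # relative gain grows exponentially in k
```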

As inference compute starts to outweigh pretraining compute and AI is used for longer-horizon tasks like automated coding, jobs, and research, log-linear scaling of $x$ with pretraining compute will be worth it, since the k-step success rate will improve near-linearly. This might explain recent results showing that the length of tasks AI can complete is increasing linearly with time.
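
One way to see the connection to task length (a sketch under the same assumptions; the 50% success threshold is a hypothetical choice of mine, not from the post): the longest horizon $k$ at which $x^k$ stays above a fixed threshold grows like $1/(1-x)$, so each halving of the 1-step error rate roughly doubles the achievable horizon.

```python
import math

def achievable_horizon(x: float, target: float = 0.5) -> float:
    """Longest horizon k for which the k-step success rate x**k stays above
    `target`: k = ln(target) / ln(x) ≈ -ln(target) / (1 - x) for x near 1."""
    return math.log(target) / math.log(x)

for one_step_error in (0.04, 0.02, 0.01, 0.005):
    x = 1.0 - one_step_error
    print(f"error {one_step_error:.3f} -> horizon ≈ {achievable_horizon(x):.0f} steps")
# Each halving of the 1-step error rate roughly doubles the achievable horizon,
# consistent with steadily growing task lengths from steady accuracy gains.
```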
 


