Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks

This post examines the value of AI models on long-horizon tasks. While gains in a model's single-step accuracy are only log-linear in data and compute, the resulting improvement on tasks requiring multi-step reasoning compounds with task length. The author argues that as AI is applied to long-horizon tasks such as automated coding, jobs, and research, inference compute will come to outweigh pretraining compute, and the single-step accuracy gains bought with pretraining compute will translate into near-linear growth in k-step success rate. Through a short mathematical derivation, the post shows that the relative gain in k-step success rate grows exponentially with the task horizon length, which helps explain AI's improving ability on long-horizon tasks.

💡 Single-step accuracy vs. k-step success rate: the post argues that log-linear gains in single-step accuracy translate into exponentially compounding gains in success rate on k-step tasks, so a model's performance on multi-step tasks benefits disproportionately from single-step accuracy improvements.

📈 Mathematical derivation: assuming each step succeeds independently and all steps must be correct for a k-step task to succeed, the author derives that the gain in k-step success rate grows exponentially with the horizon length k, underscoring the value of accuracy improvements on long-horizon tasks.

💻 Applications and economics: as AI is applied to long-horizon tasks such as automated coding, jobs, and research, inference compute will come to matter more than pretraining compute; the single-step accuracy gains from pretraining compute will then yield near-linear improvements in k-step success rate, and with them economic value.

Published on April 7, 2025 9:50 PM GMT

This post makes a simple point, so it will be short. I am happy to discuss more in the comments and, based on that, to write a longer post later. Much prior work (e.g., [1]) has shown that exponentially more data and compute are required for each unit improvement in accuracy. A popular argument this leads to:

Scaling compute and data is not economically viable beyond a threshold

This argument has a key issue. Log-linear gains in accuracy (or loss) are shown on "one-step" benchmarks such as direct QA. However, economic benefits primarily arise from tasks that require long thinking, output, and action horizons. For these, I propose tracking the k-step success rate, i.e., the probability that a k-step solution achieves its goal. Most benchmarks previously analyzed for log-linear scaling have at most a few (a small constant number of) steps. For simplicity, let us assume this constant is 1; this does not affect the rest of the argument (which rests on asymptotic calculations in k) beyond a constant factor.

We can make two simplifying assumptions to connect log-linear scaling in 1-step accuracy to the k-step success rate. (1) Each step succeeds independently, with a probability related to the 1-step accuracy by a constant. (2) All steps need to be correct for the full k-step solution to succeed. This holds for many useful tasks, such as solving math problems or an agent executing a task on the web. While recovering from failures is possible, the recovery steps might violate Assumption (1), so it is best to set that aside for now.
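
Under these two assumptions, the k-step success rate is just the 1-step success probability raised to the k-th power. A minimal Monte Carlo sketch in Python (the numbers are my own, chosen for illustration) confirming the implied model:

```python
import random

def k_step_success_rate(p: float, k: int, trials: int = 100_000) -> float:
    """Monte Carlo estimate of the k-step success rate, assuming each of the
    k steps succeeds independently with probability p (Assumption 1) and the
    task succeeds only if every step does (Assumption 2)."""
    successes = sum(
        all(random.random() < p for _ in range(k)) for _ in range(trials)
    )
    return successes / trials

# Illustrative values: 99% 1-step accuracy over a 100-step task.
p, k = 0.99, 100
print(k_step_success_rate(p, k))  # ~0.366 (Monte Carlo estimate)
print(p ** k)                     # 0.3660..., the closed form under the assumptions
```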

Now the main mathematical argument is as follows. Let us define the 1-step accuracy as $x \in [0, 1]$, so that under the assumptions above the k-step success rate is $x^k$. Every unit increase in 1-step accuracy from $x$ to $x + \delta$ (e.g., $\delta = 0.01$ for one percentage point) leads to an improvement in the k-step success rate of $(x+\delta)^k - x^k$. The absolute improvement is $\approx k x^{k-1} \delta$ by taking the derivative of $x^k$, i.e., the improvement compounds with the horizon length. The relative improvement is $(1 + \delta/x)^k \approx e^{k\delta/x}$ by the binomial approximation. Thus the gain in k-step success rate grows exponentially in the task horizon length.
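
As a numeric sanity check (with illustrative values of my own, not from the post), this snippet compares the exact gain from a one-percentage-point improvement against the two approximations above:

```python
import math

x, k = 0.95, 20      # illustrative 1-step accuracy and horizon length
delta = 0.01         # one percentage point of 1-step accuracy

exact_abs = (x + delta) ** k - x ** k
approx_abs = k * x ** (k - 1) * delta        # first-order derivative of x**k
exact_rel = ((x + delta) / x) ** k
approx_rel = math.exp(k * delta / x)         # (1 + delta/x)**k ≈ e^(k*delta/x)

print(exact_abs, approx_abs)  # absolute gain compounds with the horizon k
print(exact_rel, approx_rel)  # relative gain grows exponentially in k
```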

As inference compute starts to outweigh pretraining compute and AI is used for longer-horizon tasks like automated coding, jobs, and research, log-linear scaling of $x$ with pretraining compute will be worth it, since the k-step success rate will improve near-linearly. This might explain recent results showing that the length of tasks AI can complete is increasing linearly with time.
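
One way to see the connection to task length (a sketch under the same assumptions; the 50% success threshold is a hypothetical choice of mine, not from the post): the longest horizon $k$ at which $x^k$ stays above a fixed threshold grows like $1/(1-x)$, so each halving of the 1-step error rate roughly doubles the achievable horizon.

```python
import math

def achievable_horizon(x: float, target: float = 0.5) -> float:
    """Longest horizon k for which the k-step success rate x**k stays above
    `target`: k = ln(target) / ln(x) ≈ -ln(target) / (1 - x) for x near 1."""
    return math.log(target) / math.log(x)

for one_step_error in (0.04, 0.02, 0.01, 0.005):
    x = 1.0 - one_step_error
    print(f"error {one_step_error:.3f} -> horizon ≈ {achievable_horizon(x):.0f} steps")
# Each halving of the 1-step error rate roughly doubles the achievable horizon,
# consistent with steadily growing task lengths from steady accuracy gains.
```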
 


