Five Hinge‑Questions That Decide Whether AGI Is Five Years Away or Twenty

 

This article examines the debate over AI development timelines, arguing that every timeline forecast ultimately turns on five key quantitative points of disagreement. Take the optimistic view on three of them and AGI could arrive in 2026–2029; take the pessimistic view on two and the timeline slips past 2040. The points of disagreement are: which empirical curve to focus on (hardware spend, algorithmic efficiency, or revenue); whether software-only recursive self-improvement (RSI) can accelerate capabilities faster than hardware can be installed; how compute translates into economic value once broad "agentic" reliability is reached; whether automating half of essential tasks ignites runaway growth; and how much alignment concerns, regulation, and supply-chain friction slow scale-up.

📈 **Which curve to extrapolate:** Short-timeline forecasters emphasise growth in training compute and algorithmic efficiency, while long-timeline forecasters focus on capital actually deployed (such as NVIDIA datacentre revenue). A verifiable prediction: if by Q4 2026 NVIDIA's datacentre revenue and the AI capital expenditures of Google, Meta, OpenAI, and others have doubled year-over-year, the hardware curve is still exponential and the short-timeline camp scores a point.

🤖 **Software self-improvement:** The key question is whether capability can compound inside existing datacentres faster than the physical economy can supply new wafers, power, and cooling. Short-timeline forecasters point to the rise of agentic research assistants, while long-timeline forecasters note that every major algorithmic breakthrough so far has consumed large amounts of GPU time on brute-force experimentation. A testable prediction: if the fastest models trained in mid-2027 need one quarter of the time of equally large runs eighteen months earlier, that is strong evidence that software-only acceleration is real.

💰 **The leap from compute to value:** Today, even after ChatGPT, AI earns roughly $10,000 per H100-GPU-year. Short-timeline forecasters argue that once reliability crosses the human median, value will grow explosively. Long-timeline forecasters counter that newer, more agentic models need one to three orders of magnitude more reasoning tokens per answer, so inference costs are trending upward. A verifiable prediction: check revenue per deployed H100 by the end of 2027. If the figure exceeds $100,000, the efficiency threshold the short-timeline camp predicts has arrived.

⚙️ **How much automation is enough:** The short-timeline view holds that once half of remote cognitive labour becomes cheap, wages plummet, capital reallocates, and the remaining physical bottlenecks fall quickly. The long-timeline view holds that aggregate productivity is limited by the slowest essential sector. A decisive empirical milestone is whether any G7 country achieves three straight years of 6% real GDP-per-capita growth before general-purpose manipulation robots are widespread.

🚦 **Alignment and institutional drag:** Even if capability and economics line up, society may want to slow down anyway. Regulation is already tightening. Short-timeline optimists counter that once states perceive an existential threat or a chance at decisive advantage, they will bulldoze the barriers. One prediction: by January 2028, at least one top-three lab will publicly delay a frontier model by six months or more for safety reasons.

Published on May 6, 2025 2:48 AM GMT

For people who care about falsifiable stakes rather than vibes

TL;DR

All timeline arguments ultimately turn on five quantitative pivots. Pick optimistic answers to three of them and your median forecast collapses into the 2026–2029 range; pick pessimistic answers to any two and you drift past 2040. The pivots (I think) are:

1. Which empirical curve matters (hardware spend, algorithmic efficiency, or revenue).
2. Whether software‑only recursive self‑improvement (RSI) can accelerate capabilities faster than hardware can be installed.
3. How sharply compute translates into economic value once broad “agentic” reliability is reached.
4. Whether automating half of essential tasks ignites runaway growth or whether Baumol’s law keeps aggregate productivity anchored until all bottlenecks fall.
5. How much alignment fear, regulation, and supply‑chain friction slow scale‑up.

The rest of this post traces how the canonical short‑timeline narrative AI 2027 and the long‑timeline essays by Ege Erdil and Zhendong Zheng + Arjun Ramani diverge on each hinge, and proposes concrete bets that will force regular public updates.

Shared premises

Agreement in the forecasting/timelines community ends at the tempo question.

 

Hinge #1: Which curve do we extrapolate?

The first divide concerns what exactly we should project into the future. Short‑timeline advocates emphasise frontier‑training compute and algorithmic efficiency, or even just the general amalgamation of all benchmarks as "intelligence extrapolation". They point to six straight doublings in effective training FLOP between GPT‑2 and GPT‑4, and they cite scaling‑law papers showing a 1.6x yearly reduction in the compute required to reach any fixed loss. This is the engine behind the claim in AI 2027 that “CapEx grows one hundred‑fold in four years.” Long‑timeline authors reply that the best public proxy for capital actually deployed (NVIDIA datacentre revenue, which I think is a flawed metric for other reasons) stopped growing exponentially after the ChatGPT launch. Either you extrapolate benchmark and efficiency performance and hope that eventually gets you to automation, or you only count progress once the work itself is actually automated, which is what Erdil is trying to proxy with NVIDIA's revenue.

Overall, I think it's a good argument for the long-timelines side that we've still automated probably less than 0.5% of US GDP with AI. Reasoning models outperform most human mathematicians on FrontierMath, yet we have not seen a single peer‑reviewed theorem produced end‑to‑end by an LLM.[1] If benchmarks elsewhere are saturating, then there's likely a large disconnect between how economically useful an AI is and progress on evaluation suites and benchmarks (there's also probably unintentional cheating plus overfitting on hill‑climbed metrics; for more see here).

A bet follows naturally. If by the fourth quarter of 2026 NVIDIA’s datacentre revenue and the combined AI capital expenditures of Google, Meta, OpenAI, and others have doubled year‑over‑year, the hardware curve is still exponential and the short camp scores a point. If they have not, the long camp’s “curve‑bend” story gains credibility.
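To see how differently the two extrapolations behave, here is a minimal back‑of‑envelope sketch in Python. It uses only the figures quoted above (roughly 100x CapEx growth over four years and a 1.6x yearly algorithmic‑efficiency gain); these are the camps' claims rather than measurements, and multiplying them together is just the usual "effective compute" accounting.

```python
# Back-of-envelope: how the two "short-timeline" curves compound.
# Assumed inputs, taken from the figures quoted above rather than measured data:
#   - CapEx / hardware spend grows ~100x over four years  => ~3.16x per year
#   - algorithmic efficiency improves ~1.6x per year (compute needed for a fixed loss)

hardware_growth_per_year = 100 ** (1 / 4)   # ~3.16x per year
algo_efficiency_per_year = 1.6

# "Effective compute" accounting: the two multipliers stack.
effective_compute_per_year = hardware_growth_per_year * algo_efficiency_per_year

years = 4
print(f"Hardware spend multiplier over {years} years:             "
      f"{hardware_growth_per_year ** years:,.0f}x")
print(f"Effective training compute multiplier over {years} years: "
      f"{effective_compute_per_year ** years:,.0f}x")
# If the hardware curve goes flat (the long camp's "curve-bend" story), only
# the algorithmic term is left:
print(f"Efficiency-only multiplier over {years} years:             "
      f"{algo_efficiency_per_year ** years:.1f}x")
```

On these assumed rates, four years of progress is roughly 650x effective compute if both curves hold, but under 7x if the hardware term flattens, which is why the choice of curve dominates everything downstream.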

 

Hinge #2: Can software‑only recursive self‑improvement outrun atoms?

The second hinge asks whether capability can compound inside existing datacentres faster than the physical economy can supply new wafers, power, and cooling. Short‑timeline writers emphasise the rise of agentic research assistants. They note that Devin, Claude Code, and similar agents can already complete medium‑sized GitHub projects. Once a model designs better chips, compilers, or curricula, capability doublings might come from pure code for a year or two: the classic “foom” or “phase transition.”

Erdil’s rejoinder is empirical. Every major algorithmic breakthrough so far (attention, RLHF, mixture routing) consumed between a thousand and a hundred‑thousand GPU‑days of brute‑force experimentation. Experimentation, in turn, is limited by hardware delivery times and energy budgets. Data, too, is finite: the Villalobos et al. paper suggests we exhaust high‑quality language data this decade.

I'm more sympathetic to long timelines here, once again. Operator, Devin, and Anthropic’s workstation agent still regularly fail a task as mundane as booking a round‑trip flight with seat selection. There are plenty of reports of Devin succeeding on only about 5–20% of real‑world tasks. If we cannot yet automate a travel clerk, claims about near‑total remote‑work automation feel premature. There's also a strong bottleneck argument around compute and real‑world data, which people have made countless times and which I'm not sure where I stand on.

A testable prediction is to watch wall‑clock time for frontier training runs. If the fastest models trained in mid‑2027 require one‑quarter the elapsed time of equally large runs eighteen months earlier, without a node shrink below four nanometres, that will be strong evidence that software‑only acceleration is real and is not heavily bottlenecked by compute or by real‑world experiments slowing the AI researchers down.
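A quick sketch of the arithmetic behind that prediction, assuming only the four‑times‑faster‑in‑eighteen‑months figure above: it backs out the software‑side doubling time such an observation would imply.

```python
import math

# The prediction above: an equally large frontier run in mid-2027 finishes in one
# quarter of the wall-clock time of a run 18 months earlier, with no node shrink
# below 4 nm (so per-chip hardware speed is held roughly constant).
speedup = 4.0          # 4x faster wall-clock at the same effective scale
window_months = 18.0

# Back out the implied doubling time of software-side training efficiency.
doubling_time_months = window_months / math.log2(speedup)
print(f"Implied software-side doubling time: {doubling_time_months:.0f} months")
# ~9 months per doubling, sustained, would outpace the hardware-installation
# cycle, which is what "software-only RSI" requires.
```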

 

Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?

Current AI, even after ChatGPT, earns roughly ten‑thousand dollars per H100‑GPU‑year, a figure Erdil emphasises precisely because it roughly matches world GDP per capita. Short‑timeline thinkers retort that this measure hides a looming discontinuity. Once reliability crosses the human median, the “train once, deploy many” property kicks in: the marginal cost of the N‑th copy collapses to server depreciation and a few cents of electricity. They also foresee specialised inference hardware such as NVIDIA’s Blackwell or Groq’s chips, plus sparsity and mixture routing, pushing cost per useful token down by two orders of magnitude. Moreover, value is highly uneven: AlphaFold’s acceleration of drug discovery might be worth more than the output of many millions of average coders.

Long‑timeline writers respond that newer, more agentic models already need one to three orders of magnitude more reasoning tokens per answer, so the inference cost trend is rising, not falling.

A bet here is to revisit revenue per deployed H100 by the end of 2027.  If the number exceeds one‑hundred‑thousand dollars, the efficiency threshold the short camp predicts has arrived. If it remains close to today’s ten‑thousand, the sceptics will have been vindicated.
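For intuition on what those two revenue levels mean, here is a rough unit‑economics sketch. The GPU price, depreciation schedule, and electricity cost are illustrative assumptions of mine, not figures from the essays; only the ten‑thousand and one‑hundred‑thousand dollar revenue numbers come from the text above.

```python
# Rough unit economics for one deployed H100. Every cost input here is an
# illustrative assumption, not a figure from the essays; only the $10k and
# $100k revenue numbers come from the text above.

gpu_capex = 30_000            # assumed purchase price (USD)
lifetime_years = 4            # assumed straight-line depreciation period
power_kw = 0.7                # rough board power for an H100
price_per_kwh = 0.08          # assumed industrial electricity price (USD)

hours_per_year = 24 * 365
annual_cost = gpu_capex / lifetime_years + power_kw * hours_per_year * price_per_kwh

revenue_today = 10_000        # ~$10k per H100-year, the figure Erdil cites
revenue_threshold = 100_000   # the short camp's end-of-2027 threshold

print(f"Rough annual cost per H100:   ${annual_cost:,.0f}")
print(f"Margin at today's revenue:    ${revenue_today - annual_cost:,.0f}")
print(f"Margin at the 2027 threshold: ${revenue_threshold - annual_cost:,.0f}")
# The "train once, deploy many" argument is that this cost line stays roughly
# fixed while utilisation and price per useful token rise, so revenue per
# GPU-year can jump without any new hardware being installed.
```

On these assumptions a deployed H100 costs on the order of eight thousand dollars a year to own and run, so today’s ten‑thousand‑dollar revenue is barely above breakeven, while the short camp’s threshold implies order‑of‑magnitude margins.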

 

Hinge #4: Must we automate everything, or is half enough?

Here the argument turns to macroeconomic theory. Short‑timeline narratives imagine a reaction wheel: once half of remote cognitive labour is cheap, wages plummet, capital reallocates, and the remaining physical bottlenecks fall rapidly because swarms of AI engineers will design better robots, batteries, and even fusion plants. They also note that GDP composition can migrate; maybe more and more value lives in purely digital goods, so traditional Baumol constraints vanish.

Zheng and Ramani retort that William Baumol’s law is brutal: aggregate productivity is throttled by the slowest essential sector, whether that’s housing, logistics, healthcare, or energy. For instance, electricity and the internet each boosted sectoral productivity by a factor of a thousand but raised frontier GDP‑per‑capita growth by less than one percentage point. Unless AI swiftly unlocks safe autonomous robotics, cheap power, and fast construction, growth will be stuck.
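To make the Baumol point concrete, here is a toy two‑sector sketch with made‑up growth rates; the Leontief “min” is the extreme strong‑complements case, not a claim about the real production function.

```python
# Toy Baumol illustration: two sectors that are strong complements, with
# made-up growth rates. The Leontief "min" is the extreme case, not a claim
# about the real production function.

years_to_print = {5, 10, 20}
cognitive = 1.0   # sector AI accelerates (remote cognitive work)
physical = 1.0    # slow, essential sector (housing, logistics, energy...)

for year in range(1, 21):
    cognitive *= 1.50   # assume 50%/yr productivity growth from AI
    physical *= 1.01    # assume the essential sector plods along at 1%/yr

    # Under strong complementarity, aggregate output is capped by the scarcer input.
    aggregate = min(cognitive, physical)
    if year in years_to_print:
        print(f"year {year:2d}: cognitive {cognitive:8.1f}x   "
              f"physical {physical:5.2f}x   aggregate {aggregate:5.2f}x")
```

Even with the cognitive sector compounding at 50% a year in this toy setup, aggregate output crawls along at the physical sector’s 1%, which is exactly the long‑timeline worry.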

A decisive empirical milestone is whether any G7 country manages three straight years of six‑percent real GDP‑per‑capita growth before general‑purpose manipulation robots are common. If that occurs, the reaction‑wheel model wins.

 

Hinge #5: Alignment‑driven and institutional drag

Finally, even if capability and economics line up, society might want us to slow down anyway. Regulation is already thickening. Physical bottlenecks appear in grid interconnect queues, water‑cooling permits, and limited HBM memory supply. Alignment worries form a special drag: Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework both envision voluntary pauses.

Short‑timeline optimists counter with the Manhattan‑mode story: once states perceive an existential threat or an opportunity for decisive advantage, they bulldoze barriers. The AI 2027 scenario predicts permissive special‑economic zones where training can continue at maximum speed, while regulators lag behind.

We can measure drag directly. One bet is that, by January 2028, at least one top‑three lab publicly delays a frontier model six or more months for safety reasons. Another is that U.S. datacentre megawatt backlogs fall below six months by 2027 and export‑control stringency remains at 2024 levels.

 

Dependency Structure

These hinges are not independent. If the hardware and algorithmic curves keep doubling, software‑only RSI is more plausible. If RSI works, agent efficiency is likelier to jump and GDP could surge even before robots. On the other hand, strong alignment or regulatory brakes can nullify the whole chain. In practice, pessimistic answers to any two hinges probably push full‑replacement timelines into the 2040s; optimistic answers to at least three make the late 2020s credible.
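One way to see that structure is as a chain with a veto at the end. The probabilities below are placeholders purely to illustrate the shape, not the author’s estimates.

```python
# Illustrative only: the five hinges behave like a chain with a veto at the end,
# not five independent coin flips. Every probability below is a placeholder to
# show the structure; none of them are the author's estimates.

p_curves_hold        = 0.6   # Hinge 1: hardware + efficiency curves keep doubling
p_rsi_given_curves   = 0.5   # Hinge 2: software-only RSI, conditional on the curves
p_value_given_rsi    = 0.7   # Hinge 3: compute-to-value jump, conditional on RSI
p_growth_given_value = 0.6   # Hinge 4: half-automation ignites runaway growth
p_no_brakes          = 0.5   # Hinge 5: alignment/regulatory drag does not bind

p_short_timeline = (p_curves_hold * p_rsi_given_curves * p_value_given_rsi
                    * p_growth_given_value * p_no_brakes)
print(f"P(late-2020s AGI) under these placeholder numbers: {p_short_timeline:.2f}")
# Pushing any single early factor toward zero, or letting the final "brake" term
# bind, collapses the whole product: that is the sense in which a couple of
# pessimistic answers push the median well past 2040.
```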


Timeline debates are tractable because they hinge on the above five measurable questions, at least to me. We do not need oracular insight, only discipline: decide which side of each hinge you occupy, publish your odds, and update when the world issues new datapoints. The difference between an AGI in 2027 and one in 2047 may feel philosophical, but it will ultimately be written in SEC filings, power‑grid spreadsheets, and lab press releases. Let us read those, not the vibes.

 

  1. ^

    Reasoning models outscore basically all human mathematicians on FrontierMath, but still haven't produced one novel mathematical result, suggesting a disconnect between benchmarks and actual real‑world use. For instance, if a human gets 25% on FrontierMath (i.e. someone like Terence Tao), we assume they'll produce great maths research, because we know those things to be very correlated in humans. However, the correlation doesn't necessarily hold for LLMs: we could have just hill‑climbed on the FrontierMath benchmark and overfitted to that.



