LessWrong, July 19, 06:16
Agents lag behind AI 2027's schedule

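This article assesses the accuracy of the "AI 2027" timeline forecast, which attempts to map out the course of an AI "intelligence explosion" and expects highly capable AI systems by 2027. Taking July 2025 as its reference point, it reviews progress in the roughly 3.5 months since the timeline was published. Examining the specific forecasts for two benchmarks, OSWorld and SWEBench, the author finds a gap between forecast and actual progress: AI agents have not yet reached the predicted scores on either benchmark. The article notes that, despite new model releases, performance on public benchmarks still falls short of AI 2027's optimistic forecast, and it speculates that real AI progress may be slower than predicted. The author also stresses that benchmarks are not the only measure of AI capability and that near-term forecasts are sensitive to factors such as release timing, but overall the assessment offers a useful reference point for discussions of AI timelines.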

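🎯 **Assessing the AI 2027 forecast so far**: Taking July 2025 as its reference point, the article assesses the accuracy of the "AI 2027" timeline published in April 2025. The timeline attempts to map out the course of an AI "intelligence explosion" and predicts extremely powerful AI systems by 2027. The author notes that roughly 13% of the timeline has elapsed since publication, making this a good moment to evaluate the early predictions.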

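📊 **Gap between benchmark forecasts and actual results**: In its "Mid 2025: Stumbling Agents" chapter, AI 2027 makes specific forecasts for AI agent performance on two benchmarks: 65% on OSWorld and 85% on SWEBench. As of mid-July 2025, however, the state-of-the-art (SOTA) score on OSWorld is only 45.2%, about 20 points below the forecast, and the SOTA on SWEBench is 75%, 10 points below the forecast. AI agents have therefore not met expectations on these key benchmarks.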

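🚀 **New model releases and their effect on the forecast**: The article mentions OpenAI's newly released ChatGPT Agent model. Its OSWorld score may exceed the current SOTA, but the author guesses that it is unlikely to reach the predicted 65% milestone. On SWEBench, the new model performs on par with or slightly below earlier models. The author argues that a significant leap in capability would normally be highlighted at release, and that the gaps on these benchmarks suggest the forecast was optimistic.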

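⏳ **Speculation and reflection on AI timelines**: Given the lag on the early benchmark forecasts, the author speculates that the overall AI 2027 timeline may need to be stretched; for example, the self-improving AI milestone might slip to 2029. The article also discusses the limits of near-term forecasts, such as the way a delayed model release can change the result and the fact that benchmarks do not fully reflect real-world utility. Even so, the author considers these assessments a valuable input to discussions of the pace of AI progress, and suggests tracking and analyzing AI timelines regularly, and perhaps drafting a longer-term narrative.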

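💡 **Limits of benchmarks and measuring AI progress**: The article acknowledges that benchmarks are not a perfect measure of AI capability, and that scores can either overstate or understate real ability. Nevertheless, the author regards benchmarks as the best quantitative tool for assessing AI progress, more reliable than vague discussion of model capabilities. The author stresses the value of ongoing discussion of AI timelines and proposes that the community collaborate on a longer-term narrative, such as "AI 2030", that could be analyzed and adjusted in real time.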

Published on July 18, 2025 9:49 PM GMT

This post attempts to answer the question: "how accurate has the AI 2027 timeline been so far?"

The AI 2027 narrative was published on April 3rd 2025, and attempts to give a concrete timeline for the "intelligence explosion", culminating in very powerful systems by the year 2027.

Concretely, it predicts that the leading AI company will have a fully self-improving AI / "country of geniuses in a datacenter" by June 2027, about 2 years after the narrative starts.

Today is mid-July 2025, about 3.5 months after the narrative was posted. This means that we have passed roughly 13% of the timeline up to the claimed "geniuses in a datacenter" moment. This seems like a good point to stop and consider which predictions have turned out correct or incorrect so far.
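The elapsed share of the timeline can be checked directly from the dates given in this post. The sketch below is only a rough illustration: the publication date and today's date come from the post, while the choice of June 1, 2027 as the milestone date is an assumption (the narrative only says "June 2027").

```python
from datetime import date

# Dates stated in the post.
published = date(2025, 4, 3)   # AI 2027 publication date
today = date(2025, 7, 18)      # date of this post

# Assumption: the post only says "June 2027", so the first of the month
# is used here as a stand-in for the milestone date.
milestone = date(2027, 6, 1)

elapsed_days = (today - published).days
total_days = (milestone - published).days
fraction = elapsed_days / total_days

# 30.44 is the average month length in days (365.25 / 12).
print(f"elapsed: {elapsed_days} days (~{elapsed_days / 30.44:.1f} months)")
print(f"share of timeline passed: {fraction:.0%}")
```

Moving the assumed milestone date anywhere within June 2027 changes the result by well under one percentage point.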

Specifically, we should consider the points made in the narrative's first section, "Mid 2025: Stumbling Agents". The term "mid-2025" is fairly ambiguous, but we're definitely at this point in time.[1]

There is a prediction tracker for the narrative, posted in May, that attempts to answer this question. So far it claims "100% accuracy"; however, the resolved bullet points from the narrative are either obvious ("The better AI agents are expensive"), very general, or were already true when the narrative was published (for pre-existing agents like Devin and Operator).

A better way to check the narrative's accuracy, in my opinion, is to look for the concrete / quantitative predictions. The chapter offers only two such predictions, both concerning the state-of-the-art (SOTA) scores for two benchmarks: OSWorld and SWEBench.

OSWorld

AI 2027's footnote 9 states:

Specifically, we forecast that they [mid-2025 agents] score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).

OSWorld is a benchmark measuring "Open-Ended Tasks in Real Computer Environments" (like accessing folders and updating spreadsheets).

As mentioned, the SOTA at the time of publication was 38.1% for Operator (using GPT-4o as a base model). There have been no major improvements since: the SOTA is now 45.2%, a 7-point gain that still leaves it about 20 points short of the prediction.

An important caveat is that OpenAI has recently announced a new agent model, ChatGPT Agent, with improved capabilities compared to Operator.

OpenAI mentioned several benchmark improvements for this model in its blog post; however, the post did not mention OSWorld performance at all.

Given the limited info, my best guess is:

My reasoning: a 20-point jump in capabilities would likely be highlighted by the company, which (like most AI companies) tends to publish a curated list of the most impressive benchmark results for each new model. So far, no OSWorld result has been mentioned.

This post will be updated with further information on OSWorld if it arises.

SWEBench

Footnote 10 states:

We forecast that mid-2025 agents will score 85% on SWEBench-Verified.

SWEBench measures models' ability to solve real-world GitHub issues; SWEBench-Verified is a subset of tasks curated by OpenAI for greater reliability.

The SOTA at the time of publication was around 65% (achieved by Augment Agent v0 in March 2025). The SOTA today is 75%, a 10-point improvement. We're still 10 points short of the predicted milestone.

It is possible that improvements on the benchmark have been achieved and not yet published, but that seems unlikely.[2]
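The two benchmark comparisons can be put side by side with a few lines of arithmetic. This is a minimal sketch using only the scores quoted in this post; the `Benchmark` class and its field names are hypothetical, and the SWEBench starting score is approximate ("around 65%").

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    sota_at_publication: float  # SOTA (%) when AI 2027 was published
    sota_now: float             # SOTA (%) as of mid-July 2025
    forecast: float             # AI 2027 forecast (%) for mid-2025

    @property
    def gain(self) -> float:
        """Points gained since publication."""
        return self.sota_now - self.sota_at_publication

    @property
    def shortfall(self) -> float:
        """Points still missing relative to the forecast."""
        return self.forecast - self.sota_now

    @property
    def share_of_forecast_gain(self) -> float:
        """Fraction of the forecast improvement achieved so far."""
        return self.gain / (self.forecast - self.sota_at_publication)


# Scores as quoted in the post (the 65% SWEBench starting point is approximate).
benchmarks = [
    Benchmark("OSWorld", 38.1, 45.2, 65.0),
    Benchmark("SWEBench-Verified", 65.0, 75.0, 85.0),
]

for b in benchmarks:
    print(
        f"{b.name}: +{b.gain:.1f} pts since publication, "
        f"{b.shortfall:.1f} pts short of forecast, "
        f"{b.share_of_forecast_gain:.0%} of the forecast gain achieved"
    )
```

On these numbers, OSWorld has covered roughly a quarter of the forecast improvement and SWEBench-Verified about half.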

Thoughts

This post considers near-term predictions, and the details written here are likely to be different a few months or even weeks from now.

Because this analysis focuses on the near term, real-world outcomes are likely to be influenced by minute details; a one-month delay in releasing a new model (say, due to marketing reasons) could completely change the narrative.

Thus, one shouldn't take these fine-grained details too seriously. The main point of AI 2027 is not to give a completely accurate reading of the timelines, but to play out a hypothetical example of what an intelligence explosion might look like. The authors themselves admit that the current narrative is too aggressive and the real timeline will likely be slower.

What these results do provide is evidence that, indeed, the suggested AI 2027 timelines are unlikely to be met; this should perhaps nudge us towards slightly longer timelines.

The first set of predictions (the two agent benchmarks) might take twice as long in reality - say, 6 months to hit those milestones instead of 3.

If we stretch the entire timeline by that factor, everything would run about 2x slower, so a self-improving AI would be achieved by 2029 (four years from now) instead of 2027.
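The stretch described above can be sketched as follows. This is a rough illustration, not a forecast: it assumes the stretch is anchored at the publication date (April 3, 2025), uses June 1, 2027 as a stand-in for "June 2027", and relies on a hypothetical helper, `stretch_deadline`.

```python
from datetime import date, timedelta


def stretch_deadline(start: date, original_end: date, factor: float) -> date:
    """Scale the interval from `start` to `original_end` by `factor`,
    keeping `start` fixed, and return the new end date."""
    original_days = (original_end - start).days
    return start + timedelta(days=round(original_days * factor))


published = date(2025, 4, 3)         # AI 2027 publication date (from the post)
original_milestone = date(2027, 6, 1)  # assumption: stand-in for "June 2027"

# A 2x slowdown, as suggested by milestones taking 6 months instead of 3.
stretched = stretch_deadline(published, original_milestone, factor=2.0)
print(f"stretched milestone: {stretched.isoformat()}")
```

Under these assumptions the stretched milestone lands in mid-2029. Anchoring the stretch at today's date instead of the publication date would move it a few months earlier, still within 2029.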

This is all getting highly speculative, and can be taken in different directions:

Also, it's clear that benchmarks are not a perfect way to measure capabilities, and benchmark scores often don't reflect the models' real-world utility. I believe benchmark scores often paint too optimistic a picture (companies achieve impressive-looking scores by optimizing for narrow tasks while real-world performance is disappointing), but the reverse is also possible (a benchmark is unnecessarily difficult and a mediocre score hides impressive real-world performance).

Thus, "SWEBench progress is slower than expected" does not necessarily amount to "agent progress is slower than expected". However, I do believe there's a non-zero correlation between the two statements. And while imperfect, benchmarks are still our best way to test progress, compared with ambiguous discussions of which models "seem better". METR has done impressive research in formalizing benchmarks and gaining insights from partial numerical data.

I believe that frequent discussion on how the timelines play out is valuable; a monthly community post analyzing the AI 2027 timelines can be a great way to check our assumptions.

If the consensus is that the original timeline was too aggressive, another idea for a community effort is rewriting the original plot with a longer-term horizon, say an "AI 2030" narrative, which would be published here and then analyzed in real time, month by month.

[1] Arguably there's still time. In 2.5 months we'll enter Q4 of 2025; at that point, the Stumbling Agents predictions should have been fully played out, and the next chapter of the narrative should start ("Late 2025: The World’s Most Expensive AI").

[2] The new OpenAI agent model does mention SWEBench Verified in its system card, where it is roughly comparable (or slightly behind) o3's performance.


