Published on July 18, 2025 9:49 PM GMT

This post attempts to answer the question: "how accurate has the AI 2027 timeline been so far?"

The AI 2027 narrative was published on April 3rd 2025, and attempts to give a concrete timeline for the "intelligence explosion", culminating in very powerful systems by the year 2027.

Concretely, it predicts that the leading AI company will have a fully self-improving AI / "country of geniuses in a datacenter" by June 2027, about 2 years after the narrative starts.

Today is mid-July 2025, about 3.5 months after the narrative was posted. This means that we have passed a bit over 10% of the timeline up to the claimed "geniuses in a datacenter" moment. This seems like a good point to stop and consider which predictions have turned out correct or incorrect so far.
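The elapsed share of the timeline can be checked directly from the dates. Below is a minimal sketch; the publication and post dates come from the text, while picking June 1 as the day for "June 2027" is my own assumption, as are the function and variable names.

```python
from datetime import date

# Dates taken from the post; the exact day within June 2027 is an assumption.
PUBLISHED = date(2025, 4, 3)   # AI 2027 publication date
TODAY = date(2025, 7, 18)      # date of this post
TARGET = date(2027, 6, 1)      # assumed day for the "June 2027" milestone


def elapsed_fraction(start: date, now: date, end: date) -> float:
    """Return the share of the start-to-end interval that has passed by `now`."""
    return (now - start).days / (end - start).days


elapsed_days = (TODAY - PUBLISHED).days
fraction = elapsed_fraction(PUBLISHED, TODAY, TARGET)
print(f"{elapsed_days} days elapsed, {fraction:.1%} of the way to the target date")
```

Moving the assumed target day within June 2027 shifts the result by well under a percentage point, so the exact choice matters little.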

Specifically, we should consider the points made in the narrative's first section, "Mid 2025: Stumbling Agents". The term "mid-2025" is fairly ambiguous, but we're definitely at this point in time.^[1]

There is a prediction tracker for the narrative, posted in May, that attempts to answer this question. So far it claims "100% accuracy"; however, the resolved bullet points from the narrative are either obvious ("The better AI agents are expensive"), very general, or were already true when the narrative was published (for pre-existing agents like Devin and Operator).

A better way to check the narrative's accuracy, in my opinion, is to look for the concrete / quantitative predictions. The chapter offers only two such predictions, both concerning the state-of-the-art (SOTA) scores for two benchmarks: OSWorld and SWEBench.

OSWorld

AI 2027's footnote 9 states:

Specifically, we forecast that they [mid-2025 agents] score 65% on the OSWorld benchmark of basic computer tasks (compared to 38% for Operator and 70% for a typical skilled non-expert human).

OSWorld is a benchmark measuring "Open-Ended Tasks in Real Computer Environments" (like accessing folders and updating spreadsheets).

As mentioned, the SOTA at the time of publication was 38.1% for Operator (using GPT-4o as a base model). There haven't been any major improvements since then; the SOTA is now 45.2%, a 7-point improvement, still about 20 points behind the prediction.

An important caveat is that OpenAI has recently announced a new agent model, ChatGPT Agent, with improved capabilities compared to Operator.

OpenAI mentioned several benchmark improvements for this model in its blog post, but it did not mention OSWorld performance at all.

Given the limited info, my best guess is:

When tested against OSWorld, ChatGPT Agent will have impressive capabilities, maybe passing the SOTA; however, it will still not reach the predicted 65% milestone.

My reasoning: a 20-point jump in capabilities would likely be noticed by the company, which (like most AI companies) tends to post a curated list of the most impressive benchmark results for each new model, and so far OSWorld hasn't been mentioned.

This post will be updated with further information on OSWorld if it arises.

SWEBench

Footnote 10 states:

We forecast that mid-2025 agents will score 85% on SWEBench-Verified.

SWEBench measures models' capabilities in solving real-world GitHub issues; SWEBench-Verified is a subset of tasks curated by OpenAI for more reliable evaluation.

The SOTA at the time of publication was around 65% (achieved by Augment Agent v0 in March 2025). The SOTA today is 75%, a 10-point improvement. We're still 10 points short of the predicted milestone.
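To put the two benchmarks side by side, here is a rough sketch of how much of each forecast improvement has actually materialized. The scores are the ones quoted in this post (the SWEBench-Verified baseline is "around 65%", so treat it as approximate); the function and variable names are my own.

```python
# Scores (in percent) quoted in this post. "baseline" is the SOTA when AI 2027
# was published, "current" is today's SOTA, and "forecast" is the mid-2025
# prediction from the narrative's footnotes.
BENCHMARKS = {
    "OSWorld": {"baseline": 38.1, "current": 45.2, "forecast": 65.0},
    "SWEBench-Verified": {"baseline": 65.0, "current": 75.0, "forecast": 85.0},
}


def progress_report(baseline: float, current: float, forecast: float):
    """Return (points gained, points still missing, share of forecast gain achieved)."""
    gained = current - baseline
    remaining = forecast - current
    share = gained / (forecast - baseline)
    return gained, remaining, share


for name, scores in BENCHMARKS.items():
    gained, remaining, share = progress_report(**scores)
    print(f"{name}: +{gained:.1f} points, {remaining:.1f} short, "
          f"{share:.0%} of the forecast gain achieved")
```

On these numbers, SWEBench-Verified has covered about half of the forecast gain, while OSWorld has covered roughly a quarter of it.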

It is possible that improvements on the benchmark have been achieved and not yet published, but that seems unlikely.^[2]

Thoughts

This post considers near-term predictions, and the details written here are likely to be different a few months or even weeks from now.

Because this analysis focuses on the near term, real-world outcomes are likely to be influenced by minute details; a one-month delay in releasing a new model (say, due to marketing reasons) could completely change the narrative.

Thus, one shouldn't take these fine-grained details too seriously. The main point of AI 2027 is not to give a completely accurate reading of the timelines, but to play out a hypothetical example of what an intelligence explosion might look like. The authors themselves admit that the current narrative is too aggressive and the real timeline will likely be slower.

What these results do provide is evidence that, indeed, the suggested AI 2027 timelines are unlikely to be met; this should perhaps nudge us towards slightly longer timelines.

The first set of predictions (the two agent benchmarks) might take twice as long in reality - say, 6 months to hit those milestones instead of 3.

If we stretch the entire timeline by that factor, everything would run about 2x slower, so a self-improving AI would be achieved around 2029 (four years from now) instead of 2027.
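The stretching step is simple enough to write down. The sketch below scales the interval between the narrative's start and the June 2027 milestone by a slowdown factor; reading "mid-2025" as June 2025 is my assumption, and `stretch_timeline` is a hypothetical helper, not anything from AI 2027 itself.

```python
def stretch_timeline(start, target, factor):
    """Scale the start-to-target interval by `factor`.

    `start` and `target` are (year, month) tuples; the result is the
    (year, month) at which the target would land on the stretched timeline.
    """
    start_idx = start[0] * 12 + (start[1] - 1)      # months since year 0
    target_idx = target[0] * 12 + (target[1] - 1)
    stretched_idx = start_idx + round((target_idx - start_idx) * factor)
    return stretched_idx // 12, stretched_idx % 12 + 1


NARRATIVE_START = (2025, 6)   # assumption: "mid-2025" read as June 2025
MILESTONE = (2027, 6)         # "geniuses in a datacenter" date from the narrative

print(stretch_timeline(NARRATIVE_START, MILESTONE, 2.0))
```

A factor of 1.0 returns the original milestone, and intermediate values (say 1.5) give a feel for how sensitive the end date is to the assumed slowdown.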

This is all getting highly speculative, and can be taken in different directions:

AI 2027

Also, it's clear that benchmarks are not a perfect way to measure capabilities, and benchmark scores often don't reflect the models' real-world utility. I believe benchmark results are often too optimistic (companies achieve impressive-looking scores by optimizing narrow tasks while real-world performance is disappointing), but the reverse is also possible (a benchmark is unnecessarily difficult and a mediocre score hides impressive real-world performance).

Thus, "SWEBench progress is slower than expected" does not necessarily amount to "agent progress is slower than expected". However, I do believe there's a non-zero correlation between the two statements. And while imperfect, benchmarks are still our best way to test progress, compared with ambiguous discussion about which models "seem better". METR has done impressive research in formalizing benchmarks and gaining insights from partial numerical data.

I believe that frequent discussion on how the timelines play out is valuable; a monthly community post analyzing the AI 2027 timelines can be a great way to check our assumptions.

If the consensus is that the original timeline was too aggressive, another idea for community effort is rewriting the original plot with a longer-term horizon, say an "AI 2030" narrative, which will be published here and then analyzed in real-time, month by month.

^[1]
Arguably there's still time. In 2.5 months we'll enter Q4 of 2025; at that point, the Stumbling Agents predictions should have been fully played out, and the next chapter of the narrative should start ("Late 2025: The World’s Most Expensive AI").
^[2]
The new OpenAI agent model does mention SWEBench Verified in its system card, where its performance is roughly comparable to (or slightly behind) o3's.
