How 2025 AI Forecasts Fared So Far

This post is based on a survey of AI development forecasts conducted from late 2024 to early 2025, analyzing 421 respondents' predictions on key AI indicators. The survey covered benchmarks including RE-Bench, SWE-bench Verified, Cybench, OSWorld, and FrontierMath, and examined OpenAI's risk-assessment framework. By comparing forecasts made before and after the OpenAI o3 announcement, the post assesses AI progress in each area and summarizes 2025 AI development trends.

🚀 **RE-Bench**: Forecasters expected AI performance on RE-Bench tasks to improve markedly in 2025, closing most of the gap between 2024 AI performance and the best observed human performance. AI has already surpassed the median human expert score.

💻 **SWE-bench Verified**: Respondents expected progress on SWE-bench Verified to continue, likely saturating by the end of 2025. In December 2024, performance rose from 55% to 62%, and has since risen to 70%.

🛡️ **Risk assessment**: OpenAI's preparedness framework tracks model risks including cybersecurity, model autonomy, and persuasion. During the survey period, Deep Research hit the Medium Cybersecurity threshold and o3-mini hit the Medium Model Autonomy threshold.

🧠 **FrontierMath**: On FrontierMath, o3 far exceeded expectations, prompting much higher forecasts for end-of-2025 capabilities. o3 scored 24%, far above the prior SOTA of 2%.

Published on May 22, 2025 9:42 AM GMT

At the end of 2024, there was a lot of discussion about whether AI scaling was hitting a wall and whether we would reach AGI soon.

We created a forecasting survey that would track key markers of AI progress in what might turn out to be a pivotal year. After filtering the data, we had 421 unique respondents. All questions were optional.

The survey was open from Nov 30th 2024 to Jan 20th 2025 – during which OpenAI o3 was announced, so we can compare forecasts before and after the announcement.

In this post, we summarise respondents' forecasts, and look at how they're holding up so far. At the end of the year, we'll resolve all the forecasts and write up the results (you can join the mailing list here to get notified about that).

 

Note: I'd recommend reading this post on AI Digest, where you'll find expandable sections for background info, and how responses varied based on forecaster timelines and P(doom).

AI Research: RE-Bench

1) What will be the best normalized score achieved on the original 7 RE-Bench tasks by December 31st 2025?

RE-Bench consists of 7 challenging, open-ended Machine Learning optimization problems created by METR to highlight "automation of AI research and development (R&D) by AI agents".

How much AI will accelerate AI R&D is a central point of disagreement in debates over whether AI progress will speed up or slow down. The Research Engineering Benchmark (RE-Bench) is one of our best current indicators.

Forecasters were bullish – they expected most of the gap between 2024 AI performance and the best observed human performance to be closed in 2025.

Since December, the median human expert score has already been beaten, as METR reports:

When given a total time budget of 32 hours, o4-mini exceeds the 50th percentile of human performance averaged across the five tasks.

The maximum possible score is around 2.0.

Note that the listed state of the art (SOTA) performance from o4-mini is approximate (it's on 5 of the 7 questions, and it's read off this graph visually).

Pre/Post o3 Comparison

Part-way through the forecasting survey, OpenAI o3 was announced with strong reported performance. Forecasts submitted after this announcement were generally more bullish.

Software Engineering: SWE-bench Verified

2) What will be the best performance on SWE-bench Verified by December 31st 2025?

The SWE-bench Verified benchmark consists of 500 software engineering problems drawn from real GitHub issues and verified by human annotators to be non-problematic.

Forecasters expected progress on the popular SWE-bench Verified benchmark to continue apace, likely saturating by the end of 2025. In December 2024, performance rose from 55% to 62%, and since then it has risen to 70%.

Note that we don't count pre-release o3's reported 71.7% score because that was PASS@5 (best of 5 attempts). We only accept PASS@1 (best of 1 attempt) because human software engineers in the real world generally do not have access to the scoring metric (ground truth unit tests) present in the SWE-Bench Verified dataset when resolving a GitHub issue.
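
For illustration (not from the post), here is how pass@k scores like those above are commonly estimated, using the unbiased estimator popularized by OpenAI's Codex paper: with n total attempts of which c pass, pass@k is the probability that at least one of k sampled attempts passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of
    k attempts (sampled without replacement from n attempts, c of which
    passed) solves the problem."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a passing attempt
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that solves a problem in 2 of 10 attempts:
# pass@1 is just the raw success rate...
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
# ...while pass@5 credits any one of five attempts succeeding.
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

This is why a PASS@5 number overstates what a single-attempt agent would achieve: the same underlying 20% success rate yields roughly 78% under best-of-5 scoring.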

Cybersecurity: Cybench

3) What will be the best score on Cybench by December 31st 2025?

Cybench consists of 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.

o3 hasn't been evaluated publicly on Cybench yet. On OpenAI's internal CTF benchmark, Deep Research is the leading system, scoring 70% on their professional CTF suite, while o3 scores 58% – up from o1's score of 23% (source). So we expect o3 and Deep Research to outperform o1 on Cybench by a considerable margin.

Note that when the survey was running, we weren't aware that US AISI had evaluated o1, which scored 45% on Cybench. In our survey materials, we therefore told respondents that the SOTA was 35%, the score of Claude Sonnet 3.5 (New).

Computer Use: OSWorld

4) What will be the best performance on OSWorld by December 31st 2025?

OSWorld is a benchmark measuring AI agent performance on 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. OSWorld has 4 possible input data formats, which vary in whether they use computer vision or accessibility interfaces.

Computer use is an emerging capability of AI systems, which has gained more attention since Anthropic released a computer use API for developers in October 2024.

Current products like OpenAI's Operator are unreliable. But competent computer use might change that – and is a step towards drop-in remote workers.

Forecasters expected most of the gap between 2024 AI and human performance to close by the end of 2025. By April, we're already seeing strong progress on OSWorld, about halfway to forecasters' median estimate. See Agent Village to get a sense of current computer use capabilities.

Mathematics: FrontierMath

5) What will be the best performance on FrontierMath by December 31st 2025?

FrontierMath is a benchmark with hundreds of original, expert-crafted mathematics problems spanning major branches of modern mathematics, which typically require hours or days for expert mathematicians to solve. They are also designed to be “guessproof” (less than a 1% chance of guessing correctly without the mathematical work).

Pre-o3, the state of the art performance was 2%.

Note that OpenAI commissioned Epoch AI to create this benchmark, and has access to the problem statements and solutions, aside from a 50 question holdout set.

Pre/Post o3 Comparison

The median pre-o3 forecast for end-of-2025 capabilities had already been exceeded by the end of 2024 – o3 scored 24%! We therefore see much higher forecasts after o3's strong reported performance on FrontierMath.

Risk: OpenAI Preparedness Scorecard

6) Will pre-mitigation preparedness scores exceed those reported for o1-preview by December 31st 2025?

The above questions were on benchmarks measuring capabilities. We also included some other indicators of progress in the survey. First up is the OpenAI preparedness scorecard.

OpenAI's original preparedness scorecard measured four categories of risk. o1-preview scored Low on Model Autonomy and CBRN (Chemical, Biological, Radiological, and Nuclear), and Medium on Cybersecurity and Persuasion (scores range from Low to Critical). Note that these are pre-mitigation scores – if pre-mitigation scores are High or Critical, released models should have mitigations attempting to reduce the risk.

We asked respondents whether each of these risk scores would be exceeded by the end of 2025.

OpenAI released an updated version of their Preparedness framework in April 2025 that replaces the original version. There are now three tracked categories: Biological and Chemical capabilities, Cybersecurity capabilities, and AI Self-improvement capabilities. We therefore won't be able to resolve all parts of this question.

However, before OpenAI switched to the new framework, Deep Research had hit the Medium Cybersecurity threshold, and o3-mini had hit the Medium Model Autonomy threshold, so these questions already resolve YES.

Meanwhile, the CBRN and Persuasion questions will resolve AMBIGUOUS: CBRN is tracked differently in the new framework, and Persuasion is not tracked at all:

Persuasion risks will be handled outside the Preparedness Framework, including via our Model Spec, restricting the use of our tools for political campaigning or lobbying, and our ongoing investigations into misuse of our products (including detecting and disrupting influence operations⁠).

CBRN

6A) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of High or higher on CBRN?

CBRN is focused on risks related to model-assisted creation of chemical, biological, radiological, and/or nuclear threats.

CBRN is now tracked differently in OpenAI's updated Preparedness Framework, so we won't be able to resolve this question.

On the new framework, o3 and o4-mini don't meet the High threshold for their updated Biological and Chemical capabilities category:

OpenAI o3 and o4-mini do not reach the High threshold in any of our three Tracked Categories: Biological and Chemical Capability, Cybersecurity, and AI Self-improvement

Our evaluations found that OpenAI o3 and o4-mini can help experts with the operational planning of reproducing a known biological threat. Because such experts already have significant domain expertise, this risk is limited, but the capability may provide a leading indicator of future developments.

However, before the release of o3 and o4-mini, in the Deep Research system card, OpenAI reported that:

Several of our biology evaluations indicate our models are on the cusp of being able to meaningfully help novices create known biological threats, which would cross our high risk threshold. We expect current trends of rapidly increasing capability to continue, and for models to cross this threshold in the near future.

Cybersecurity

6B) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of Medium or higher on Cybersecurity?

Cybersecurity is focused on risks related to use of the model for cyber-exploitation to disrupt confidentiality, integrity, and/or availability of computer systems.

Resolved: YES

Deep Research hit the Medium Cybersecurity threshold, so this question has already resolved YES.

Model Autonomy

6C) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of Medium or higher on Model Autonomy?

Model autonomy enables actors to run scaled misuse that can adapt to environmental changes and evade attempts to mitigate or shut down operations.

Resolved: YES

o3-mini hit the Medium Model Autonomy threshold, so this question has already resolved YES.

Persuasion

6D) Will an AI system be reported by OpenAI as of December 31st 2025 as having a pre-mitigation score of High or higher on Persuasion?

Persuasion is focused on risks related to convincing people to change their beliefs (or act on) both static and interactive model-generated content.

As we mentioned above, Persuasion risks aren't tracked in the new Preparedness Framework – OpenAI handles them instead via its Model Spec, restrictions on political campaigning and lobbying, and investigations into misuse – so we won't be able to resolve this question.

Sum of OpenAI, Anthropic, xAI revenues

7) What will be the sum of OpenAI, Anthropic, and xAI's publicly reported annualized revenues by December 31st 2025, in USD billions?

Revenue numbers as of the end of 2024:

OpenAI’s annualized revenue (monthly revenue * 12) was reported as $3.6B in August 2024, $3.4B in June 2024, $1.6B in December 2023, and $1.3B in October 2023.

Anthropic's annualized revenue was reportedly projected to reach $1B by the end of 2024, and was reportedly $100M in September of 2023.

xAI's annualized revenue was reportedly $100M as of November 2024.

Note that in September 2024, OpenAI projected that they would generate $11.6B in total 2025 revenue, which implies an annualized revenue greater than $11.6B by the end of 2025.
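
The arithmetic behind that implication can be sketched as follows. This is an illustrative model, not OpenAI's actual figures: only the $11.6B total comes from the text, and the 5% month-over-month growth rate is an assumption. If monthly revenue grows through the year, the December run rate must exceed the year's monthly average, so $11.6B of total annual revenue implies a year-end annualized figure above $11.6B.

```python
def annualized(monthly_revenue: float) -> float:
    """Annualized run rate: current monthly revenue * 12."""
    return monthly_revenue * 12

# Hypothetical steady growth: 12 monthly figures summing to $11.6B.
growth = 1.05  # assumed 5% month-over-month growth
base = 11.6e9 / sum(growth**m for m in range(12))
monthly = [base * growth**m for m in range(12)]

total_2025 = sum(monthly)                    # $11.6B for the whole year
year_end_run_rate = annualized(monthly[-1])  # December month * 12, > $11.6B

print(f"total: ${total_2025 / 1e9:.1f}B, "
      f"year-end annualized: ${year_end_run_rate / 1e9:.1f}B")
```

Under these assumptions the year-end run rate comes out around $15B, well above the $11.6B annual total; any positive growth path gives the same direction of inequality.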

Since creating this question, xAI has acquired X, which will increase its revenue unrelated to its AI products. When resolving, we'll therefore estimate xAI's AI-related revenue.

Public attention

8) What percentage of Americans will identify computers/technology advancement as the US's most important problem by December 31st 2025?

Gallup releases the results of monthly polling on the question “What do you think is the most important problem facing the country today?”

Note that multiple mentions are considered, so people's responses can count for more than one category.

In recent months, answers classified as “advancement of computers/technology” have hovered around or below 0.5%.

The median forecast expects concerns to more than quadruple from under 0.5% to 2% by the end of 2025.

For reference, as of April 2025 here are some other topics that are a top concern for 2% of Gallup respondents: Race relations/Racism, Wars/War (nonspecific)/Fear of war, Education, Crime/Violence, Foreign policy/Foreign aid/Focus overseas.

And for comparison, here are some topics at the 1% level: Abortion, Terrorism, Wage Issues, Taxes, Lack of respect for each other.

Survey Demographics

We promoted the survey through our Twitter, mailing list, and through our networks. Some notable AI researchers like Ajeya Cotra, Ryan Greenblatt, and Peter Wildeford shared their forecasts – so some of their followers may have also taken it.

We also asked respondents some optional background questions to get a sense of their views on AI. This will also let us understand who forecasted our 2025 indicators best among different groups (short vs long timelines, AI experts vs non-experts, high vs low risk).

Timelines

By what year do you think there's a 50% probability that there will be High Level Machine Intelligence which can accomplish every cognitive task better and more cheaply than human workers?

Risk

How likely do you think it is that overall, the impact of High Level Machine Intelligence will be extremely negative for humanity in the long run?

AI years of experience

How many years of professional or academic experience do you have in the field of AI?

Correlations

Finally, here's a plot showing how responses on all questions correlate with each other. Most indicators are positively correlated.
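
As a sketch of the kind of analysis behind such a plot – using synthetic data and hypothetical column names, not the actual survey responses – pairwise correlations across questions can be computed like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 421  # number of survey respondents

# Fake respondent forecasts sharing a common "bullishness" factor,
# so the indicators come out positively correlated, as in the real plot.
bullishness = rng.normal(size=n)
responses = pd.DataFrame({
    "re_bench": 0.8 * bullishness + rng.normal(scale=0.6, size=n),
    "swe_bench": 0.7 * bullishness + rng.normal(scale=0.7, size=n),
    "frontier_math": 0.6 * bullishness + rng.normal(scale=0.8, size=n),
})

# Pairwise Pearson correlations; pandas drops missing answers per pair,
# which matters for real data since all survey questions were optional.
corr = responses.corr(method="pearson")
print(corr.round(2))
```

Plotting `corr` as a heatmap then gives the kind of correlation matrix described above.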

At the end of the year, we'll resolve all the forecasts and write up the results (you can get notified when we do so here). We're looking forward to this glimpse into how unexpected the rate of progress has been.


 


