Forecasting AI Forecasting

Published on June 23, 2025 1:39 PM GMT

TL;DR: This post examines how rapidly AI forecasting ability is improving by analyzing results from ForecastBench, Metaculus AI tournaments, and various prediction platforms. While AI forecasting is steadily improving, there remains significant uncertainty around the timeline for matching top human forecasters—it might be very soon or take several years.

ForecastBench

ForecastBench is a “dynamic, continuously-updated benchmark designed to measure the accuracy of ML systems on a constantly evolving set of forecasting questions.” This makes it ideal for tracking how fast AI forecasting skills improve. Performance is measured as Brier scores (lower is better)[1]. See the accompanying benchmark paper here.
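To make the scoring rule concrete, here is a minimal sketch of how a Brier score is computed for a set of binary forecasts; the probabilities and outcomes are made-up illustrative values, not ForecastBench data.

```python
def brier_score(prob_yes: float, outcome: bool) -> float:
    """Brier score of a single binary forecast: (1 - P(actual outcome))^2."""
    p_outcome = prob_yes if outcome else 1.0 - prob_yes
    return (1.0 - p_outcome) ** 2

# Made-up forecasts: (probability assigned to "yes", how the question resolved)
forecasts = [(0.8, True), (0.3, False), (0.6, False)]

# A forecaster's overall score is the average Brier score across questions.
overall = sum(brier_score(p, o) for p, o in forecasts) / len(forecasts)
print(round(overall, 3))  # 0.163 (lower is better)
```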

The leaderboard shows AI performance alongside human benchmarks: "Public median forecast" represents non-experts, while "Superforecaster median forecast" represents elite human forecasters.

Evaluated AI systems operate under varying conditions—with or without access to a scratchpad for reasoning or to relevant information like news. Importantly, current forecasting prompts are likely good but suboptimal, meaning AI capabilities may be underestimated.

The benchmark paper reveals that AI forecasting performance correlates linearly with both Chatbot Arena scores and log training compute (measured in FLOP). Training compute may be of special interest for predicting when AIs reach the level of expert human forecasters.

Performance vs. Chatbot Arena and Log Training Compute

 

Brier Score vs. Chatbot Arena and Log training compute (source: ForecastBench paper)

Scores in the graph above correspond to the accuracy of each model under optimal conditions (e.g. if Claude-3-5-Sonnet performed better with a scratchpad, its score is the one achieved while using it).

Using Epoch AI's training compute data, the AI with the most training compute at each point in time shows a 4.9x annual increase in training compute. The current leader is Grok 3, which used an estimated 4.6×10^26 FLOP.

In the ForecastBench paper, the authors extrapolate the trend line for prediction accuracy and training compute:

Projecting out the log-linear relationship, we find that LLMs could match superforecaster performance when training compute approaches 6.49 × 10^26, though there is a large confidence interval (bootstrapped 95% CI: 9.69×10^25–8.65×10^28) given the marginally significant relationship (r = −0.67, p = 0.046).

The training compute for Grok 3 (4.6×10^26 FLOP) is already remarkably close to the projected 6.49 × 10^26! Even the upper bound of the bootstrapped confidence interval would be reached by June 2028, if training compute continues to increase at the same exponential rate[2].
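As a rough sanity check on those numbers (essentially the same arithmetic as footnote 2), here is a minimal sketch that extrapolates frontier training compute from Grok 3's estimated 4.6×10^26 FLOP at the 4.9x annual growth rate and solves for when the projected thresholds would be crossed; the growth rate and thresholds are the figures quoted above, not independently verified.

```python
import math
from datetime import date, timedelta

GROK3_RELEASE = date(2025, 2, 17)
GROK3_FLOP = 4.6e26
ANNUAL_GROWTH = 4.9  # assumed frontier training compute multiplier per year

def years_to_reach(target_flop: float) -> float:
    """Solve GROK3_FLOP * ANNUAL_GROWTH**y = target_flop for y."""
    return math.log(target_flop / GROK3_FLOP) / math.log(ANNUAL_GROWTH)

for label, target in [("central estimate", 6.49e26), ("upper 95% CI bound", 8.65e28)]:
    y = years_to_reach(target)
    print(f"{label}: {y:.2f} years -> ~{GROK3_RELEASE + timedelta(days=365.25 * y):%B %Y}")
# central estimate: 0.22 years -> ~May 2025
# upper 95% CI bound: 3.30 years -> ~June 2028
```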

Performance vs. Release Date

Looking at release dates rather than training compute tells a different story. Using leaderboard data from June 8, 2025, this graph shows AI performance against model release dates.

Overall Score vs release date (leaderboard data from 8 June 2025)

As with the previous graphs, scores correspond to the best conditions for each AI model.

The graph above shows Overall Scores from the leaderboard, which incorporate accuracy estimates on questions that have yet to resolve, obtained by comparing forecasts to crowd predictions. This may make the scores somewhat less reliable, so it seems prudent to also examine the Overall Resolved Scores, which ignore unresolved questions:

Overall Resolved Score vs release date (leaderboard data from 8 June 2025)

The resulting graph is remarkably similar. GPT-4 dominated for a long time, until the arrival of GPT-4-Turbo. Note that AIs are approaching the public median score on both Overall Score and Overall Resolved Score. Since AIs have improved considerably since Claude 3.5 Sonnet[3], frontier models would likely surpass the public median forecast.

The red line in each graph is fitted to the running maximum score. In the following graphs, the lines are projected forward until they intersect with the superforecaster score—just keep in mind that the projections will surely be wildly off, since the lines were fitted to only three data points.

The intersection dates are 2028-12-19 for Overall Score, and 2034-05-05 for Overall Resolved Score.
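For concreteness, here is a minimal sketch of that intersection calculation: fit a line to the running-maximum score over release dates, then solve for the date at which the line reaches the superforecaster score. The dates, scores, and threshold below are placeholder values, not the actual leaderboard numbers.

```python
import numpy as np
from datetime import date

# Placeholder points: (model release date, best score achieved so far) -- not real leaderboard data.
points = [(date(2023, 3, 14), 0.142),
          (date(2023, 11, 6), 0.135),
          (date(2024, 6, 20), 0.129)]
SUPERFORECASTER_SCORE = 0.093  # placeholder target; lower Brier-style scores are better

x = np.array([d.toordinal() for d, _ in points], dtype=float)
y = np.array([score for _, score in points])

slope, intercept = np.polyfit(x, y, 1)                  # line through the running-maximum scores
crossing = (SUPERFORECASTER_SCORE - intercept) / slope  # ordinal date where the line hits the target
print(date.fromordinal(int(round(crossing))))           # projected date of matching superforecasters
```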

While the projections aren’t exactly reliable, the fact that the projection is much longer when only considering accuracy on resolved questions is quite interesting. When we get results for later models, I suspect this result will hold.

Metaculus AI Tournaments

Metaculus runs quarterly AI forecasting tournaments, directly pitting bots against human experts. While Q1 2025 results are available for bots, human baseline comparisons haven't been released yet.

The Q4 2024 analysis showed the best AI team (ironically just one bot called "pgodzinai") achieved a head-to-head score of -8.6 against median Metaculus Pro Forecaster predictions—an improvement from -11.3 in Q3 2024[4]. A negative head-to-head score implies worse performance compared to the Pros, while 0 represents equal accuracy.

The forecasting bots are scored using Total Spot Peer Scores. Peer score calculates the average difference in log score between a participant and all other forecasters on each question, ensuring the average Peer score is 0. Spot peer score is the peer score evaluated when the Community Prediction is revealed (hidden from bots to prevent gaming). Total Spot Peer Score (or just Total Spot Score) sums all individual spot peer scores.
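As a simplified illustration of the scoring scheme described above (a sketch of the idea, not Metaculus's exact implementation, which also involves scaling details not shown here), the peer score on a single binary question can be computed as the average difference in log score between one forecaster and every other forecaster:

```python
import math

def log_score(prob_yes: float, outcome: bool) -> float:
    """Log score: log of the probability assigned to the outcome that actually happened."""
    return math.log(prob_yes if outcome else 1.0 - prob_yes)

def peer_score(my_prob: float, other_probs: list[float], outcome: bool) -> float:
    """Average difference in log score between this forecaster and every other forecaster."""
    mine = log_score(my_prob, outcome)
    return sum(mine - log_score(p, outcome) for p in other_probs) / len(other_probs)

# Hypothetical question that resolved "yes": my bot's forecast vs. three other bots' forecasts.
print(peer_score(0.8, [0.6, 0.7, 0.55], outcome=True))  # positive => better than the field
# A bot's Total Spot Peer Score is then the sum of its spot peer scores across all questions.
```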

While anyone can make a bot and participate in the tournaments, there is a collection of bots that use the same framework, providing baseline scores for specific AI models. These bots are distinguished from other participants by their names, which always take the form metac-[model name] (for instance, metac-o1 uses only OpenAI’s o1 model for predictions).

Plotting baseline scores for Q1 2025 against release date shows steady improvement over time. Baseline scores for Perplexity and Exa AI were also provided but are excluded here due to unclear release dates.

We can compare Metaculus and ForecastBench scores for the two overlapping AI models:

Two data points prevent reliable conclusions, but if these scores accurately represent forecasting capabilities, small Brier score improvements correspond to large Total Spot Score gains. Even though o1 scores several thousand points higher than Claude-3.5-sonnet-20240620 on Metaculus, o1 probably only slightly exceeds public median forecasts on ForecastBench (if at all), if tested under the same conditions as the other ForecastBench models.

Q2 2025 tournament early results (screenshot from June 17) show the trend continuing, with later models like o3 and DeepSeek-R1 outperforming predecessors. Note these results aren't final since many questions remain unresolved, and scores aren't comparable across tournament iterations due to scoring being relative to the performance of other participants.

CumulativeBot performs well but doesn't dramatically exceed metac-o1's performance—not the leap required to go from roughly public level to superforecaster level, as far as I can tell.

In Q1 2025, metac-o1 achieved first place; the absence of any metac bot at the top of Q2 suggests that AI forecasting techniques are improving, including prompting, tool access, fine-tuning, and methods for aggregating predictions from several AI models into a single bot prediction.

Forecasting Platform Predictions

Related predictions from forecasting platforms, in roughly chronological order:

Wrapping it up

We have a lot of interesting results:

What conclusions can we draw from this? The trends point in different directions. The training compute trend suggests superforecaster AIs very soon, while the release-date trend for scores on resolved questions suggests such AIs as late as 2034 (though this doesn’t account for framework improvements).

We’ll know more once ForecastBench has results on more recent models, and when we have head-to-head scores for the Q1 and Q2 2025 Metaculus tournaments.

Currently, my median estimate is at mid-2026 for the best forecasting bots reaching superforecaster levels, with an 80% confidence interval from October 2025 to 2030. Considering only the above information would suggest longer timelines, but I suspect performance may increase rapidly through enhanced scaffolding as more attention and resources focus on AI forecasting ability.

I also expect AI capability trends—including forecasting skills—to accelerate when AIs automate larger portions of AI R&D, initiating an intelligence explosion that makes longer projections (like 2034) seem unlikely[5].

Hopefully, we achieve superhuman forecasting before such rapid capability improvements—it may prove vital for decision-making and strategizing to handle rapid societal changes and extreme risks from advanced AIs.


Thank you for reading! If you found value in this post, consider subscribing!

  1. ^

    The Brier score of a prediction is given by (1 - P(outcome))^2, where P(outcome) is the probability assigned to the actual outcome. The Brier score for a set of predictions is the average of the individual forecasts' Brier scores.

  2. ^

    Grok 3 was released February 17, 2025, with 4.6×10^26 FLOP in training compute. Let y be the number of years between the release of Grok 3 and the date when extrapolated training compute hits 8.65×10^28 (starting from Grok 3's compute and increasing by 4.9x per year). We then have:

    4.6×10^26 × 4.9^y = 8.65×10^28,
    y = log(8.65×10^28 / 4.6×10^26) / log(4.9) ≈ 3.3

    3.3 years from the release of Grok 3 is around June 2028.

  3. ^

    For instance, Sonnet 4 achieves 72.2% pass@1 on the programming benchmark SWE-bench verified, and 75.4% on the graduate level reasoning benchmark GPQA Diamond, while the Sonnet 3.5 version scored in ForecastBench achieves 33.4% on SWE-bench and 59.4% on GPQA Diamond.

  4. ^

    Head-to-head score is calculated as Peer score—which compares one forecaster to all other forecasters with predictions on the same question—but with only two participants. More info here.

  5. ^

    This is essentially what happens in the AI 2027 scenario—where the intelligence explosion takes place in 2027. The best AI reaches the forecasting ability score 1.0 in May 2026, between the level of a human amateur (score 0-1) and pro (score 1-2). It reaches the score 2.1 in August 2027, better than any human.


