LessWrong, March 20
Improved visualizations of METR Time Horizons paper.

 

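The author studied METR's recent time horizons paper with the aim of visualizing when models reach 95% or 99% accuracy and above. By analyzing the distributions of model-specific accuracy levels for specific time horizon buckets, the author recorded how time horizons differ across accuracy levels for 6 models, produced a visualization, and discussed the related trends.

The METR paper only visualizes tasks at 50-80% accuracy; the author wants to create visualizations for higher accuracy levels.

The author analyzes the distributions of model-specific accuracy levels and records how time horizons differ across accuracy levels for 6 models.

The author produced a visualization with an expanded x-axis and short-term trend slopes for the new 80%-99% accuracy trends.

The author thinks the short-term trend of 4-month doublings may be too optimistic, while the long-term trend of 7-month doublings may underestimate progress.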

Published on March 19, 2025 11:36 PM GMT

TLDR: Skip to the last image at the bottom of my post to see the visualization.

I don't usually post here, but I know you folks like forecasting and timelines, so I think you might appreciate this. The recent time horizons paper by METR is great, but it only visualized tasks that models could complete with 50-80% accuracy. I think it's highly useful, and arguably even more useful, to see when capabilities would hit 95% or 99% accuracy or beyond for various time horizons.

So I went through the paper looking for the information needed to create such a visualization, and I found the distributions it shows of model-specific accuracy levels for specific time horizon buckets.



I noted these down and noticed that the slopes for those higher accuracy levels are similar, if not identical, to the slopes for the lower ones when plotted against time horizons (the paper also notes that the slope for 80% accuracy was conveniently very similar to the slope for 50% accuracy). This makes the visualization math easier as well: I keep the slope steepness the same and just apply the appropriate Y-axis adjustment relative to release date.
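To make the Y-axis adjustment concrete, here is a minimal Python sketch of the idea, not the exact procedure used for the plot. It assumes the higher-accuracy time horizon is the 50% horizon divided by a constant multiple, and that both trend lines share one doubling time. The function names and the 60-minute input are hypothetical; the 5x multiple and the 7-month doubling time are the figures from the paper that are quoted further down.

```python
import math

def shifted_horizon(horizon_50, multiple):
    """Time horizon at a higher accuracy level, assuming it equals the
    50%-accuracy horizon divided by a constant multiple."""
    return horizon_50 / multiple

def equivalent_delay_months(multiple, doubling_months):
    """Months by which the higher-accuracy trend line lags the 50% line,
    assuming both lines have the same doubling time (same slope on a
    log-scaled Y-axis)."""
    return math.log2(multiple) * doubling_months

# 60 minutes is a made-up input used only for illustration.
print(shifted_horizon(60.0, 5.0))                    # 12.0
print(round(equivalent_delay_months(5.0, 7.0), 1))   # 16.3
```

On a log-scaled Y-axis, dividing by a constant multiple moves the line down without changing its slope, which is the same as moving it right by a fixed number of months.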

I noted down the multiples between the time horizons at 50%, 80%, 95%, and 99% accuracy, and did this for the following 6 models:
- Claude-3.7-sonnet
- Claude-3.5-sonnet(new)
- O1
- GPT-4o
- Claude-3-Opus
- GPT-4-0314

The paper mentions a 5x multiple between the time horizons at 50% and 80% accuracy, and when I averaged my numbers for these 6 models I arrived at the same figure. The other accuracy multiples look like this:



Finally, here is the visualization I ended up making. It has an expanded x-axis and, for each of the new 80%-99% accuracy trends, adds the short-term trend slope of reasoning models, which was also mentioned in the original paper (doubling every 4 months, based on the 2024-2025 data points).



I think there is good reason to believe that the short-term trend of 4-month doublings may be overoptimistic, but I can also see many arguments as to why the long-term 7-month doubling trend underestimates progress. So I think it's good to include both here. Some even believe the trend could go beyond the 4-month doubling soon, and frankly I can see that being reasonably possible too.
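To get a feel for how much the choice between the two doubling times matters, here is a small Python sketch that assumes pure exponential growth at a fixed doubling time. The function names and the 1-hour to 8-hour example are hypothetical and are not taken from the paper or the plot.

```python
import math

def months_to_reach(current_horizon, target_horizon, doubling_months):
    """Months until an exponential trend grows from current to target,
    assuming a constant doubling time."""
    return doubling_months * math.log2(target_horizon / current_horizon)

def horizon_after(current_horizon, months, doubling_months):
    """Time horizon after a number of months on the same trend."""
    return current_horizon * 2 ** (months / doubling_months)

# Hypothetical example: growing a time horizon from 1 hour to 8 hours
# takes three doublings under either trend.
for doubling in (4, 7):
    print(doubling, months_to_reach(1.0, 8.0, doubling))
# 4 12.0
# 7 21.0
```

The gap between the two trends grows with every doubling needed, so the longer the target time horizon, the more the forecast depends on which slope you believe.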

Edit: I see people are upvoting but not commenting. As this is one of my first posts here, I would appreciate it if anyone is willing to give feedback on my formatting, on whether there are better ways I can present images, or even just on how this may influence your timelines (or reasons why it doesn't).


 


