Some lessons from the OpenAI-FrontierMath debacle

 

OpenAI's newest model, o3, has made major progress in mathematical reasoning, most notably on the FrontierMath benchmark, where scores jumped from the previous 2% to 25%. However, transparency issues around the benchmark have sparked controversy. While developing it, Epoch AI did not fully disclose its funding relationship with OpenAI or the data access it granted, which gave OpenAI exclusive access before the model's release. Although OpenAI says it did not use the data to train the model, the advantage it may have gained in verifying reasoning raises concerns for AI evaluations, benchmarks, and safety. Future benchmarks should ensure transparency around funding and data-use agreements to keep this from happening again.

💰 While developing the FrontierMath benchmark, Epoch AI did not clearly inform contributors of its funding source and the data-access arrangements, leaving them without key information.

📊 Before releasing the o3 model, OpenAI had exclusive access to the FrontierMath benchmark, which may have given it an unfair advantage on the test.

📝 FrontierMath contains problems at different difficulty levels, and OpenAI's 25% score may come mostly from the easier portion, which casts doubt on how impressive the achievement really is.

🛡️ Future AI benchmarks should be developed with full transparency about funding, data access, and data-use agreements to avoid similar controversies.

Published on January 19, 2025 9:09 PM GMT

Recently, OpenAI announced their newest model, o3, achieving massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark of hard, unseen math problems of which previous models could solve only 2%. The events that followed show that the announcement was, perhaps unknowingly, not made completely transparent, and they leave us with lessons for future AI benchmarks, evaluations, and safety.

The Events

These are the important events that happened in chronological order:

From their latest version of the paper on arxiv (link).

 

Further Details

While all of this already felt concerning at the time of the December update, recent events have shed more light on its implications.

Other reasons why this is bad

Let's analyse how much of an advantage this access is, and how I believe it may have been used. I was surprised to learn a number of other things about the benchmark during a presentation by someone who worked on FrontierMath:

 

Firstly, the benchmark consists of problems on three tiers of difficulty -- (a) 25% olympiad level problems, (b) 50% mid-difficulty problems, and (c) 25% difficult problems that would take an expert mathematician in the domain a few weeks to solve.

What this means, first of all, is that the 25% announcement, which did NOT reveal how the solved problems were distributed across the easy/medium/hard tiers, was somewhat misleading. It is entirely possible that most of the problems o3 solved came from the first tier, which would be far less groundbreaking than solving problems from the hardest tier, the tier the benchmark is best known for.
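To make the arithmetic concrete, here is a hypothetical back-of-the-envelope sketch in Python. The per-tier solve rates below are made up (the actual breakdown was not disclosed); the point is simply that an overall score of about 25% is consistent with solving only the easiest tier.

```python
# Hypothetical back-of-the-envelope check, NOT actual per-tier results
# (those were not disclosed). Tier shares follow the reported 25/50/25 split.
tier_share = {"tier1_olympiad": 0.25, "tier2_mid": 0.50, "tier3_expert": 0.25}

def overall_score(solve_rate_per_tier):
    """Weighted overall accuracy given an assumed per-tier solve rate."""
    return sum(tier_share[t] * solve_rate_per_tier[t] for t in tier_share)

# Scenario A: the model solves every tier-1 problem and nothing else.
print(overall_score({"tier1_olympiad": 1.0, "tier2_mid": 0.0, "tier3_expert": 0.0}))   # 0.25

# Scenario B: the same headline number, but spread evenly across all tiers.
print(overall_score({"tier1_olympiad": 0.25, "tier2_mid": 0.25, "tier3_expert": 0.25}))  # 0.25
```

Both scenarios produce the same 25% headline figure, yet they say very different things about the model's ability on the hardest problems.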

 

Secondly, OpenAI had complete access to the problem statements and solutions for most of the benchmark. This means they could, in theory, have trained their models to solve them. However, they verbally agreed not to do so, and frankly I don't think they would have done that anyway, simply because this is too valuable a dataset to waste on memorization.

 

How OpenAI could use this, without actually training o3 on it explicitly

 

Now, nobody really knows what goes on behind o3 (this post has some hypotheses about o1). But if it follows the "thinking" style of inference-time scaling used by models from other frontier labs, possibly combining advanced chain-of-thought and recursive self-improvement with an MCMC-style search guided by a PRM verifier, then FrontierMath would be a golden opportunity to validate that PRM on. (Note that I don't claim it was used this way -- I just want to emphasize that a verbal agreement not to train on the data is not enough.)

 

Fig. 2 from this paper by Google DeepMind on scaling inference-time compute for vastly improved reasoning capabilities.

Quite simply, a model that scales inference-time compute could benefit greatly from a process-verifier reward model used for lookahead search over the output space, and benchmarks like this could be really good-quality data for validating universal, generalizable reasoning verifiers, something that is otherwise really hard to get right.
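To illustrate the hypothesis, here is a minimal, hedged sketch of what "validating a verifier on a held-out benchmark" could look like. Every function name here is a stand-in I made up for illustration; none of this is known to reflect OpenAI's actual pipeline. The idea is just that you can sample many candidate solutions, let the verifier pick one per problem, and use the benchmark's ground-truth answers to measure how well the verifier selects correct solutions, all without any gradient updates on the problems themselves.

```python
import random

def generate_candidates(problem, n=16):
    """Stand-in for sampling n chain-of-thought solutions from the base model."""
    return [f"candidate solution {i} for {problem['id']}" for i in range(n)]

def prm_score(problem, candidate):
    """Stand-in for a process-reward-model verifier scoring a candidate's reasoning."""
    return random.random()

def final_answer(candidate):
    """Stand-in for extracting the final answer from a candidate solution."""
    return random.choice(["42", "17", "0"])

def best_of_n_accuracy(benchmark):
    """Accuracy when the verifier picks one candidate per problem.

    If this is much higher than picking a candidate at random, the verifier
    generalizes to unseen hard problems; that is exactly the signal a held-out
    benchmark with ground-truth answers provides, with no training involved.
    """
    correct = 0
    for problem in benchmark:
        candidates = generate_candidates(problem)
        best = max(candidates, key=lambda c: prm_score(problem, c))
        correct += final_answer(best) == problem["answer"]
    return correct / len(benchmark)

# Toy usage with made-up problems and answers.
toy_benchmark = [{"id": f"p{i}", "answer": "42"} for i in range(10)]
print(best_of_n_accuracy(toy_benchmark))
```

The point of the sketch is that "we did not train on the data" does not rule out using the data as a high-quality validation signal for the search-and-verify loop, which is why a data-use agreement needs to be more specific than a verbal promise not to train.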

 

Implications on AI Safety

 

Epoch AI works on investigating the trajectory of AI for the benefit of society, and several people involved may well be concerned with AI safety. Also, a number of the mathematicians who worked on FrontierMath might not have contributed had they known about the funding and the exclusive access. It is concerning that OpenAI could have, in effect, paid people to contribute to capabilities work they might not have wanted to support.

 

Recent open conversations about the situation

 

Also, in this shortform, Tamay from Epoch AI replied to this (which I really appreciate):

 

This was in reply to a comment thread on meemi's Shortform:

 

However, we still haven't discussed everything, including concrete steps and the broader implications this has for AI benchmarks, evaluations, and safety. 

 

Expectations from future work

 

Specifically, these are the things we need to be careful about in the future:

I've been thinking about this last part a lot, especially the ways in which various approaches can contribute indirectly to capabilities, despite careful efforts not to do so directly.

 

Note: A lot of these ideas and hypotheses came out of private conversations with other researchers about the public announcements. I take no credit for them, and I may be wrong in my hypotheses about how exactly the dataset could have been used. However, I do see value in more transparent discussion around this, hence this post.

 

PS: I completely support Epoch AI and their amazing work on studying the trajectory and governance of AI, and I deeply appreciate that they came forward, acknowledged the concerns, and are willing to do the best that can be done to fix things going forward.



Discuss
