Broader implications of the OpenAI-FrontierMath debacle

 

OpenAI's newest model, o3, made headlines with a breakthrough in mathematical reasoning, scoring 25% on Epoch AI's FrontierMath benchmark, far above the 2% of previous models. It later emerged, however, that OpenAI had prior access to the test problems and their answers, calling the fairness of the evaluation into question. Epoch AI was funded by OpenAI and provided it with the hard problems and solutions, raising broad concerns about AI benchmarking, evaluation, and safety. The episode highlights the importance of transparency and ethics in AI research, and the risks that can arise in the pursuit of technical breakthroughs.

🚩 OpenAI's o3 scored 25% on FrontierMath, far above the 2% of previous models, but OpenAI had prior access to the test problems and answers.

💰 Epoch AI accepted funding from OpenAI and provided it with most of FrontierMath's hard problems and solutions, casting doubt on the fairness of the evaluation results.

🤔 FrontierMath problems span three difficulty tiers, and o3's 25% score may have come only from the lower-difficulty problems, so the claimed breakthrough may be overstated.

📝 Future AI benchmarks should guarantee transparency about funding sources, data access, and data-usage agreements to prevent incidents like this from recurring.

🛡️ AI safety research must guard against the risk that "he who forges the shield can also forge the spear", i.e. that safety work inadvertently advances potentially dangerous capabilities.

Published on January 19, 2025 9:09 PM GMT

Recently, OpenAI announced their newest model, o3, achieving massive improvements over the state of the art on reasoning and math. The highlight of the announcement was that o3 scored 25% on FrontierMath, a benchmark by Epoch AI of ridiculously hard, unseen math problems, of which previous models could solve only 2%. The events after the announcement, however, revealed that, beyond OpenAI effectively having the answer sheet before taking the exam, the whole affair was shady and lacked transparency in every possible way, and that it has much broader implications for AI benchmarking, evaluations, and safety.

 

These are the important events that happened in chronological order:

 

Let's analyze how much of an advantage this access is, and how I believe it was possibly used.

I was completely dumbfounded when I learned a number of other things about the benchmark during a seminar presentation by a lead mathematician who worked on FrontierMath:

Firstly, the benchmark consists of problems in three tiers of difficulty -- (a) 25% olympiad-level problems, (b) 50% mid-difficulty problems, and (c) 25% difficult problems that would take a math Ph.D. student in the relevant domain a few weeks to solve.

What this means, first of all, is that the 25% announcement, which did NOT reveal the distribution of solved problems across the easy/medium/hard tiers, was entirely misleading. It is entirely possible that o3 solved problems only from the first tier, which is nowhere near as groundbreaking as solving the harder problems in the benchmark.
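To make the arithmetic concrete, here is a minimal sketch in Python (the tier fractions are the ones quoted above; the per-tier solve rates are hypothetical) showing that two very different capability profiles produce the same 25% headline number:

```python
# Tier fractions as stated in the seminar: 25% olympiad-level,
# 50% mid-difficulty, 25% research-level hard problems.
TIERS = {"olympiad": 0.25, "medium": 0.50, "hard": 0.25}

def aggregate_score(solve_rates: dict[str, float]) -> float:
    """Benchmark score as the tier-weighted average of per-tier solve rates."""
    return sum(TIERS[t] * solve_rates[t] for t in TIERS)

# Scenario A: every solved problem is olympiad-level, none harder.
print(aggregate_score({"olympiad": 1.0, "medium": 0.0, "hard": 0.0}))  # 0.25

# Scenario B: a uniform 25% solve rate across all tiers -- a far
# stronger result, yet the same headline number.
print(aggregate_score({"olympiad": 0.25, "medium": 0.25, "hard": 0.25}))  # 0.25
```

Without the per-tier breakdown, the headline number cannot distinguish these two scenarios.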

 

Secondly, OpenAI had complete access to the problems and to the solutions of most of them. This means they could actually have trained their models on them. However, they verbally agreed not to do so, and frankly I don't think they would have done that anyway, simply because this is too valuable a dataset to memorize.

Now, nobody really knows what goes on inside o3, but suppose it follows the kind of "thinking", inference-time-scaling, search-over-output-space approach published by other frontier labs: advanced chain-of-thought and introspection, combined with MCMC-style rollouts over the output distribution scored by a PRM-style (process reward model) verifier. In that case, FrontierMath is a golden opportunity to validate on.

Fig. 2 from this paper by Google DeepMind on scaling inference-time compute for vastly improved reasoning capabilities.

Quite simply, a model that learns inference-time reasoning as shown in that paper would greatly benefit from a good process-verifier reward model against which it can run a lookahead search over the output space, and datasets like FrontierMath are really high-quality data for validating universal, generalizable reasoning verifiers, something that is otherwise really hard to get right.
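As an illustration of the general technique (not OpenAI's actual method, which is unknown), here is a minimal, runnable sketch of PRM-guided lookahead search. Everything in it is a stand-in: `propose` represents an LLM sampling candidate next reasoning steps, and `prm_score` represents a learned process reward model; both are toy stubs here.

```python
import heapq
from typing import Callable

# A "state" is a partial chain of reasoning steps.
State = tuple[str, ...]

def prm_guided_search(
    propose: Callable[[State], list[str]],
    prm_score: Callable[[State], float],
    beam_width: int = 3,
    depth: int = 4,
) -> State:
    """Beam search over reasoning chains, keeping the beam_width partial
    chains that the process reward model scores highest at each depth."""
    beam: list[State] = [()]
    for _ in range(depth):
        # Lookahead: expand every chain in the beam with each proposed
        # next step, then keep only the top-scoring partial chains.
        candidates = [chain + (step,) for chain in beam for step in propose(chain)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=prm_score)
    return max(beam, key=prm_score)

# Toy stubs so the sketch runs end to end.
def propose(chain: State) -> list[str]:
    return [f"step{len(chain)}.{i}" for i in range(3)]

def prm_score(chain: State) -> float:
    # Toy reward: prefer chains whose steps end in ".0". A real PRM is a
    # learned model scoring the plausibility of each reasoning step.
    return sum(1.0 for step in chain if step.endswith(".0"))

print(prm_guided_search(propose, prm_score))
```

The search is only as good as `prm_score`: a verifier that only works in-distribution will steer the search astray on genuinely novel problems, which is exactly why a held-out set of very hard, unseen problems is such valuable data for checking whether a verifier generalizes.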

 

It is noteworthy that Epoch AI works on "investigating the trajectory of AI for the benefit of society", and a lot of the people funding them and working there are aligned to AI safety (except maybe OpenAI now). A number of the mathematicians who worked on FrontierMath are also aligned to AI safety, and would possibly not have contributed to this had they known.

 

It is completely outrageous that OpenAI could pull this off: in theory, they paid safety-aligned people to contribute to capabilities without them even realizing it, and that is cause for a lot of concern and discussion. In hindsight, it is quite obvious why OpenAI would want to conveniently hide this fact (think money at that scale).

Also, in this shortform, Tamay from Epoch AI acknowledged their mistake and promised to be more transparent about this:

 

This was a reply in a comment thread on meemi's Shortform:

 

However, we still haven't discussed everything, including concrete steps and the broader implications this has for AI benchmarks, evaluations, and safety. Specifically, these are the things we need to be extremely careful about in the future:

 

Note: A lot of these ideas and hypotheses came out of private investigations, connecting the dots between public announcements and private conversations I had with others in the Berkeley alignment community at Lighthaven as part of the MATS program. I take no credit for connecting all the dots; however, I do see a lot of value in more transparent discussion around this, and I have been deeply concerned about it, hence this post.



