TechCrunch News · February 17
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models

A recent study uses riddles from NPR's Sunday Puzzle segment to build an AI benchmark designed to evaluate reasoning and problem-solving ability. The test surfaced some curious behavior: reasoning models such as OpenAI's o1 sometimes "give up" and submit answers they know are incorrect. Unlike conventional AI tests that demand specialist knowledge, the Sunday Puzzle probes general knowledge and reasoning, which keeps models from solving the problems by rote memorization. The researchers hope the benchmark will allow AI models to be evaluated more broadly, drive progress in the field, and help more people understand what these models can and cannot do.

🧩 Sunday Puzzle riddles are used as an AI benchmark because they test general knowledge and reasoning rather than specialist expertise, making them closer to what ordinary users actually need.

🤔 The study found that even strong reasoning models like OpenAI's o1 sometimes "give up" on hard Sunday Puzzle questions and return wrong answers, and can even express human-like "frustration."

📈 o1 currently leads the benchmark with a score of 59%, followed by o3-mini at 47%. The researchers plan to expand testing to more reasoning models to identify where they could be improved.

📻 As a benchmark, the Sunday Puzzle's advantages are its public availability, its weekly freshness, and its focus on general knowledge and reasoning, which limits models' ability to cheat via prior training.

Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models — OpenAI’s o1, among others — sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks — even benchmarks released relatively recently — are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased such that models can't draw on "rote memory" to solve them, explained Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S. centric and English only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions — typically seconds to minutes longer.

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random — behavior this human can certainly relate to.
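To make the evaluation setup concrete, here is a minimal sketch of how a harness over such a riddle benchmark might look. The JSONL dataset format, the `query_model` helper, and the exact-match grading are assumptions for illustration; the study's actual code may work quite differently.

```python
# Minimal sketch of an evaluation loop over a riddle benchmark (hypothetical).
# The JSONL schema, the query_model callable, and the exact-match grading
# below are illustrative assumptions, not the study's actual harness.
import json
import re


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so near-identical answers compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def evaluate(model_name: str, dataset_path: str, query_model) -> float:
    """Score a model on riddles stored one per line as {"question": ..., "answer": ...}."""
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            item = json.loads(line)
            response = query_model(model_name, item["question"])
            # Treat a verbatim surrender ("I give up") as incorrect, mirroring
            # the behavior the researchers observed in DeepSeek's R1.
            gave_up = "i give up" in response.lower()
            if not gave_up and normalize(response) == normalize(item["answer"]):
                correct += 1
            total += 1
    return correct / total if total else 0.0
```

A real harness would also need to extract a final answer from the model's output rather than compare the full response string, since reasoning models return lengthy explanations alongside their answers.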

The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

R1 getting "frustrated" on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren’t — capable of.”
