LessWrong, July 27, 2024
A Visual Task that's Hard for GPT-4o, but Doable for Primary Schoolers

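This article examines how well large language models handle children's math problems, using a simple problem from a children's math competition as an example, and finds that GPT-4o performs poorly on it. The author analyzes the model's errors, attributing them to weak spatial reasoning and a poor grasp of the problem's structure, then builds a generator for such problems to measure GPT-4o's accuracy. The result is an accuracy of 0.12, only modestly above the random baseline of 0.04. The author argues that although large language models excel in other areas, they still fall short on this kind of elementary problem, and expects future models to overcome these challenges.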

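🤔 GPT-4o performs poorly on a simple children's math competition problem, reaching an accuracy of only 0.12 against a random baseline of 0.04.

🧐 The errors stem from weak spatial reasoning and a poor grasp of the problem's structure. The model misjudges how hole positions correspond to object positions and does not apply the symmetry, namely that a hole further to the left reveals an object further to the right.

💡 The author built a generator to measure GPT-4o's accuracy on this kind of problem and invites other researchers to run the test on different models.

🚀 The author argues that although large language models excel in other areas, they still fall short on this kind of elementary problem, and expects future models to overcome these challenges.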

Published on July 26, 2024 5:51 PM GMT

Sure, transformers can get silver at the IMO. But before we address those two remaining problems that still stand between AlphaProof and gold, let's go a few steps back and see how transformers are doing at supposedly low-level math competitions. In particular, we will consider a problem aimed at grades 3 and 4, from the largest international math exam competition around (the Mathematical Kangaroo).

The problem is designed to test children's spatial thinking skills and rightfully held up as a good example in an interview with the competition's president Meike Akveld, a knot theorist at ETH. This article is where I came across the idea as well.

It goes like this:

In the following image, you can see an open picture book. There are holes in its cover. On the first page, there are pictures of objects. Which pictures are visible when the book is closed?

To remove some ambiguity, we also give the list of objects to choose from, so in this case "Ginger", "Kiwi", "Pineapple", "Pear", "Watermelon". OK then, GPT-4o, what do you think?

To determine which pictures will be visible when the book is closed, we need to align the holes on the cover with the pictures on the first page.

From the image:

1. The holes in the cover are located in the second and third columns from the left.
2. The pictures of objects on the first page — from left to right — are: pineapple, watermelon, ginger, kiwi, pear.

Thus, when the book is closed:
- The second column will show the first image: pineapple.
- The third column will show the second image: watermelon.

ANSWER: pineapple, watermelon

Hm, that's wrong! The answer is {"Pineapple", "Ginger", "Pear"}. Then again, there are some good ideas in the completion, namely the basic setup of the problem being correctly summarized and the pictures identified.
Now, it is not so surprising that vision transformers are bad at grids, so let us grant the error about the hole locations. But the model also doesn't understand the symmetry: closing the book mirrors the cover onto the page, so a hole further to the left reveals an object further to the right. That is exactly what the task is meant to test in children.
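To make the symmetry concrete, here is a minimal sketch of the mirroring rule in Python. The grid layout below is made up for illustration and is not the layout of the actual image; the function name `visible_objects` and the representation of cover and page as nested lists are assumptions of this sketch, not taken from the author's code.

```python
def visible_objects(cover_holes, page):
    """Return the set of objects seen through the cover once the book is closed.

    cover_holes: rows of booleans for the cover (left side), True means a hole.
    page: rows of object names (or None for an empty cell) for the first page.
    Closing the book flips the cover about the spine, so column c of the cover
    lands on column (width - 1 - c) of the page. Rows stay where they are.
    """
    width = len(page[0])
    visible = set()
    for r, row in enumerate(cover_holes):
        for c, has_hole in enumerate(row):
            obj = page[r][width - 1 - c]  # mirrored column
            if has_hole and obj is not None:
                visible.add(obj)
    return visible


# Hypothetical layout, chosen only for illustration.
page = [
    ["Ginger", "Kiwi", "Pineapple"],
    ["Pear", None, "Watermelon"],
]
cover_holes = [
    [True, False, True],
    [False, False, True],
]

print(sorted(visible_objects(cover_holes, page)))
```

In this made-up layout, reading the holes without mirroring would pick out Watermelon instead of Pear, which is the kind of mistake the task is designed to catch.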

To investigate whether this failure mode repeats across different images and iterations (seemingly it does), I made a generator for these problems and sampled 50 of them. GPT-4o with Chain of Thought reached an accuracy of 0.12, where a random baseline gives 0.04. The code is available here, and you are cordially invited to play around with it. It should be easy to rescore since it uses AISI's new Inspect eval framework, which I can tentatively recommend over the alternatives. I wonder how other models do on this (like Claude 3.5, which I had some trouble getting access to).
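For readers who want to rescore completions by hand, the following is a rough sketch of exact-match scoring in plain Python. It is not the author's scorer and does not use the Inspect API; the helper names `parse_answer` and `accuracy` are hypothetical, and the only thing borrowed from the post is the `ANSWER: ...` line that ends the GPT-4o completion quoted above. Under exact-match scoring, 0.12 on 50 samples corresponds to 6 fully correct answers.

```python
import re


def parse_answer(completion):
    """Return the lowercased object names from the last 'ANSWER: a, b' line.

    Returns None when the completion contains no such line.
    """
    matches = re.findall(r"^ANSWER:\s*(.*)$", completion, flags=re.MULTILINE)
    if not matches:
        return None
    return {name.strip().lower() for name in matches[-1].split(",") if name.strip()}


def accuracy(completions, targets):
    """Fraction of completions whose answer set equals the target set exactly."""
    correct = sum(
        parse_answer(completion) == {name.lower() for name in target}
        for completion, target in zip(completions, targets)
    )
    return correct / len(targets)


# Two toy completions scored against the same target set.
completions = [
    "Thus, when the book is closed...\nANSWER: pineapple, watermelon",
    "Mirroring the holes onto the page...\nANSWER: Pear, Ginger, Pineapple",
]
targets = [
    {"Pineapple", "Ginger", "Pear"},
    {"Pineapple", "Ginger", "Pear"},
]
print(accuracy(completions, targets))
```

Comparing sets rather than strings makes the score independent of the order in which the model lists the objects; partially correct answers earn no credit.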

This really should not be a fundamentally hard problem, and I expect an extensively fine-tuned GPT-4o to do significantly better (>0.6).[1] It remains puzzling to me, though, how far current models seem to be from consistently good 0-shot performance while performing so well in other areas. This echoes the ARC benchmark, but is perhaps less controversial with regard to whether humans are good at it. Just from the general speed of progress, I also expect this to be solved (>0.95) by a general-purpose multimodal model soon. Only, before or after IMO?

 

 

[1] Again, attempts at this are very much invited; I just don't know of any vision fine-tuning API.


