MarkTechPost@AI · February 8
Singapore University of Technology and Design (SUTD) Explores Advancements and Challenges in Multimodal Reasoning for AI Models Through Puzzle-Based Evaluations and Algorithmic Problem-Solving Analysis

Researchers at the Singapore University of Technology and Design (SUTD) systematically evaluated OpenAI's GPT-[n] and o-[n] model series on multimodal puzzle-solving tasks. The study aimed to identify gaps in AI's perception, abstract reasoning, and problem-solving skills. Using the PuzzleVQA and AlgoPuzzleVQA datasets, which comprise abstract visual puzzles and algorithmic reasoning challenges, the researchers compared models such as GPT-4-Turbo, GPT-4o, and o1. The results show steady gains in reasoning ability across model generations, but the models still struggle with precise visual interpretation and depend heavily on answer prompts. Perception remains the primary challenge for all models, with accuracy improving significantly when explicit visual details are provided. Although o1 excels at numerical reasoning, it has difficulty with shape-based puzzles and incurs a sharply higher computational cost.

🧩 PuzzleVQA focuses on abstract visual reasoning, requiring models to recognize patterns in numbers, shapes, colors, and sizes, while AlgoPuzzleVQA poses algorithmic problem-solving tasks that demand logical deduction and computational reasoning.

📈 The study observed a marked upward trend in reasoning capability from GPT-4-Turbo to GPT-4o and o1. While GPT-4o showed moderate gains, the transition to o1 brought substantial improvements, at roughly 750x the computational cost of GPT-4o.

📊 On PuzzleVQA, o1 achieved an average accuracy of 79.2% in the multiple-choice setting, surpassing GPT-4o's 60.6% and GPT-4-Turbo's 54.2%. In open-ended tasks, however, all models declined: o1 scored 66.3%, GPT-4o 46.8%, and GPT-4-Turbo 38.6%.

👁️‍🗨️ The study identified perception as the primary limitation across all models. Injecting explicit visual details improved accuracy by 22%–30%, indicating a reliance on external perception aids. Inductive reasoning guidance boosted performance by a further 6%–19%, especially in numerical and spatial pattern recognition.

Following the success of large language models (LLMs), current research extends beyond text-based understanding to multimodal reasoning tasks. These tasks integrate vision and language, a capability essential for artificial general intelligence (AGI). Cognitive benchmarks such as PuzzleVQA and AlgoPuzzleVQA evaluate AI's ability to process abstract visual information and perform algorithmic reasoning. Even with these advancements, LLMs struggle with multimodal reasoning, particularly pattern recognition and spatial problem-solving, and high computational costs compound these challenges.

Prior evaluations relied on symbolic benchmarks such as ARC-AGI and visual assessments like Raven's Progressive Matrices. However, these do not adequately challenge AI's ability to process multimodal inputs. Recently, datasets like PuzzleVQA and AlgoPuzzleVQA have been introduced to assess abstract visual reasoning and algorithmic problem-solving. These datasets require models to integrate visual perception, logical deduction, and structured reasoning. While previous models, such as GPT-4-Turbo and GPT-4o, demonstrated improvements, they still faced limitations in abstract reasoning and multimodal interpretation.

Researchers from the Singapore University of Technology and Design (SUTD) introduced a systematic evaluation of OpenAI's GPT-[n] and o-[n] model series on multimodal puzzle-solving tasks. Their study examined how reasoning capabilities evolved across model generations, aiming to identify gaps in AI's perception, abstract reasoning, and problem-solving skills. The team compared the performance of models such as GPT-4-Turbo, GPT-4o, and o1 on the PuzzleVQA and AlgoPuzzleVQA datasets, which include abstract visual puzzles and algorithmic reasoning challenges.

The researchers conducted a structured evaluation using two primary datasets: 

    PuzzleVQA: focuses on abstract visual reasoning and requires models to recognize patterns in numbers, shapes, colors, and sizes.
    AlgoPuzzleVQA: presents algorithmic problem-solving tasks that demand logical deduction and computational reasoning.

The evaluation was carried out using both multiple-choice and open-ended question formats. The study employed zero-shot Chain-of-Thought (CoT) prompting for reasoning and analyzed the performance drop when switching from multiple-choice to open-ended responses. The models were also tested under conditions where visual perception and inductive reasoning cues were provided separately, to diagnose specific weaknesses.
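To make the setup concrete, below is a minimal sketch of what such a zero-shot CoT evaluation loop could look like, assuming an OpenAI-style chat completions API. The dataset layout, field names, and answer-extraction heuristic are illustrative assumptions, not the authors' actual harness.

```python
import json

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

COT_TRIGGER = "Let's think step by step."  # zero-shot CoT prompt suffix


def ask(model: str, image_url: str, question: str, choices=None) -> str:
    """Query a multimodal model on one puzzle with zero-shot CoT prompting."""
    prompt = question
    if choices is not None:  # multiple-choice: list the options in the prompt
        prompt += "\nOptions: " + ", ".join(choices)
    prompt += "\n" + COT_TRIGGER
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    )
    return resp.choices[0].message.content


def evaluate(model: str, path: str, open_ended: bool = False) -> float:
    """Score a JSONL file of puzzles; each record is assumed to hold
    'image', 'question', 'choices', and 'answer' fields (hypothetical schema)."""
    correct = total = 0
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            choices = None if open_ended else ex["choices"]
            reply = ask(model, ex["image"], ex["question"], choices)
            # Crude heuristic: count as correct if the gold answer appears
            # in the final line of the model's chain-of-thought reply.
            final_line = reply.strip().splitlines()[-1]
            correct += int(str(ex["answer"]).lower() in final_line.lower())
            total += 1
    return correct / total
```

Running evaluate with and without the choices field mirrors the paper's multiple-choice versus open-ended comparison; the perception and inductive-reasoning ablations would correspondingly append ground-truth visual descriptions or pattern hints to the prompt.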

The study observed steady improvements in reasoning capabilities across different model generations. GPT-4o showed better performance than GPT-4-Turbo, while o1 achieved the most notable advancements, particularly in algorithmic reasoning tasks. However, these gains came with a sharp increase in computational cost. Despite overall progress, AI models still struggled with tasks that required precise visual interpretation, such as recognizing missing shapes or deciphering abstract patterns. While o1 performed well in numerical reasoning, it had difficulty handling shape-based puzzles. The difference in accuracy between multiple-choice and open-ended tasks indicated a strong dependence on answer prompts. Also, perception remained a major challenge across all models, with accuracy improving significantly when explicit visual details were provided.

In a quick recap, the key findings can be summarized as follows:

    The study observed a significant upward trend in reasoning capabilities from GPT-4-Turbo to GPT-4o and o1. While GPT-4o showed moderate gains, the transition to o1 resulted in notable improvements but came at a 750x increase in computational cost compared to GPT-4o.
    Across PuzzleVQA, o1 achieved an average accuracy of 79.2% in multiple-choice settings, surpassing GPT-4o's 60.6% and GPT-4-Turbo's 54.2%. However, in open-ended tasks, all models exhibited performance drops, with o1 scoring 66.3%, GPT-4o at 46.8%, and GPT-4-Turbo at 38.6%.
    In AlgoPuzzleVQA, o1 substantially improved over previous models, particularly in puzzles requiring numerical and spatial deduction, scoring 55.3% in multiple-choice tasks compared to GPT-4o's 43.6% and GPT-4-Turbo's 36.5%. However, its accuracy declined by 23.1% in open-ended tasks.
    The study identified perception as the primary limitation across all models. Injecting explicit visual details improved accuracy by 22%–30%, indicating a reliance on external perception aids. Inductive reasoning guidance further boosted performance by 6%–19%, particularly in numerical and spatial pattern recognition.
    o1 excelled in numerical reasoning but struggled with shape-based puzzles, showing a 4.5% drop compared to GPT-4o in shape recognition tasks. It also performed well in structured problem-solving but faced challenges in open-ended scenarios requiring independent deduction.

Check out the Paper. All credit for this research goes to the researchers of this project.


