MarkTechPost@AI, February 25
This AI Paper from Menlo Research Introduces AlphaMaze: A Two-Stage Training Framework for Enhancing Spatial Reasoning in Large Language Models

Researchers at Menlo Research have proposed a two-stage training framework called AlphaMaze, designed to improve the spatial reasoning ability of large language models (LLMs). The framework combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to improve decision-making in maze navigation. First, the model is exposed to a curated dataset of tokenized maze representations and learns step-by-step movement sequences. GRPO is then applied to refine sequential decision-making and encourage structured reasoning. Experimental results show that the framework significantly improves maze-solving accuracy, opening a promising path toward equipping LLMs with advanced spatial reasoning capabilities for real-world applications.

🧩 AlphaMaze is a two-stage training framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to improve the decision-making of large language models (LLMs) in spatial tasks such as maze navigation.

🗺️ In the first stage, SFT exposes the LLM to a dataset of tokenized visual maze representations; the model learns to predict movement commands and to understand spatial relationships such as walls, pathways, start points, and targets, laying the groundwork for subsequent reasoning.

🎯 In the second stage, GRPO applies reinforcement learning to refine the decision process, rewarding efficient and accurate navigation strategies without human feedback; through iterative refinement, the model becomes better at solving problems in complex environments.

📊 Experimental results show that a model trained with AlphaMaze improves maze-solving accuracy markedly, from an initial 0% to 93%, demonstrating the framework's effectiveness at strengthening LLM spatial reasoning. The MazeBench evaluation suite consists of 100 unique maze challenges spanning easy, medium, and hard levels, ensuring that performance gains are assessed across different degrees of complexity.

Artificial intelligence continues to advance in natural language processing but still faces challenges in spatial reasoning tasks. Visual-spatial reasoning is fundamental for robotics, autonomous navigation, and interactive problem-solving applications. AI systems must effectively interpret structured environments and execute sequential decisions to function in these domains. While traditional maze-solving algorithms, such as depth-first search and A*, provide deterministic solutions, they do not generalize well to varied spatial tasks. Advancements in deep learning and reinforcement learning offer potential solutions, but existing methods struggle with efficiency and adaptability in real-world applications.
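For contrast with the learned approach, a deterministic solver for a single fixed maze is short and exact but does not transfer to broader spatial tasks. The depth-first search below over a character grid is a minimal sketch of that kind of hand-coded baseline; the grid encoding and helper conventions are assumptions for illustration, not taken from the paper.

```python
def dfs_solve(grid, start, goal):
    """Depth-first search over a character-grid maze.

    grid: list of strings, '#' = wall, anything else = free cell.
    start, goal: (row, col) tuples. Returns a list of moves or None.
    (Classical deterministic baseline, not the AlphaMaze method.)
    """
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    stack = [(start, [])]
    seen = {start}
    while stack:
        (r, c), path = stack.pop()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                    and grid[nr][nc] != "#" and (nr, nc) not in seen):
                seen.add((nr, nc))
                stack.append(((nr, nc), path + [name]))
    return None

# Example: 'S' start, 'G' goal, '#' wall
maze = ["S.#", ".##", "..G"]
print(dfs_solve(maze, (0, 0), (2, 2)))  # ['down', 'down', 'right', 'right']
```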

A major challenge in AI spatial reasoning is enabling language models to interpret and execute actions based on visual information. Large Language Models (LLMs) process textual data proficiently but lack intrinsic spatial understanding. Their token-based learning structure does not naturally map complex visual environments into sequential decision-making. Training such models to comprehend and navigate structured spaces like mazes requires novel methodologies incorporating tokenized visual data. Without an effective framework for integrating these representations, models cannot accurately predict movement sequences or adapt their reasoning to changing environments.

Prior methods for solving spatial tasks in AI include supervised training approaches that employ labeled datasets. Reinforcement learning techniques have also been explored, particularly in robotics and autonomous systems. These approaches, however, require extensive computational resources and often rely on manually curated datasets. Despite some success, these methods fail to generalize across different problem settings and struggle with multi-step reasoning. AI-driven spatial reasoning requires a systematic training approach that improves adaptability and decision-making without excessive human intervention.

Researchers at Menlo Research introduced AlphaMaze, a two-stage training framework to enhance LLMs’ ability to reason spatially. The framework integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to improve decision-making in maze navigation. The training starts by exposing the model to a curated dataset of tokenized maze representations, allowing it to learn step-by-step movement sequences. Once the model demonstrates basic competency, GRPO is applied to refine sequential decision-making and encourage structured reasoning. By optimizing reinforcement learning strategies, this approach bridges the gap between language processing and spatial problem-solving.
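To make the first stage concrete, the sketch below shows one way a grid maze could be serialized into special tokens and paired with its step-by-step move sequence as a supervised fine-tuning example. The token names (`<wall>`, `<path>`, `<origin>`, `<target>`, `<move_*>`) and the prompt layout are illustrative assumptions, not the exact vocabulary or data format used in the paper.

```python
# Toy serialization of a grid maze into special tokens, paired with the
# move sequence the model should produce. Token names and prompt layout
# are illustrative assumptions, not the paper's exact vocabulary.
TOKENS = {"#": "<wall>", ".": "<path>", "S": "<origin>", "G": "<target>"}

def maze_to_tokens(grid):
    """Serialize the maze row by row into a flat token string."""
    return " <row> ".join(
        " ".join(TOKENS[cell] for cell in row) for row in grid
    )

def make_sft_example(grid, solution_moves):
    """Build one (prompt, completion) pair for supervised fine-tuning."""
    prompt = "Solve the maze: " + maze_to_tokens(grid)
    completion = " ".join(f"<move_{m}>" for m in solution_moves)
    return {"prompt": prompt, "completion": completion}

example = make_sft_example(
    ["S.#", ".##", "..G"],
    ["down", "down", "right", "right"],
)
print(example["prompt"])
print(example["completion"])  # <move_down> <move_down> <move_right> <move_right>
```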

The training framework consists of two distinct phases. Initially, Supervised Fine-Tuning (SFT) is used to introduce LLMs to tokenized visual representations of mazes. The model learns to predict movement commands by processing spatial relationships encoded within the dataset. Each maze is structured as a grid where unique tokens represent walls, pathways, start points, and targets. This structured input allows the model to understand movement constraints and potential pathways. The second phase introduces GRPO, a reinforcement learning approach that refines decision-making by rewarding efficient and accurate navigation strategies. Unlike standard reinforcement learning, GRPO leverages group-based optimization techniques and eliminates reliance on human feedback. The model undergoes iterative refinements, progressively improving its ability to solve mazes with minimal errors and self-correcting behaviors.
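The group-relative part of GRPO can be pictured as sampling several candidate solutions for the same maze, scoring each with an automatic reward (for example, legality of every move and whether the target is reached), and normalizing each score against the group's mean and standard deviation, so no learned value model or human feedback is needed. The snippet below is a minimal sketch of that signal; the specific reward terms are assumptions for illustration, not the authors' implementation.

```python
import statistics

STEPS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def maze_reward(grid, start, goal, moves):
    """Toy automatic reward: -1 for any illegal move, 1 for reaching the
    target, 0 otherwise. The exact reward shaping is assumed, not the paper's."""
    r, c = start
    for m in moves:
        dr, dc = STEPS[m]
        r, c = r + dr, c + dc
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])) or grid[r][c] == "#":
            return -1.0
    return 1.0 if (r, c) == goal else 0.0

def group_relative_advantages(rewards):
    """Normalize each sampled solution's reward against its own group --
    the core group-relative signal GRPO optimizes, with no value model."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(x - mean) / std for x in rewards]

# Score a group of sampled completions for the same maze, then normalize.
grid, start, goal = ["S.#", ".##", "..G"], (0, 0), (2, 2)
group = [
    ["down", "down", "right", "right"],   # reaches the goal
    ["right", "right"],                   # walks into a wall
    ["down", "down", "right"],            # legal but stops short
]
rewards = [maze_reward(grid, start, goal, m) for m in group]
print(group_relative_advantages(rewards))
```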

Experimental results demonstrated a clear improvement in maze-solving accuracy. The baseline model, which lacked structured training, failed to navigate any mazes successfully. When trained using SFT, the model achieved an accuracy of 86%, demonstrating its ability to process tokenized spatial representations effectively. Further refinement using GRPO increased accuracy to 93%, highlighting the effectiveness of reinforcement learning in enhancing spatial reasoning. The model displayed emergent reasoning behaviors, including chain-of-thought decision-making and adaptive path correction. Throughout 1600 training steps, GRPO progressively optimized the model’s ability to navigate complex environments, significantly reducing invalid movement sequences and increasing problem-solving efficiency. The introduction of MazeBench, a structured evaluation framework consisting of 100 unique maze challenges, provided rigorous benchmarking. The dataset included easy, medium, and hard difficulty levels, ensuring that performance gains were assessed across varying complexity levels.
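An evaluation harness in the spirit of MazeBench could simply replay each predicted move sequence and tally solved mazes per difficulty tier. The sketch below assumes a hypothetical `model_solve` callable and a benchmark of labeled mazes; it is not the paper's evaluation code.

```python
from collections import Counter

STEPS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def replay_reaches_goal(maze, moves):
    """Replay a predicted move sequence; fail on any unknown, out-of-bounds,
    or wall-hitting step, and require ending on the target cell."""
    r, c = maze["start"]
    for m in moves:
        step = STEPS.get(m)
        if step is None:
            return False
        r, c = r + step[0], c + step[1]
        if not (0 <= r < len(maze["grid"]) and 0 <= c < len(maze["grid"][0])):
            return False
        if maze["grid"][r][c] == "#":
            return False
    return (r, c) == maze["goal"]

def evaluate(benchmark, model_solve):
    """Accuracy per difficulty tier and overall.

    benchmark: iterable of dicts with 'grid', 'start', 'goal', 'difficulty'.
    model_solve: hypothetical callable returning a list of moves for a maze.
    """
    solved, totals = Counter(), Counter()
    for maze in benchmark:
        totals[maze["difficulty"]] += 1
        moves = model_solve(maze) or []
        if replay_reaches_goal(maze, moves):
            solved[maze["difficulty"]] += 1
    per_tier = {d: solved[d] / totals[d] for d in totals}
    overall = sum(solved.values()) / max(sum(totals.values()), 1)
    return per_tier, overall
```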

Findings from this research demonstrate the viability of combining supervised learning with reinforcement optimization to improve AI-driven spatial reasoning. Using tokenized visual representations and sequential refinement enables LLMs to adapt their decision-making strategies dynamically. The study also reinforces the importance of structured input formatting in AI training processes, as models trained without specific reasoning markers showed significantly lower performance. While the framework showed substantial improvements, further refinements to reward functions and training pipelines could lead to even greater enhancements in complex problem-solving scenarios. This research presents a promising path toward equipping LLMs with advanced spatial reasoning capabilities for real-world applications by integrating structured training methodologies.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.



