Plancraft: an evaluation dataset for planning with LLM agents

cs.AI updates on arXiv.org, 20 hours ago

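This article introduces Plancraft, a Minecraft-based multi-modal evaluation dataset for LLM agents. It offers both text and image interfaces, is used to evaluate tool use and RAG, and compares the performance of LLMs and VLMs.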

arXiv:2412.21033v2 Announce Type: replace-cross

Abstract: We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and a multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval-Augmented Generation (RAG), as well as a handcrafted planner and Oracle Retriever, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and compare their performance and efficiency to a handcrafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and we offer suggestions on how to improve their capabilities.
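To make the role of the intentionally unsolvable examples concrete, here is a minimal sketch of how a mixed set of solvable and unsolvable tasks could be scored. It is an illustration under stated assumptions, not Plancraft's actual API or metric: the `Episode` record, its field names, and the `score` function are all hypothetical, and the choice of accuracy plus F1 on the "unsolvable" decision is one plausible option rather than the one the authors necessarily use.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Outcome of one task attempt (hypothetical schema, not Plancraft's)."""
    solvable: bool             # ground truth: can the target item be crafted?
    completed: bool            # the agent crafted the target item
    declared_impossible: bool  # the agent stopped and claimed the task is unsolvable


def score(episodes):
    """Return overall accuracy and F1 on the 'unsolvable' decision."""
    if not episodes:
        raise ValueError("need at least one episode")
    correct = 0
    tp = fp = fn = 0  # counts for the 'declared impossible' prediction
    for ep in episodes:
        if ep.solvable:
            # A solvable task counts only if it was actually completed.
            correct += ep.completed and not ep.declared_impossible
            fp += ep.declared_impossible
        else:
            # An unsolvable task counts only if the agent recognised it as such.
            correct += ep.declared_impossible
            if ep.declared_impossible:
                tp += 1
            else:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": correct / len(episodes), "impossible_f1": f1}
```

Under this scoring, an agent that always gives up gets full recall on unsolvable tasks but is penalised on both precision and accuracy, so completing tasks and recognising unsolvable ones are both required to score well.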

Related tags

Multi-modal evaluation dataset, LLM agents, Minecraft, RAG, performance evaluation
