MarkTechPost@AI, July 16, 2024
Planetarium: A New Benchmark to Evaluate LLMs on Translating Natural Language Descriptions of Planning Problems into Planning Domain Definition Language PDDL

Planetarium is a new benchmark for evaluating large language models (LLMs) on translating natural language descriptions of planning problems into the Planning Domain Definition Language (PDDL). The researchers found that current approaches to using LLMs for planning need improvement: even on simple planning tasks, GPT-4 reaches only about 35% accuracy. Planetarium provides a comprehensive dataset of 132,037 PDDL problems with corresponding text descriptions, spanning different levels of abstraction and scale. The benchmark also includes a broad evaluation of current LLMs in both zero-shot and fine-tuned settings, revealing the difficulty of the task. Planetarium offers a valuable tool for measuring progress in LLM-based PDDL generation and can be used for future development and evaluation.

🤔 Planetarium addresses the challenge of evaluating the accuracy and relevance of LLM translations of natural language planning problems into PDDL code. It provides a comprehensive dataset of 132,037 PDDL problems with corresponding text descriptions, spanning different levels of abstraction and scale.

🤖 Planetarium introduces a rigorous algorithm for evaluating PDDL equivalence. The algorithm converts PDDL code into scene graphs representing the initial and goal states, fully specifies the goal scene by adding all trivially true edges, and joins the initial and goal scene graphs into a problem graph.

📊 Planetarium evaluates the performance of various large language models (LLMs) on translating natural language descriptions into PDDL. The results show that GPT-4o, Mistral v0.3 7B Instruct, and Gemma 1.1 IT 2B & 7B all perform poorly in zero-shot settings, with GPT-4o achieving the highest accuracy at 35.12%.

📈 Fine-tuning significantly improves performance across all open-weight models; Mistral v0.3 7B Instruct achieves the highest accuracy after fine-tuning.

🚀 Planetarium is significant for evaluating LLMs' ability to translate natural language into PDDL for planning tasks. It addresses key technical and societal challenges, emphasizing the importance of accurate translation to prevent potential harm from misaligned results.

📝 Current performance levels, even for advanced models like GPT-4, highlight the complexity of the task and the need for further innovation. As LLM-based planning systems evolve, Planetarium provides an important framework for measuring progress and ensuring reliability.

💡 This research pushes the boundaries of AI capabilities and underscores the importance of responsible development of trustworthy AI planning systems.

Large language models (LLMs) have gained significant attention for solving planning problems, but current methodologies fall short. Direct plan generation using LLMs has shown limited success, with GPT-4 achieving only 35% accuracy on simple planning tasks. This low accuracy highlights the need for more effective approaches. Another significant challenge lies in the lack of rigorous techniques and benchmarks for evaluating the translation of natural language planning descriptions into structured planning languages, such as the Planning Domain Definition Language (PDDL).

Researchers have explored various approaches to overcome the challenges of using LLMs for planning tasks. One method involves using LLMs to generate plans directly, but this has shown limited success due to poor performance even on simple planning tasks. Another approach, “Planner-Augmented LLMs,” combines LLMs with classical planning techniques. This method frames the problem as a machine translation task, converting natural language descriptions of planning problems into structured formats like PDDL, finite state automata, or logic programming.
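As an illustration of the translation target, a hypothetical Blocks World instance (not taken from the Planetarium dataset) might pair the description "block b1 is on the table with b2 stacked on it; the goal is to reverse the stack" with PDDL along these lines:

```pddl
;; Hypothetical example pairing a natural-language description with PDDL.
(define (problem reverse-two-blocks)
  (:domain blocksworld)
  (:objects b1 b2)
  (:init (arm-empty)
         (on-table b1)
         (on b2 b1)
         (clear b2))
  (:goal (and (on b1 b2)
              (on-table b2))))
```

The difficulty the benchmark probes is that many syntactically different PDDL problems can encode this same situation, so surface-level string matching cannot judge whether a model's output is correct.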

The hybrid approach of translating natural language to PDDL utilizes the strengths of both LLMs and traditional symbolic planners: LLMs interpret natural language, while efficient symbolic planners guarantee solution correctness. However, evaluating code generation tasks, including PDDL translation, remains challenging. Existing evaluation methods, such as match-based metrics and plan validators, fall short in assessing the accuracy and relevance of generated PDDL relative to the original instructions.

Researchers from the Department of Computer Science at Brown University present Planetarium, a rigorous benchmark for evaluating LLMs' ability to translate natural language descriptions of planning problems into PDDL, addressing the challenges in assessing PDDL generation accuracy. The benchmark formally defines planning problem equivalence and provides an algorithm to check whether two PDDL problems satisfy this definition. Planetarium includes a comprehensive dataset of 132,037 ground-truth PDDL problems with corresponding text descriptions, varying in abstraction and size, and a broad evaluation of current LLMs in both zero-shot and fine-tuned settings that reveals the task's difficulty. With GPT-4 achieving only 35.1% accuracy in a zero-shot setting, Planetarium serves as a valuable tool for measuring progress in LLM-based PDDL generation and is publicly available for future development and evaluation.

The Planetarium benchmark introduces a rigorous algorithm for evaluating PDDL equivalence, addressing the challenge of comparing different representations of the same planning problem. This algorithm transforms PDDL code into scene graphs, representing both initial and goal states. It then fully specifies the goal scenes by adding all trivially true edges and creates problem graphs by joining initial and goal scene graphs.
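The scene-graph idea can be sketched in a simplified form: treat each ground atom of a PDDL state as either a node label (unary predicate) or a labeled edge (binary predicate) over the problem's objects. This is an illustrative sketch, not Planetarium's actual implementation, and the representation (plain sets rather than a graph library) is an assumption made here for brevity.

```python
# Sketch: model a PDDL state as a labeled scene graph.
# Nodes are objects; unary atoms become node labels; binary atoms
# become labeled edges. (Simplified, illustrative representation.)

def atoms_to_scene_graph(atoms):
    """Turn ground atoms like ('on', 'b2', 'b1') into (nodes, labels, edges)."""
    nodes, node_labels, edges = set(), set(), set()
    for atom in atoms:
        pred, *args = atom
        nodes.update(args)
        if not args:
            # 0-ary atoms (e.g. arm-empty) could be kept as graph-level labels;
            # skipped here for simplicity.
            continue
        if len(args) == 1:
            node_labels.add((pred, args[0]))
        elif len(args) == 2:
            edges.add((pred, args[0], args[1]))
    return nodes, node_labels, edges

# A small Blocks World initial state: b2 stacked on b1, b1 on the table.
init_atoms = [("arm-empty",), ("on-table", "b1"),
              ("on", "b2", "b1"), ("clear", "b2")]
nodes, labels, edges = atoms_to_scene_graph(init_atoms)
```

Building one such graph for the initial state and one for the (fully specified) goal state, then joining them, yields the problem graph that the equivalence check operates on.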

The equivalence check involves several steps: First, it performs quick checks for obvious non-equivalence or equivalence cases. If these fail, it proceeds to fully specify the goal scenes, identifying all propositions true in all reachable goal states. The algorithm then operates in two modes: one for problems where object identity matters, and another where objects in goal states are treated as placeholders. For problems with object identity, it checks isomorphism between combined problem graphs. For placeholder problems, it checks isomorphism between initial and goal scenes separately. This approach ensures a comprehensive and accurate evaluation of PDDL equivalence, capable of handling various representation nuances in planning problems.
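The final isomorphism step above can be illustrated with a brute-force matcher over the small labeled graphs from the previous sketch. This is a minimal sketch of that one step only (real implementations would use an efficient matcher such as VF2), assuming the `(nodes, node_labels, edges)` set representation introduced above; it is not Planetarium's algorithm.

```python
from itertools import permutations

def graphs_isomorphic(g1, g2):
    """Brute-force labeled-graph isomorphism for small scene/problem graphs.

    Each graph is (nodes, node_labels, edges): node_labels holds
    (predicate, object) pairs, edges holds (predicate, src, dst) triples.
    """
    n1, l1, e1 = g1
    n2, l2, e2 = g2
    # Quick non-equivalence check: mismatched sizes rule out isomorphism.
    if (len(n1), len(l1), len(e1)) != (len(n2), len(l2), len(e2)):
        return False
    src = sorted(n1)
    # Search for an object renaming that preserves all labels and edges.
    for perm in permutations(sorted(n2)):
        m = dict(zip(src, perm))
        if ({(p, m[o]) for p, o in l1} == l2
                and {(p, m[a], m[b]) for p, a, b in e1} == e2):
            return True
    return False

# Two encodings of the same state under renamed objects are equivalent.
g1 = ({"a", "b"}, {("clear", "b")}, {("on", "b", "a")})
g2 = ({"x", "y"}, {("clear", "y")}, {("on", "y", "x")})
same = graphs_isomorphic(g1, g2)
```

Treating goal-state objects as placeholders, as the benchmark's second mode does, corresponds to checking the initial and goal scene graphs for isomorphism separately rather than as one combined problem graph.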

The Planetarium benchmark evaluates the performance of various large language models (LLMs) in translating natural language descriptions into PDDL. Results show that GPT-4o, Mistral v0.3 7B Instruct, and Gemma 1.1 IT 2B & 7B all performed poorly in zero-shot settings, with GPT-4o achieving the highest accuracy at 35.12%. A breakdown of GPT-4o's performance reveals that abstract task descriptions are harder to translate than explicit ones, while fully explicit descriptions make it easier to generate parseable PDDL code. Fine-tuning significantly improved performance across all open-weight models, with Mistral v0.3 7B Instruct achieving the highest accuracy after fine-tuning.

This study introduces the Planetarium benchmark, which marks a significant advance in evaluating LLMs' ability to translate natural language into PDDL for planning tasks. It addresses crucial technical and societal challenges, emphasizing the importance of accurate translations to prevent potential harm from misaligned results. Current performance levels, even for advanced models like GPT-4, highlight the complexity of this task and the need for further innovation. As LLM-based planning systems evolve, Planetarium provides a vital framework for measuring progress and ensuring reliability. This research pushes the boundaries of AI capabilities and underscores the importance of responsible development in creating trustworthy AI planning systems.


Check out the Paper. All credit for this research goes to the researchers of this project.
