MarkTechPost@AI October 6, 2024
Compositional GSM: A New AI Benchmark for Evaluating Large Language Models’ Reasoning Capabilities in Multi-Step Problems

Compositional GSM is a new benchmark for evaluating the reasoning ability of large language models (LLMs) on multi-step problems. It chains two math problems together so that the answer to the first becomes a variable in the second, testing whether LLMs can handle dependencies between questions. The study finds that even models that perform well on standard benchmarks still struggle with complex problems that require multi-step reasoning. Compositional GSM offers a more comprehensive way to assess LLM reasoning and underscores the importance of developing models capable of multi-step reasoning.

😊 **The Compositional GSM benchmark**: The study introduces the Compositional GSM benchmark, which chains two math problems so that the answer to the first becomes a variable in the second, testing whether LLMs can handle dependencies between questions. The approach aims to evaluate reasoning more comprehensively, since the model must carry information from one problem to the next and solve both correctly to succeed. The researchers evaluated a range of LLMs, including open-weight models (such as LLAMA3) and closed-weight models (such as the GPT and Gemini families). They used three test sets: the original GSM8K test split, a modified GSM8K version in which some variables were substituted, and the new Compositional GSM test set, each containing 1,200 examples. Models were tested with 8-shot prompting, i.e., they were shown several worked examples before being asked to solve the compositional problems. This setup let the researchers benchmark performance comprehensively, covering both isolated and compositional problem solving.

🤔 **The reasoning gap**: The results reveal a considerable gap in reasoning ability. For example, cost-efficient models such as GPT-4o mini show a reasoning gap 2 to 12 times larger on Compositional GSM than on standard GSM8K. Likewise, math-specialized models such as Qwen2.5-MATH-72B, which achieve over 80% accuracy on high-school competition-level questions, solve fewer than 60% of the Compositional GSM problems. This sharp drop suggests that specialized math training alone is not enough to prepare models for multi-step reasoning tasks. The researchers also observed that models such as LLAMA3-8B and Mistral-7B, despite scoring well on isolated problems, decline sharply when they must link the answer of one problem to a related one.

🚀 **The impact of instruction tuning and code generation**: The researchers also examined how instruction tuning and code generation affect performance. Instruction tuning improved smaller models’ results on standard GSM8K problems but yielded only minor gains on Compositional GSM. Generating code solutions instead of natural language, on the other hand, improved some smaller models’ Compositional GSM performance by 71% to 149%. This finding suggests that while code generation helps narrow the reasoning gap, it does not eliminate it, and systematic differences in reasoning ability persist across models.

🧐 **The second-hop reasoning gap**: Analysis of the reasoning gap shows that the performance drop is not caused by test-set leakage but by distraction from the additional context and poor second-hop reasoning. For example, when models such as LLAMA3-70B-IT and Gemini 1.5 Pro were asked to use the answer of the first question to solve the second, they frequently failed to apply it accurately, producing wrong final answers. This phenomenon, called the second-hop reasoning gap, is more pronounced in smaller models, which tend to overlook crucial details when solving complex problems.

💡 **Conclusions and future directions**: The study highlights that current LLMs, regardless of their performance on standard benchmarks, still struggle with compositional reasoning tasks. The Compositional GSM benchmark introduced in this work is a valuable tool for evaluating LLM reasoning beyond isolated problem solving. The results suggest that more robust training strategies and benchmark designs are needed to strengthen these models’ compositional abilities so they can handle complex problem-solving scenarios. The research underscores the importance of reassessing existing evaluation methods and prioritizing the development of models capable of multi-step reasoning.

Natural language processing (NLP) has experienced rapid advancements, with large language models (LLMs) being used to tackle a variety of challenging problems. Among the diverse applications of LLMs, mathematical problem-solving has emerged as a benchmark for assessing their reasoning abilities. These models have demonstrated remarkable performance on math-specific benchmarks such as GSM8K, which measures their ability to solve grade-school math problems. However, there is an ongoing debate over whether these models truly comprehend mathematical concepts or merely exploit patterns in their training data to produce correct answers. This has led to a need for deeper evaluation to understand the extent of their reasoning capabilities on complex, interconnected problem types.

Despite their success on existing math benchmarks, researchers identified a critical problem: most LLMs fail to exhibit consistent reasoning when faced with more complex, compositional questions. While standard benchmarks involve solving individual problems independently, real-world scenarios often require understanding relationships between multiple problems, where the answer to one question must be used to solve another. Traditional evaluations, which focus only on isolated problem-solving, do not adequately represent such scenarios. This creates a discrepancy between high benchmark scores and LLMs’ practical usability for complex tasks requiring step-by-step reasoning and deeper understanding.

Researchers from Mila, Google DeepMind, and Microsoft Research have introduced a new evaluation method called “Compositional Grade-School Math (GSM).” This method chains two separate math problems such that the solution to the first problem becomes a variable in the second problem. Using this approach, researchers can analyze LLMs’ ability to handle dependencies between questions, a concept that is not adequately captured by existing benchmarks. The Compositional GSM method offers a more comprehensive assessment of LLMs’ reasoning capabilities by introducing linked problems that require the model to carry information from one problem to another, making it necessary to solve both correctly for a successful outcome.
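To make the chaining concrete, here is a minimal sketch of how two GSM8K-style questions can be composed so that the first answer fills a variable in the second. The problem texts and the “X” placeholder convention are illustrative assumptions, not the exact format of the released benchmark.

```python
# Minimal sketch of chaining two grade-school problems. The wording and the
# "X" placeholder are illustrative assumptions, not the benchmark's format.

question_1 = (
    "A baker sells 12 muffins in the morning and twice as many in the "
    "afternoon. How many muffins does the baker sell in total?"
)
answer_1 = 12 + 2 * 12  # first hop: 36

question_2_template = (
    "A school orders X muffins and splits them equally among 6 classes. "
    "How many muffins does each class get?"
)

def compose(q1: str, q2_template: str) -> str:
    """Chain two problems: the model must solve q1 and substitute its answer
    for X before it can solve q2, so both hops must be correct."""
    return (
        f"Question 1: {q1}\n"
        f"Question 2 (let X be the answer to Question 1): {q2_template}"
    )

print(compose(question_1, question_2_template))
# Correct end-to-end answer: first hop 36, second hop 36 / 6 = 6.
```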

The evaluation was carried out using a variety of LLMs, including open-weight models like LLAMA3 and closed-weight models like the GPT and Gemini families. The study included three test sets: the original GSM8K test split, a modified version of GSM8K where some variables were substituted, and the new Compositional GSM test set, each containing 1,200 examples. Models were tested using an 8-shot prompting method, where they were given several worked examples before being asked to solve the compositional problems. This method enabled the researchers to benchmark the models’ performance comprehensively, considering their ability to solve problems individually and in a compositional context.
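The following rough sketch shows how an 8-shot prompt of this kind might be assembled; the exemplar structure and separators are assumptions rather than the paper’s exact template.

```python
# Rough sketch of assembling an 8-shot prompt. The exemplar format and
# separators are assumptions; the paper's exact template is not reproduced.

few_shot_examples = [
    {"question": f"worked compositional example {i}",
     "solution": f"step-by-step solution {i}"}
    for i in range(8)  # eight worked examples precede the test question
]

def build_prompt(examples, test_question):
    parts = [f"Q: {ex['question']}\nA: {ex['solution']}" for ex in examples]
    parts.append(f"Q: {test_question}\nA:")  # the model completes this answer
    return "\n\n".join(parts)

prompt = build_prompt(few_shot_examples, "Question 1: ... Question 2: ...")
print(prompt)
```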

The results showed a considerable gap in reasoning abilities. For instance, cost-efficient models such as GPT-4o mini exhibited a reasoning gap on Compositional GSM that was 2 to 12 times larger than on the standard GSM8K. Further, math-specialized models like Qwen2.5-MATH-72B, which achieve above 80% accuracy on high-school competition-level questions, solved fewer than 60% of the compositional grade-school math problems. This substantial drop suggests that specialized mathematical training alone is not enough to prepare models for multi-step reasoning tasks. Furthermore, models like LLAMA3-8B and Mistral-7B, despite achieving high scores on isolated problems, declined sharply when required to link answers between related problems.
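One simple way to quantify such a gap, shown below as an illustrative assumption rather than the paper’s exact definition, is to compare the accuracy expected if the two sub-questions were independent with the accuracy actually measured on the chained problems.

```python
def reasoning_gap(acc_q1: float, acc_q2: float, acc_compositional: float) -> float:
    """Illustrative gap metric (an assumption, not necessarily the paper's
    exact definition): the accuracy expected if the two hops were independent
    (acc_q1 * acc_q2) minus the measured accuracy on chained problems."""
    return acc_q1 * acc_q2 - acc_compositional

# Toy numbers: 90% on each sub-question in isolation, but only 60% end-to-end.
print(round(reasoning_gap(0.90, 0.90, 0.60), 2))  # 0.21 -> ~21-point gap
```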

The researchers also explored the impact of instruction tuning and code generation on model performance. Instruction tuning improved results for smaller models on standard GSM8K problems but led to only minor improvements on Compositional GSM. Meanwhile, generating code solutions instead of using natural language resulted in a 71% to 149% improvement for some smaller models on Compositional GSM. This finding indicates that while code generation helps reduce the reasoning gap, it does not eliminate it, and systematic differences in reasoning capabilities persist among various models.
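The contrast between the two answer formats can be sketched as follows; the instruction wording and the sample model output are hypothetical.

```python
# Sketch contrasting the two answer formats compared in the study:
# natural-language chain-of-thought vs. an executable program. The prompt
# wording and the sample model output below are illustrative assumptions.

NL_INSTRUCTION = "Reason step by step and finish with 'The answer is <number>.'"
CODE_INSTRUCTION = "Write a Python program whose final printed value is the answer."

# A hypothetical code-style response to a chained problem. Executing it makes
# the second hop explicit: answer_2 is computed directly from answer_1.
model_program = """
answer_1 = 12 + 2 * 12       # first hop: 36
answer_2 = answer_1 // 6     # second hop: reuse the first answer
print(answer_2)
"""
exec(model_program)  # prints 6
```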

Analysis of the reasoning gaps revealed that the performance drop was not due to test-set leakage but rather to distraction caused by the additional context and poor second-hop reasoning. For example, when models like LLAMA3-70B-IT and Gemini 1.5 Pro were required to solve a second question using the answer to the first, they frequently failed to apply that answer accurately, resulting in incorrect final answers. This phenomenon, referred to as the second-hop reasoning gap, was more pronounced in smaller models, which tended to overlook crucial details when solving complex problems.
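A simple diagnostic for this failure mode, again an assumed protocol rather than the paper’s exact analysis, is to condition on problems where the first hop was solved and measure how often the final answer is still correct.

```python
# Illustrative diagnostic for second-hop failures (an assumed protocol, not
# the paper's exact analysis): among problems where the model got the first
# hop right, how often is the final answer still correct?

def second_hop_accuracy(results):
    """results: list of dicts with booleans 'q1_correct' and 'final_correct'."""
    solved_first = [r for r in results if r["q1_correct"]]
    if not solved_first:
        return 0.0
    return sum(r["final_correct"] for r in solved_first) / len(solved_first)

# Toy data: the first hop is solved 9 times out of 10, but only 6 of those 9
# are carried through correctly -> a second-hop gap despite a strong first hop.
toy = [{"q1_correct": True, "final_correct": i < 6} for i in range(9)]
toy.append({"q1_correct": False, "final_correct": False})
print(round(second_hop_accuracy(toy), 2))  # 0.67
```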

The study highlights that current LLMs, regardless of their performance on standard benchmarks, still struggle with compositional reasoning tasks. The Compositional GSM benchmark introduced in the research provides a valuable tool for evaluating the reasoning abilities of LLMs beyond isolated problem-solving. These results suggest that more robust training strategies and benchmark designs are needed to enhance the compositional capabilities of these models, enabling them to perform better in complex problem-solving scenarios. This research underscores the importance of reassessing existing evaluation methods and prioritizing the development of models capable of multi-step reasoning.


Check out the Paper. All credit for this research goes to the researchers of this project.



