MarkTechPost@AI July 21, 2024
ZebraLogic: A Logical Reasoning AI Benchmark Designed for Evaluating LLMs with Logic Puzzles

ZebraLogic is a benchmark of 1,000 logic grid puzzles used to evaluate the logical reasoning abilities of large language models (LLMs). Each puzzle presents N houses and M features, and unique values must be assigned to those features based on the given clues. The task is a constraint satisfaction problem (CSP) that tests deductive reasoning and is the kind of puzzle commonly used in assessments such as the Law School Admission Test (LSAT) to measure human logical aptitude.

🤔 The ZebraLogic benchmark consists of 1,000 programmatically generated puzzles ranging in size from 2×2 to 6×6, with 40 puzzles per size category. Large language models are tested with a one-shot example that provides reasoning steps and a JSON-formatted solution. Models are instructed to output their reasoning process first and then present their answer in the same JSON format as the example. This standardized setup allows consistent evaluation of LLMs' logical reasoning across puzzles of varying complexity.

📊 The evaluation uses two primary metrics: puzzle-level accuracy and cell-wise accuracy. For an N×M puzzle, cell-wise accuracy measures the proportion of correctly filled cells out of the N×M total cells, while puzzle-level success requires every cell to be correct. The 1,000 puzzles are split into easy and hard subsets based on size.

🤖 The results indicate that LLMs lack abilities crucial to complex logical reasoning: counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization. For most models, greedy decoding generally outperforms sampling on hard reasoning tasks, while the Gemini-1.5 models show unexpected performance patterns across decoding methods.

🧩 The puzzle creation process involves several systematic steps: 1. define features and the possible values of each; 2. establish clue types using language templates with placeholders; 3. generate a solution by randomly assigning values to a sampled grid; 4. enumerate all possible clues describing relationships among the variables; 5. iteratively remove clues through weighted sampling, ensuring the remaining clues lead to a unique solution; 6. format the puzzle with a prompting template for LLM input.

💡 Each puzzle has N houses and M features, and unique values must be assigned to the features based on the given clues. The task is a constraint satisfaction problem (CSP) used to assess deductive reasoning: through analysis and inference, a model must find the correct assignment and output it in JSON format so the answer's correctness can be checked against the puzzle's unique solution.

Large language models (LLMs) demonstrate proficiency in information retrieval and creative writing, with notable improvements in mathematics and coding. ZebraLogic, a benchmark consisting of Logic Grid Puzzles, assesses LLMs’ logical reasoning capabilities. Each puzzle presents N houses with M features, requiring unique value assignments based on given clues. This task, a Constraint Satisfaction Problem (CSP), evaluates deductive reasoning skills and is commonly employed in assessments like the Law School Admission Test (LSAT) to measure human logical aptitude.

The study presents an example, a 2×3 Logic Grid Puzzle with two houses and three features: names, car models, and animals. Clues provide essential information for deductive reasoning. Through logical analysis, the solution emerges: Eric occupies House 1, owns a Ford F150, and keeps horses. Arnold resides in House 2, drives a Tesla Model 3, and owns a cat. This example demonstrates the step-by-step reasoning process required to solve such puzzles, illustrating the logical deduction skills evaluated by the ZebraLogic benchmark.
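The CSP framing can be made concrete with a short brute-force sketch. The clue wording below is illustrative (the article does not reproduce the puzzle's clues verbatim), but the three constraints shown do pin down the published solution: Eric, the Ford F150, and the horses in House 1; Arnold, the Tesla Model 3, and the cat in House 2.

```python
from itertools import permutations

FEATURES = {
    "Name":   ["Eric", "Arnold"],
    "Car":    ["Ford F150", "Tesla Model 3"],
    "Animal": ["horse", "cat"],
}

def solve():
    solutions = []
    # Each feature is assigned to the two houses via a permutation of its values.
    for names in permutations(FEATURES["Name"]):
        for cars in permutations(FEATURES["Car"]):
            for animals in permutations(FEATURES["Animal"]):
                # grid[i] holds the feature values of house i+1
                grid = [{"Name": names[i], "Car": cars[i], "Animal": animals[i]}
                        for i in range(2)]
                # Illustrative clues (assumed; the article does not list them):
                # 1. The Ford F150 owner is directly left of the Tesla Model 3 owner.
                c1 = grid[0]["Car"] == "Ford F150" and grid[1]["Car"] == "Tesla Model 3"
                # 2. Eric is the person who owns the Ford F150.
                c2 = any(h["Name"] == "Eric" and h["Car"] == "Ford F150" for h in grid)
                # 3. The cat is not kept in House 1.
                c3 = grid[0]["Animal"] != "cat"
                if c1 and c2 and c3:
                    solutions.append(grid)
    return solutions

sols = solve()
assert len(sols) == 1  # a well-formed puzzle has exactly one satisfying assignment
print(sols[0])         # House 1: Eric, Ford F150, horse; House 2: Arnold, Tesla, cat
```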

The ZebraLogic benchmark comprises 1,000 programmatically generated puzzles, ranging from 2×2 to 6×6 in size, with 40 puzzles per size category. Large language models undergo testing using a one-shot example approach. This method includes providing reasoning steps and a JSON-formatted solution. Models are instructed to output their reasoning process first, followed by presenting their answers in the same JSON format as the provided example. This standardized approach enables consistent evaluation of LLMs’ logical reasoning abilities across various puzzle complexities.
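As a rough illustration of this one-shot setup, the sketch below builds a prompt from a hypothetical example and pulls the final JSON block out of a model response. The template text, JSON schema, and helper names (`ONE_SHOT_EXAMPLE`, `build_prompt`, `parse_answer`) are assumptions for illustration, not the benchmark's own code.

```python
import json
import re

# Hypothetical one-shot example; the benchmark's actual template and schema may differ.
ONE_SHOT_EXAMPLE = (
    "Example puzzle: ... (clues omitted) ...\n"
    "Reasoning: Clue 1 places the Ford F150 in House 1, so Eric lives there; ...\n"
    '{"solution": {"House 1": {"Name": "Eric", "Car": "Ford F150"}, '
    '"House 2": {"Name": "Arnold", "Car": "Tesla Model 3"}}}'
)

def build_prompt(puzzle_text):
    # One-shot format: show reasoning followed by a JSON answer, then request the same.
    return (
        f"{ONE_SHOT_EXAMPLE}\n\n"
        "Now solve this puzzle. Write out your reasoning first, then give your "
        "final answer in the same JSON format as the example.\n\n"
        f"{puzzle_text}"
    )

def parse_answer(model_output):
    # Grab the outermost {...} span in the response and try to parse it as JSON.
    match = re.search(r"\{.*\}", model_output, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
```

A response that yields no parseable JSON can simply be scored as incorrect.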

The evaluation employs two primary metrics: puzzle-level accuracy and cell-wise accuracy. For NxM puzzles, cell-wise accuracy measures the proportion of correctly filled cells out of NxM total cells. Puzzle-level success requires all cells to be correct. The 1,000 puzzles are categorized into easy and hard subsets based on size.
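A minimal sketch of how these two metrics could be computed for a single puzzle, assuming both the prediction and the gold answer are represented as nested dictionaries keyed by house and feature (a hypothetical representation, not necessarily the benchmark's exact format):

```python
def grade(pred: dict, gold: dict):
    """Return (cell_wise_accuracy, puzzle_solved) for one puzzle.

    Both grids are assumed to be nested dicts of the form
    {"House 1": {"Name": ..., "Car": ...}, ...} covering N houses x M features.
    """
    total = correct = 0
    for house, features in gold.items():
        for feature, value in features.items():
            total += 1
            if pred.get(house, {}).get(feature) == value:
                correct += 1
    # Puzzle-level success requires every one of the N x M cells to be correct.
    return correct / total, correct == total

gold = {"House 1": {"Name": "Eric"}, "House 2": {"Name": "Arnold"}}
pred = {"House 1": {"Name": "Arnold"}, "House 2": {"Name": "Arnold"}}
print(grade(pred, gold))  # (0.5, False)
```

Puzzle-level accuracy over the full benchmark is then the fraction of puzzles for which the second value is True.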

The probability of randomly guessing a single feature's assignment correctly is 1/N!, and the probability of guessing all cells correctly is (1/N!)^M. Logarithmic values of these probabilities are presented in a table, illustrating how difficulty increases with puzzle size. This quantifies puzzle complexity and allows LLM performance to be compared against random chance.
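The scaling can be reproduced directly from the stated formula; the snippet below is an illustrative calculation (not the paper's table) that prints the base-10 log of the random-guessing success probability for square sizes in the 2×2 to 6×6 range:

```python
import math

# (1/N!)^M is the chance of guessing every cell correctly; take log10 to tabulate.
def log10_guess_probability(n_houses: int, m_features: int) -> float:
    return -m_features * math.log10(math.factorial(n_houses))

for n in range(2, 7):
    print(f"{n}x{n}: log10 P(correct random guess) = {log10_guess_probability(n, n):.2f}")
```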

Humans solve these puzzles using strategic reasoning, employing techniques like reductio ad absurdum and elimination. LLMs demonstrate weakness in logical reasoning tasks, with Claude 3.5 Sonnet achieving 33.4% overall accuracy and 12.4% on hard puzzles. DeepSeek-v2-Chat (0628) outperforms other open-weight models. Smaller models (7-10 billion parameters) struggle significantly with hard puzzles.

Results indicate LLMs lack crucial abilities for complex logical reasoning: counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization. Greedy decoding generally outperforms sampling for most models in hard reasoning tasks. Gemini-1.5 models show unexpected performance patterns across different decoding methods.

Human performance varies by puzzle size, with solving times ranging from 15 seconds for 2×2 puzzles to 10-15 minutes for 4×4 puzzles. A demo on HuggingFace allows exploration of the data and the leaderboard.

The puzzle creation process involves several systematic steps (a rough code sketch of steps 3-5 follows the list):

1. Define features and possible values for each.

2. Establish clue types with language templates containing placeholders.

3. Generate solutions by randomly assigning values to a sampled grid.

4. Enumerate all possible clues describing variable relationships.

5. Iteratively remove clues through weighted sampling, ensuring remaining clues lead to a unique solution.

6. Format puzzles using prompting templates for LLM input.
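The generation loop, particularly steps 3-5, might look roughly like the sketch below. It is an assumed implementation, not the authors' generator: it uses a tiny 3-house, 2-feature space, only two clue types (Found_At and Not_At), and plain shuffling in place of the paper's weighted sampling, but it preserves the key invariant that a clue is only dropped if the remaining clues still admit exactly one solution.

```python
import random
from itertools import permutations, product

FEATURES = {
    "Name": ["Eric", "Arnold", "Peter"],
    "Car":  ["Ford F150", "Tesla Model 3", "Toyota Camry"],
}
N = 3  # houses

def random_solution():
    # Step 3: assign each feature's values to the N houses uniformly at random.
    return {f: random.sample(vals, N) for f, vals in FEATURES.items()}

def all_grids():
    # Every possible assignment: one permutation of values per feature.
    for combo in product(*(permutations(v) for v in FEATURES.values())):
        yield {f: list(p) for f, p in zip(FEATURES, combo)}

def candidate_clues(sol):
    # Step 4: enumerate true statements about the solution (two clue types shown here).
    clues = []
    for f, vals in sol.items():
        for house, val in enumerate(vals):
            clues.append((f"{val} is in house {house + 1}",              # Found_At
                          lambda g, f=f, h=house, v=val: g[f][h] == v))
            for other in range(N):
                if other != house:
                    clues.append((f"{val} is not in house {other + 1}",  # Not_At
                                  lambda g, f=f, h=other, v=val: g[f][h] != v))
    return clues

def count_solutions(clues):
    return sum(all(pred(g) for _, pred in clues) for g in all_grids())

def generate_puzzle():
    sol = random_solution()
    clues = candidate_clues(sol)
    random.shuffle(clues)  # stand-in for the paper's weighted sampling
    # Step 5: try dropping each clue; keep the drop only if the solution stays unique.
    for clue in list(clues):
        remaining = [c for c in clues if c is not clue]
        if count_solutions(remaining) == 1:
            clues = remaining
    return sol, [text for text, _ in clues]

solution, clue_texts = generate_puzzle()
print(len(clue_texts), "clues:", clue_texts)
```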

Clue types include: Found_At, Not_At, Same_House, Direct_Left/Right, Side_By_Side, Left/Right_Of, and One/Two_between. Each type represents a specific logical constraint, allowing for diverse and challenging puzzle configurations.
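One way to read these clue types is as predicates over a solved grid. The sketch below is an interpretation, not the benchmark's definitions: House 1 is assumed to be the leftmost house, the grid uses the list-of-houses representation from the earlier 2×3 example, and One/Two_between is assumed to mean a gap of one or two houses.

```python
# `pos` returns the 0-based house index holding a given value.
def pos(grid, value):
    for idx, features in enumerate(grid):
        if value in features.values():
            return idx
    raise ValueError(f"{value!r} not found in grid")

CLUE_TYPES = {
    "Found_At":     lambda g, a, house: pos(g, a) == house - 1,
    "Not_At":       lambda g, a, house: pos(g, a) != house - 1,
    "Same_House":   lambda g, a, b: pos(g, a) == pos(g, b),
    "Direct_Left":  lambda g, a, b: pos(g, a) == pos(g, b) - 1,
    "Direct_Right": lambda g, a, b: pos(g, a) == pos(g, b) + 1,
    "Side_By_Side": lambda g, a, b: abs(pos(g, a) - pos(g, b)) == 1,
    "Left_Of":      lambda g, a, b: pos(g, a) < pos(g, b),
    "Right_Of":     lambda g, a, b: pos(g, a) > pos(g, b),
    "One_between":  lambda g, a, b: abs(pos(g, a) - pos(g, b)) == 2,  # one house in between
    "Two_between":  lambda g, a, b: abs(pos(g, a) - pos(g, b)) == 3,  # two houses in between
}

grid = [
    {"Name": "Eric",   "Car": "Ford F150",     "Animal": "horse"},
    {"Name": "Arnold", "Car": "Tesla Model 3", "Animal": "cat"},
]
print(CLUE_TYPES["Direct_Left"](grid, "Ford F150", "Tesla Model 3"))  # True
print(CLUE_TYPES["Same_House"](grid, "Arnold", "cat"))                # True
```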

ZebraLogic, a benchmark of 1,000 Logic Grid Puzzles, evaluates LLMs’ logical reasoning abilities. These puzzles require assigning unique values to features across N houses based on given clues. The study uses puzzle-level and cell-wise accuracy metrics, comparing LLM performance to random guessing probabilities. Results show LLMs struggle with complex logical reasoning, with the best model (Claude 3.5 Sonnet) solving only 33.4% of all puzzles and 12.4% of hard puzzles. The research highlights LLMs’ deficiencies in counterfactual thinking, reflective reasoning, structured memorization, and compositional generalization. The article details the puzzle creation process and various clue types used, providing insights into the challenges of logical reasoning for AI systems.


Check out the Benchmark, GitHub, and Dataset Card. All credit for this research goes to the researchers of this project.

