Advocate for Complete Benchmarks for Formal Reasoning with Formal/Informal Statements and Formal/Informal Proofs

cs.AI updates on arXiv.org, the day before yesterday, 12:33

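This paper critically analyzes benchmarking and evaluation practices in formal reasoning and automated theorem proving. It argues that open code, open data, and complete, error-free benchmarks will accelerate progress in the field, and it discusses how to remove practices that hinder contributions as well as evaluation methods that may produce misleading information.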

arXiv:2507.04719v1 Announce Type: new Abstract: This position paper provides a critical but constructive discussion of current benchmarking and evaluation practices in the field of formal reasoning and automated theorem proving. We take the position that open code, open data, and benchmarks that are complete and error-free will accelerate progress in this field. We identify practices that create barriers to contributing to this field and suggest ways to remove them. We also discuss some of the practices that might produce misleading evaluative information. We aim to start discussions that bring together people from the various groups contributing to automated theorem proving, autoformalization, and informal reasoning.

Tags: formal reasoning, automated theorem proving, benchmarks, open data, evaluation practices
