LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

cs.AI updates on arXiv.org, 19 hours ago

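This paper proposes LongDocURL, a new benchmark for long-document understanding. It integrates 20 sub-tasks spanning long document understanding, numerical reasoning, and cross-element locating, covers more than 33,000 pages of documents, and reveals performance gaps in the field through experiments.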

arXiv:2412.18424v3 Announce Type: replace Abstract: Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks are limited to a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating. We then propose a comprehensive benchmark, LongDocURL, which integrates these three primary tasks and comprises 20 sub-tasks categorized by primary task and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
