MarkTechPost@AI · April 24, 13:55
Sequential-NIAH: A Benchmark for Evaluating LLMs in Extracting Sequential Information from Long Texts

This article introduces Sequential-NIAH, a benchmark for evaluating how well large language models (LLMs) extract sequential information from long texts. The benchmark contains synthetic, real, and open-domain question-answer data and was used to test several LLMs, including Gemini-1.5 and GPT-4o. The results show that current LLMs still struggle to extract sequential information from long texts: while the synthetic evaluation model performed extremely well, the tested models' results were unimpressive, underscoring the need for further progress in long-context understanding.

🔍 **Definition of the Sequential-NIAH benchmark**: The benchmark specifically evaluates LLMs' ability to extract sequential information from long texts of up to 128,000 tokens. It contains synthetic, real, and open-domain QA pairs for testing model performance across different scenarios.

📊 **Composition of the test data**: Sequential-NIAH comprises 14,000 samples, split into training, development, and test sets and covering both English and Chinese. The samples are produced by three QA synthesis pipelines: synthetic data (events generated in order), real data (extracted from temporal knowledge graphs), and open-domain QA (logically ordered answers).

💡 **Model performance and challenges**: In tests of Claude-3.5, GPT-4o, and other models, the synthetic evaluation model reached 99.49% accuracy, yet the tested models fared far worse. Gemini-1.5 performed best at 63.15% accuracy, with the remaining models trailing. Results varied considerably with text length, number of needles, and QA synthesis pipeline, highlighting the challenges current models face in handling long texts and sequential information.

🔬 **Experimental analysis and conclusions**: A noise analysis validated the reliability of Sequential-NIAH: minor perturbations barely affected accuracy, while larger shifts in needle positions reduced model consistency. Overall, the benchmark exposes LLMs' limitations in processing sequential information from long texts and offers a valuable reference for future research.

Evaluating how well LLMs handle long contexts is essential, especially for retrieving specific, relevant information embedded in lengthy inputs. Many recent LLMs—such as Gemini-1.5, GPT-4, Claude-3.5, Qwen-2.5, and others—have pushed the boundaries of context length while striving to maintain strong reasoning abilities. To assess such capabilities, benchmarks like ∞Bench, LongBench, and L-Eval have been developed. However, these often overlook the “Needle-in-a-Haystack” (NIAH) task, which challenges models to retrieve a few critical pieces of information from predominantly irrelevant content. Earlier benchmarks, such as RULER and Counting-Stars, offered synthetic and simplistic NIAH setups, utilizing items like passwords or symbols. NeedleBench improved this by including more realistic, semantically meaningful needles and logical reasoning questions. Yet, it still lacks tasks involving the retrieval and correct ordering of sequential information, such as timestamps or procedural steps.

Efforts to enhance LLMs’ long-context capabilities have employed methods like RoPE, ALiBi, and memory-based techniques, along with architectural changes seen in models like Mamba and FLASHBUTTERFLY. Modern LLMs now support extensive contexts—Gemini 1.5 and Kimi can process up to 1–2 million tokens. NIAH benchmarks assess how effectively models can extract relevant data from vast amounts of text, and NeedleBench further incorporates logical relationships to simulate real-world scenarios. Regarding evaluation, natural language generation (NLG) performance is typically assessed using metrics derived from LLMs, prompt-based evaluations, fine-tuned models, or human-LLM collaborations. While prompting alone often underperforms, fine-tuning and human-in-the-loop methods can greatly enhance evaluation accuracy and reliability.
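To make one of the position-encoding techniques named above concrete, below is a minimal NumPy sketch of rotary position embeddings (RoPE). The half-split pairing convention and the toy shapes are illustrative choices, not details from the paper:

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply rotary position embeddings (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), dim even. Each feature pair
    (x[:, i], x[:, i + dim // 2]) is rotated by a position-dependent angle.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # One frequency per feature pair, geometrically spaced as in the RoPE paper.
    freqs = base ** (-np.arange(half) / half)          # (half,)
    angles = np.outer(np.arange(seq_len), freqs)       # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation applied independently to each feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

# Toy usage: rotate 4 positions of 8-dimensional query vectors.
q = np.random.randn(4, 8)
q_rot = rope(q)
```

Because the rotation angle grows linearly with position, dot products between rotated queries and keys depend only on their relative offset, which is the property that long-context extensions of RoPE build on.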

Researchers from Tencent YouTu Lab have introduced Sequential-NIAH, a benchmark designed to assess how well LLMs retrieve sequential information, referred to as a needle, from long texts. The benchmark includes synthetic, real, and open-domain QA needles embedded in contexts ranging from 8K to 128K tokens, totaling 14,000 samples. A synthetic data-trained evaluation model achieved 99.49% accuracy in judging the correctness and order of responses. However, tests on six popular LLMs showed the highest performance at just 63.15%, highlighting the difficulty of the task and the need for further advancement in long-context comprehension.
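The evaluation model referenced here is trained on synthetic data to judge whether a response contains the right needles in the right order. A prompt-based judge in the same spirit might look like the sketch below; the wording is entirely hypothetical, and the paper's evaluator is a fine-tuned model rather than a prompted one:

```python
# Hypothetical judging template; the benchmark's actual evaluator is fine-tuned.
JUDGE_PROMPT = """You are grading an answer to a sequential retrieval question.

Question: {question}
Gold answer (events in correct order):
{gold}

Model answer:
{answer}

Does the model answer contain all gold events in the correct order?
Reply with exactly one word: CORRECT or INCORRECT."""

def render_judge_prompt(question: str, gold: list[str], answer: str) -> str:
    """Fill the hypothetical judging template for one benchmark sample."""
    gold_block = "\n".join(f"{i + 1}. {event}" for i, event in enumerate(gold))
    return JUDGE_PROMPT.format(question=question, gold=gold_block, answer=answer)
```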

The Sequential-NIAH benchmark is designed to evaluate models on retrieving sequentially ordered information (needles) from long texts (haystacks). It uses three types of QA synthesis pipelines: synthetic (generated events in order), real (extracted from temporal knowledge graphs), and open-domain QA (logically ordered answers). These QA pairs are inserted into diverse, long texts sourced from the LongData Corpus, covering various domains. To construct samples, the long text is segmented, needles are randomly shuffled and embedded, and the task is framed using prompt templates. The final dataset comprises 14,000 samples, split across training, development, and test sets, in both English and Chinese.
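A rough sketch of this construction recipe appears below: segment a haystack, embed the needles at shuffled positions, and wrap the result in a prompt template. The template text, the naive paragraph splitting, and all identifiers are stand-ins assumed for illustration, not the paper's actual pipeline:

```python
import random

# Hypothetical prompt template; the paper's exact wording may differ.
PROMPT_TEMPLATE = (
    "Read the document below and answer the question. List the events "
    "in the correct order.\n\nDocument:\n{haystack}\n\nQuestion: {question}\n"
)

def build_sample(haystack: str, needles: list[str], question: str,
                 seed: int = 0) -> dict:
    """Sketch of Sequential-NIAH sample construction (assumed details):
    split the haystack into segments, insert the needles at random,
    shuffled positions, and frame the task with a prompt template.
    The gold answer is the needles in their original (correct) order.
    """
    rng = random.Random(seed)
    segments = haystack.split("\n\n")      # naive paragraph-level segmentation
    shuffled = needles[:]
    rng.shuffle(shuffled)                  # needles appear out of order in context
    for needle in shuffled:
        pos = rng.randint(0, len(segments))
        segments.insert(pos, needle)
    return {
        "prompt": PROMPT_TEMPLATE.format(haystack="\n\n".join(segments),
                                         question=question),
        "gold_order": needles,             # expected sequential answer
    }

# Toy usage with three procedural-step needles.
sample = build_sample(
    haystack="Filler paragraph one.\n\nFiller paragraph two.\n\nFiller three.",
    needles=["Step 1: mix the flour.", "Step 2: add water.", "Step 3: bake."],
    question="What are the steps of the recipe, in order?",
)
```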

The evaluation model was tested against Claude-3.5, GPT-4o, and others on 1,960 samples, achieving 99.49% accuracy and outperforming GPT-4o (96.07%) and Claude-3.5 (87.09%) by significant margins. In subsequent benchmark tests on 2,000 samples, Gemini-1.5 outperformed the other models with an accuracy of 63.15%, while GPT-4o-mini and GPT-4o performed poorly. Performance varied with text length, number of needles, QA synthesis pipeline, and language, with Gemini-1.5 maintaining stable results. A noise analysis revealed that minor perturbations had a negligible impact on accuracy, but larger shifts in needle positions reduced model consistency, particularly for Qwen-2.5 and LLaMA-3.3.
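The correctness judgment and the consistency measurement used in the noise analysis can be pictured with the toy metrics below. The real benchmark scores responses with the trained evaluation model rather than exact string matching, so treat these purely as illustrative proxies:

```python
def order_correct(predicted: list[str], gold: list[str]) -> bool:
    """Strict check in the spirit of the benchmark's judge: the response
    must contain exactly the gold needles, in the gold order. The actual
    evaluator is an LLM judge; exact matching is only a stand-in here.
    """
    return predicted == gold

def consistency(answers_per_perturbation: list[list[str]]) -> float:
    """Fraction of perturbed variants whose answer matches the answer on
    the unperturbed sample, a simple proxy for the paper's noise analysis.
    """
    reference = answers_per_perturbation[0]   # answer on the original sample
    variants = answers_per_perturbation[1:]
    matches = sum(ans == reference for ans in variants)
    return matches / len(variants)

# Toy usage: the model stays consistent on 2 of 3 shifted-needle variants.
runs = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
print(consistency(runs))  # 0.666...
```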

In conclusion, the Sequential-NIAH benchmark assesses LLMs on their ability to extract sequential information from lengthy texts (up to 128,000 tokens). It includes synthetic, real, and open-domain question-answering pipelines, with 14,000 samples for training, development, and testing. A synthetic-data-trained evaluation model achieved 99.49% accuracy on the test data, yet none of the popular models tested (Claude, GPT-4o, Gemini, LLaMA, and Qwen) achieved high accuracy, with the best reaching only 63.15%. The benchmark also highlights the challenges posed by increasing context lengths and needle counts, and its validation through noise robustness tests makes it a valuable resource for advancing LLM research.


