MarkTechPost@AI 2024年12月19日
Microsoft AI Introduces SCBench: A Comprehensive Benchmark for Evaluating Long-Context Methods in Large Language Models

Researchers from Microsoft and the University of Surrey have introduced SCBench, a benchmark designed specifically to evaluate the performance of long-context large language models (LLMs) in multi-turn interaction scenarios. The benchmark centers on the KV cache lifecycle, covering four stages (generation, compression, retrieval, and loading) and evaluating models across 12 tasks and two shared-context modes (multi-turn and multi-request). The results show that traditional long-context methods perform well in single-turn interactions but struggle in multi-turn ones. SCBench aims to close this evaluation gap and to inform the development of more efficient and reliable long-context LLMs.

🚀 SCBench focuses on the KV cache lifecycle, dividing long-context LLM processing into four stages (generation, compression, retrieval, and loading) and providing a systematic framework for evaluating long-context methods.

🔄 SCBench evaluates two shared-context modes: multi-turn interaction and multi-request. Both modes mirror practical applications and give a more complete picture of how long-context LLMs perform in realistic settings.

📊 Using SCBench, the researchers evaluated six open-source long-context LLMs, including Llama-3 and GLM-4, together with eight long-context solutions. The results show that sub-O(n) methods underperform in multi-turn scenarios, while O(n) methods remain robust.

💡 Through 12 tasks spanning string and semantic retrieval, multi-tasking, and global information processing, SCBench comprehensively assesses long-context LLM capabilities, covering multiple facets of long-context processing and highlighting each model's strengths and weaknesses.

Long-context LLMs enable advanced applications such as repository-level code analysis, long-document question-answering, and many-shot in-context learning by supporting extended context windows ranging from 128K to 10M tokens. However, these capabilities come with computational efficiency and memory usage challenges during inference. Optimizations that leverage the Key-Value (KV) cache have emerged to address these issues, focusing on improving cache reuse for shared contexts in multi-turn interactions. Techniques like PagedAttention, RadixAttention, and CacheBlend aim to reduce memory costs and optimize cache utilization but are often evaluated only in single-turn scenarios, overlooking real-world multi-turn applications.
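The cache-reuse idea behind these optimizations can be sketched in a few lines. The toy below is an illustration under simplifying assumptions (a fake `encode_kv` stands in for a real forward pass, and `PrefixKVCache` is a name invented here), not the PagedAttention or RadixAttention implementation:

```python
def encode_kv(text: str) -> list[tuple[str, str]]:
    # Stand-in for a real forward pass producing per-token KV pairs.
    return [(tok, tok.upper()) for tok in text.split()]

class PrefixKVCache:
    """Cache KV states keyed by the shared context, so multi-turn
    requests skip re-encoding the (potentially huge) prefix."""
    def __init__(self):
        self.store = {}

    def get_or_build(self, context: str):
        if context not in self.store:
            self.store[context] = encode_kv(context)  # pay the O(n) cost once
        return self.store[context]

    def answer(self, context: str, question: str):
        kv = self.get_or_build(context)        # reused on later turns
        kv_turn = kv + encode_kv(question)     # only the new turn is encoded
        return len(kv), len(kv_turn)

cache = PrefixKVCache()
doc = "a long shared document " * 4
first = cache.answer(doc, "what is it")       # builds the prefix cache
second = cache.answer(doc, "summarize it")    # reuses it
assert first[0] == second[0]  # identical prefix KV across turns
```

The point SCBench probes is exactly this reuse path: whether a long-context method still works when the cached prefix is shared across turns rather than rebuilt per request.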

Efforts to improve long-context inference focus on reducing computational and memory bottlenecks during pre-filling and decoding stages. Pre-filling optimizations, such as sparse attention, linear attention, and prompt compression, reduce the complexity of handling large context windows. Decoding strategies, including static and dynamic KV compression, cache offloading, and speculative decoding, aim to manage memory constraints effectively. While these methods enhance efficiency, many rely on lossy compression techniques, which can compromise performance in multi-turn settings where prefix caching is essential. Existing conversational benchmarks prioritize single-turn evaluations, leaving a gap in assessing solutions for shared contexts in real-world scenarios.
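The lossy KV-compression idea mentioned above can be sketched with a heavy-hitter heuristic: keep the most recent tokens plus the older keys with the highest accumulated attention mass, and drop the rest. The function name and scoring below are illustrative, not any particular method's code:

```python
def compress_kv(attn_mass, budget, recent):
    """attn_mass: per-key cumulative attention mass seen so far.
    Keep the `recent` newest keys plus the (budget - recent) highest-mass
    keys among the older ones ("heavy hitters")."""
    n = len(attn_mass)
    if n <= budget:
        return list(range(n))  # nothing to drop
    recent_idx = list(range(n - recent, n))
    older = range(n - recent)
    heavy_idx = sorted(older, key=lambda i: attn_mass[i],
                       reverse=True)[: budget - recent]
    return sorted(heavy_idx + recent_idx)

mass = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
keep = compress_kv(mass, budget=5, recent=2)
# keys 6 and 7 kept as recent; 1, 3, 5 kept as heavy hitters
assert keep == [1, 3, 5, 6, 7]
```

Because the dropped keys are chosen from attention seen in the *current* turn, a follow-up question that needs a discarded region cannot recover it, which is precisely why lossy compression can fail in multi-turn settings with prefix caching.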

Researchers from Microsoft and the University of Surrey introduced SCBench, a benchmark designed to evaluate long-context methods in LLMs through a KV-cache-centric approach. SCBench assesses four stages of the KV cache lifecycle (generation, compression, retrieval, and loading) across 12 tasks and two shared-context modes (multi-turn and multi-request). The benchmark analyzes methods such as sparse attention, compression, and retrieval on models including Llama-3 and GLM-4. Results highlight that sub-O(n) memory methods struggle in multi-turn scenarios, while O(n) memory approaches perform robustly. SCBench also provides insights into sparsity effects, task complexity, and challenges such as distribution shift in long-generation scenarios.
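The two shared-context modes can be illustrated schematically. This is a hedged sketch of the distinction as described, with invented helper names rather than SCBench's actual harness:

```python
def multi_turn(context, questions, model):
    """One session: the context's KV state persists and grows across turns."""
    history = context
    answers = []
    for q in questions:
        answers.append(model(history + q))
        history += q  # earlier turns stay in the cache
    return answers

def multi_request(context, questions, model):
    """Independent requests that share only the cached context prefix."""
    return [model(context + q) for q in questions]

calls = []
def model(prompt):
    calls.append(len(prompt))  # record prompt length per call
    return f"ans:{len(prompt)}"

ctx = "C" * 100
multi_turn(ctx, ["q1", "q2"], model)      # prompts grow: 102, then 104
multi_request(ctx, ["q1", "q2"], model)   # prompts stay flat: 102, 102
assert calls == [102, 104, 102, 102]
```

In both modes the expensive prefix is shared, so any method that compresses or discards parts of the cached context during one request is exposed by the next one.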

The KV-cache-centric framework categorizes long-context methods in LLMs into four stages: generation, compression, retrieval, and loading. Generation includes techniques like sparse attention and prompt compression, while compression involves methods like KV cache dropping and quantization. Retrieval focuses on fetching relevant KV cache blocks to optimize performance, and loading involves dynamically transferring KV data for computation. The SCBench benchmark evaluates these methods across 12 tasks, including string and semantic retrieval, multi-tasking, and global processing. It analyzes performance metrics, such as accuracy and efficiency, while offering insights into algorithm innovation, including Tri-shape sparse attention, which improves multi-request scenarios.
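The A-shape and Tri-shape patterns mentioned above can be sketched as boolean attention masks. This is an illustrative reconstruction of the idea (sink tokens plus a local window, with Tri-shape additionally giving the final query segment dense rows), not the paper's implementation:

```python
def a_shape_mask(n, sink, window):
    """Causal mask keeping attention sinks (first `sink` tokens)
    plus a local window of `window` recent tokens."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):                 # causal constraint
            if j < sink or i - j < window:     # sink tokens or local window
                mask[i][j] = True
    return mask

def tri_shape_mask(n, sink, window, suffix):
    """A-shape plus dense rows for the last `suffix` positions,
    e.g. the user's question in a multi-request prompt."""
    mask = a_shape_mask(n, sink, window)
    for i in range(n - suffix, n):
        for j in range(i + 1):
            mask[i][j] = True
    return mask

m = tri_shape_mask(n=8, sink=1, window=2, suffix=2)
assert m[3][2] and not m[5][2]   # local window kept, distant token dropped
assert all(m[7])                 # last row attends to the full context
```

Letting the final query tokens see the whole context is what makes the triangular variant better suited to multi-request scenarios, where the question arrives after a long shared prefix.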

The researchers evaluated six open-source long-context LLMs, including Llama-3.1, Qwen2.5, GLM-4, Codestral-Mamba, and Jamba, representing architectures such as Transformer, SSM, and SSM-attention hybrids. Experiments used BFloat16 precision on NVIDIA A100 GPUs with frameworks such as HuggingFace Transformers, vLLM, and FlashAttention-2. Eight long-context solutions were tested, spanning sparse attention, KV cache management, and prompt compression. MInference performed best on retrieval tasks, while A-shape and Tri-shape excelled in multi-turn tasks. KV compression and prompt compression methods yielded mixed outcomes, often underperforming on retrieval tasks. SSM-attention hybrids struggled in multi-turn interactions, and gated linear models performed poorly overall.

In conclusion, the study highlights a critical gap in evaluating long-context methods, which traditionally focus on single-turn interactions, neglecting multi-turn, shared-context scenarios prevalent in real-world LLM applications. The SCBench benchmark is introduced to address this, assessing long-context methods from a KV cache lifecycle perspective: generation, compression, retrieval, and loading. It includes 12 tasks across two shared-context modes and four key capabilities: string retrieval, semantic retrieval, global information processing, and multitasking. Evaluating eight long-context methods and six state-of-the-art LLMs reveals that sub-O(n) methods struggle in multi-turn settings. In contrast, O(n) approaches excel, offering valuable insights for improving long-context LLMs and architectures.




