MarkTechPost@AI · November 10, 2024
RAGCache: Optimizing Retrieval-Augmented Generation with Dynamic Caching

RAGCache is a novel multilevel dynamic caching system designed to optimize Retrieval-Augmented Generation (RAG). RAG augments large language models (LLMs) with external knowledge, but typically incurs high computational and memory costs. By introducing a knowledge tree and a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy, RAGCache caches the intermediate states of retrieved documents, markedly raising cache hit rates and cutting redundant computation. It also uses dynamic speculative pipelining to overlap the retrieval and inference stages, lowering end-to-end latency. Experiments show that RAGCache reduces time to first token (TTFT) by up to 4× and improves throughput by 2.1×, making it well suited to large-scale, real-time RAG applications.

🤔 RAGCache is a multilevel dynamic caching system for optimizing Retrieval-Augmented Generation (RAG), built to address RAG's high computational and memory costs.

💡 RAGCache organizes the key-value tensors of cached retrieved documents in a knowledge tree and manages the cache with a prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy to minimize cache misses.

🔄 RAGCache overlaps vector retrieval with LLM inference through dynamic speculative pipelining, cutting the latency that sequential execution would otherwise incur.

🚀 Experiments show that, compared with a conventional RAG system, RAGCache reduces time to first token (TTFT) by up to 4× and raises throughput by 2.1×.

🎯 By efficiently caching frequently accessed documents, RAGCache lowers the computational burden, making it well suited to workloads with many similar retrieval requests.

Retrieval-Augmented Generation (RAG) has significantly enhanced the capabilities of large language models (LLMs) by incorporating external knowledge to provide more contextually relevant and accurate responses. However, this technique comes with a major downside: it often leads to high computational and memory costs. These challenges are primarily due to the injection of long sequences of external documents into the requests, which can expand the original sequence length by more than tenfold. As a result, the increased computational and memory requirements hinder the efficiency of RAG, posing a substantial obstacle to its scalability for real-time applications. Previous attempts to optimize LLM inference through sharing intermediate states have been useful, but they fail to fully address the unique demands of RAG, particularly those arising from long sequence generation and frequent knowledge retrieval.

A team of researchers from Peking University and ByteDance introduced RAGCache, a novel multilevel dynamic caching system specifically designed to optimize Retrieval-Augmented Generation. It tackles the inefficiencies of traditional RAG setups by introducing a knowledge tree that caches the intermediate states of retrieved documents in both GPU and host memory hierarchies. RAGCache uses a replacement policy tailored to be aware of LLM inference characteristics and RAG retrieval patterns, significantly improving cache hit rates. Additionally, the system overlaps the retrieval and inference stages, reducing end-to-end latency. This design allows RAGCache to dynamically cache and manage key-value tensors, making it the first system capable of sharing these states across multiple requests. By doing so, RAGCache reduces redundant computations and accelerates response times while also leveraging GPU and host memory in an efficient manner.
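To make the knowledge-tree idea concrete, here is a minimal Python sketch of how cached key-value states might be organized as a prefix tree over retrieved-document orderings, so that requests sharing the same leading documents can reuse the same cached prefix. The class names, fields, and the opaque `kv_handle` are illustrative assumptions, not RAGCache's actual implementation.

```python
# A minimal sketch of a knowledge tree for prefix-aware KV caching, assuming each
# request's prompt is: system prompt + retrieved documents (in order) + question.
# Names are illustrative; the real system stores actual key-value tensors and
# migrates them between GPU and host memory.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KnowledgeNode:
    doc_id: str                          # document appended at this prefix position
    tier: str = "host"                   # "gpu" for hot entries, "host" for colder ones
    kv_handle: Optional[object] = None   # opaque handle to cached key-value tensors
    children: dict = field(default_factory=dict)


class KnowledgeTree:
    """Prefix tree over retrieved-document orderings.

    Requests that retrieve the same documents in the same order share the longest
    common prefix of cached KV tensors, so only the divergent suffix is recomputed.
    """

    def __init__(self) -> None:
        self.root = KnowledgeNode(doc_id="<root>")

    def longest_cached_prefix(self, doc_ids: list) -> int:
        """Return how many leading documents already have cached KV state."""
        node, depth = self.root, 0
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.kv_handle is None:
                break
            node, depth = child, depth + 1
        return depth

    def insert(self, doc_ids: list, kv_handles: list, tier: str = "host") -> None:
        """Record a KV handle for each successive prefix of the document sequence."""
        node = self.root
        for doc_id, handle in zip(doc_ids, kv_handles):
            node = node.children.setdefault(doc_id, KnowledgeNode(doc_id=doc_id))
            node.kv_handle = handle
            node.tier = tier


if __name__ == "__main__":
    tree = KnowledgeTree()
    tree.insert(["doc_42", "doc_7"], ["kv(doc_42)", "kv(doc_42,doc_7)"], tier="gpu")
    hits = tree.longest_cached_prefix(["doc_42", "doc_7", "doc_99"])
    print(f"cached prefix length: {hits}")   # -> 2; only doc_99 needs fresh prefill
```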

RAGCache employs a knowledge tree to organize the cached key-value tensors of retrieved documents. Frequently accessed documents are stored in fast GPU memory, while less frequently accessed ones are stored in slower host memory. A core innovation of RAGCache is its prefix-aware Greedy-Dual-Size-Frequency (PGDSF) replacement policy, which carefully considers the document order, frequency, size, and recency to minimize cache misses. This design ensures that the most valuable intermediate states are retained and reused, leading to significantly reduced processing times for subsequent requests. Another key feature is dynamic speculative pipelining, which overlaps the vector retrieval and LLM inference steps, mitigating the latency caused by sequential execution. These technical improvements culminate in a system that achieves up to 4× faster time to first token (TTFT) and up to 2.1× improved throughput compared to traditional setups like vLLM integrated with Faiss.
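As a rough illustration of how such a policy could weigh these factors, the sketch below adapts the classic Greedy-Dual-Size-Frequency priority (clock + frequency × cost / size) with an additional prefix-position weight; the cost model, the exact formula, and all field names are assumptions made for illustration, not the paper's.

```python
# Hedged sketch of a Greedy-Dual-Size-Frequency (GDSF) style replacement policy.
# A prefix-aware variant additionally weights documents near the front of the
# cached prefix, since evicting an early document invalidates every cached state
# that follows it. Recency enters through the aging "clock" term.
from dataclasses import dataclass, field


@dataclass(order=True)
class CacheEntry:
    priority: float
    doc_id: str = field(compare=False)
    size_tokens: int = field(compare=False)        # KV footprint, in tokens
    recompute_ms: float = field(compare=False)     # estimated prefill cost if evicted
    frequency: int = field(compare=False, default=1)
    prefix_depth: int = field(compare=False, default=0)   # 0 = first document in the prefix


class PGDSFCache:
    def __init__(self, capacity_tokens: int) -> None:
        self.capacity = capacity_tokens
        self.used = 0
        self.clock = 0.0          # raised to each victim's priority, aging older entries
        self.entries: dict = {}

    def _priority(self, e: CacheEntry) -> float:
        prefix_weight = 1.0 / (e.prefix_depth + 1)   # earlier documents are costlier to lose
        return self.clock + e.frequency * e.recompute_ms * prefix_weight / e.size_tokens

    def access(self, doc_id: str, size_tokens: int, recompute_ms: float, prefix_depth: int) -> None:
        entry = self.entries.get(doc_id)
        if entry is None:
            self._evict_until(size_tokens)
            entry = CacheEntry(0.0, doc_id, size_tokens, recompute_ms, 1, prefix_depth)
            self.entries[doc_id] = entry
            self.used += size_tokens
        else:
            entry.frequency += 1
        entry.priority = self._priority(entry)

    def _evict_until(self, needed: int) -> None:
        while self.entries and self.used + needed > self.capacity:
            victim = min(self.entries.values())   # lowest GDSF priority goes first
            self.clock = victim.priority          # ages remaining entries relative to newcomers
            self.used -= victim.size_tokens
            del self.entries[victim.doc_id]


if __name__ == "__main__":
    cache = PGDSFCache(capacity_tokens=4096)
    cache.access("doc_42", 1500, recompute_ms=90.0, prefix_depth=0)
    cache.access("doc_7", 1500, recompute_ms=85.0, prefix_depth=1)
    cache.access("doc_42", 1500, recompute_ms=90.0, prefix_depth=0)   # frequency bump
    cache.access("doc_99", 2000, recompute_ms=120.0, prefix_depth=0)  # triggers an eviction
    print(sorted(cache.entries))
```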

The importance of RAGCache lies in its ability to make RAG more practical for real-time and large-scale use cases. In the benchmarks conducted, RAGCache was implemented on vLLM, a leading LLM inference system, alongside Faiss, a popular vector database. The results were compelling: RAGCache reduced the time to first token by up to 4× and improved throughput by 2.1× compared with vLLM using Faiss. Furthermore, when compared to SGLang, a high-performance LLM serving system, RAGCache still showed substantial improvements of up to 3.5× reduction in TTFT and 1.8× enhancement in throughput. These performance gains underscore the efficiency of multilevel caching combined with advanced retrieval and generation overlapping techniques. By ensuring that frequently accessed documents are efficiently cached, RAGCache significantly lowers computational burdens, making it ideal for scenarios that involve high volumes of similar retrieval requests.
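The retrieval/generation overlap mentioned above can be pictured with the following sketch of dynamic speculative pipelining: prefill starts on an early, approximate top-k guess while the exact vector search is still running, and the speculative work is kept only if the final results agree. Every function here is a stand-in, not a vLLM or Faiss API.

```python
# Hedged sketch of dynamic speculative pipelining: overlap LLM prefill on an
# approximate retrieval result with the full vector search, and fall back to a
# fresh prefill only on mis-speculation. Sleeps stand in for real latencies.
from concurrent.futures import ThreadPoolExecutor
import time


def approximate_topk(query: str) -> list:
    """Cheap early guess (e.g., results from the first index partitions probed)."""
    time.sleep(0.01)
    return ["doc_42", "doc_7"]


def exact_topk(query: str) -> list:
    """Full vector search; slower but authoritative."""
    time.sleep(0.05)
    return ["doc_42", "doc_7"]


def prefill(query: str, docs: list) -> str:
    """Stand-in for LLM prefill over the retrieved documents plus the question."""
    time.sleep(0.04)
    return f"KV state for {docs} + {query!r}"


def speculative_pipeline(query: str) -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        exact_future = pool.submit(exact_topk, query)       # full retrieval in background
        guess = approximate_topk(query)                      # early, possibly wrong guess
        speculative = pool.submit(prefill, query, guess)     # overlap prefill with retrieval
        final_docs = exact_future.result()
        if final_docs == guess:
            return speculative.result()       # speculation paid off: retrieval latency hidden
        speculative.cancel()                  # mis-speculation: discard and redo sequentially
        return prefill(query, final_docs)


if __name__ == "__main__":
    start = time.perf_counter()
    print(speculative_pipeline("What is RAGCache?"))
    print(f"elapsed: {time.perf_counter() - start:.3f}s")   # ~0.05s vs ~0.09s sequential
```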

RAGCache represents a transformative step in optimizing Retrieval-Augmented Generation by introducing an intelligent, multilevel caching system that reduces latency and boosts throughput. Its innovative approach to caching intermediate states across multiple requests and dynamically managing memory across GPU and host levels directly addresses the bottlenecks of current RAG systems. The experimental results show that RAGCache can provide substantial performance improvements, making it a powerful tool for scaling up RAG in practical, real-time applications. As LLMs continue to grow in complexity and size, solutions like RAGCache are critical for ensuring that these technologies can be deployed efficiently without compromising on speed or computational cost.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don’t forget to join our 55k+ ML SubReddit.

[AI Magazine/Report] Read Our Latest Report on ‘SMALL LANGUAGE MODELS’

The post RAGCache: Optimizing Retrieval-Augmented Generation with Dynamic Caching appeared first on MarkTechPost.
