MarkTechPost@AI · February 2
Researchers from Stanford, UC Berkeley and ETH Zurich Introduce WARP: An Efficient Multi-Vector Retrieval Engine for Faster and Scalable Search

Multi-vector retrieval is a key advance in information retrieval, especially since the adoption of transformer-based models. Unlike single-vector retrieval, it allows multiple embeddings per document and query, giving a finer-grained representation that improves search accuracy. However, multi-vector retrieval faces a tension between computational efficiency and retrieval performance. The WARP retrieval engine addresses this with optimizations such as dynamic similarity imputation, implicit decompression, and a two-stage reduction, significantly speeding up retrieval while preserving quality. Experimental results show that WARP outperforms existing methods in retrieval latency, index size, and performance, laying a foundation for future multi-vector search systems.

🚀 Multi-vector retrieval uses multiple embeddings per document and query, providing a finer-grained representation than single-vector retrieval and improving search accuracy and quality.

💡 The WARP engine introduces WARPSELECT, a dynamic similarity imputation method that avoids unnecessary computation, adds an implicit decompression mechanism to reduce memory operations, and accelerates scoring with a two-stage reduction process.

⏱️ Experiments show that WARP cuts end-to-end query latency by 41x compared with the XTR reference implementation, bringing query response time down from over 6 seconds to 171 milliseconds; it is also three times faster than ColBERTv2/PLAID and shrinks index size by 2-4x.

🌟 WARP marks an important step in multi-vector retrieval optimization: by combining new computational techniques with established retrieval frameworks, it improves both speed and efficiency and offers a scalable path for future multi-vector search systems.

Multi-vector retrieval has emerged as a critical advancement in information retrieval, particularly with the adoption of transformer-based models. Unlike single-vector retrieval, which encodes queries and documents as a single dense vector, multi-vector retrieval allows for multiple embeddings per document and query. This approach provides a more granular representation, improving search accuracy and retrieval quality. Over time, researchers have developed various techniques to enhance the efficiency and scalability of multi-vector retrieval, addressing computational challenges in handling large datasets.
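
To make the contrast concrete, the sketch below shows late-interaction (ColBERT-style MaxSim) scoring over token-level embeddings with NumPy. The function name and toy data are illustrative assumptions, not code from any of the systems discussed here.

```python
import numpy as np

def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    # query_embs: (num_query_tokens, dim); doc_embs: (num_doc_tokens, dim).
    # Both are assumed L2-normalized, so dot products are cosine similarities.
    sim = query_embs @ doc_embs.T          # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())    # MaxSim per query token, then sum

# Rank a few toy documents for one query.
def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
query = normalize(rng.normal(size=(8, 128)))
docs = [normalize(rng.normal(size=(n, 128))) for n in (40, 55, 30)]
scores = [late_interaction_score(query, d) for d in docs]
print(sorted(range(len(docs)), key=lambda i: -scores[i]))  # best documents first
```

Because every query token is matched against every document token, the scoring cost grows with both token counts, which is exactly the overhead the engines discussed below try to control.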

A central problem in multi-vector retrieval is balancing computational efficiency with retrieval performance. Traditional retrieval techniques are fast but often fail to capture complex semantic relationships within documents. Accurate multi-vector retrieval methods, on the other hand, suffer from high latency because they require many similarity computations. The challenge, therefore, is to build a system that preserves the desirable properties of multi-vector retrieval while reducing computational overhead enough to make real-time search feasible for large-scale applications.

Several improvements have been introduced to make multi-vector retrieval more efficient. ColBERT introduced a late-interaction mechanism that keeps query-document interactions computationally cheap. ColBERTv2 and PLAID then built on this idea with more aggressive pruning techniques and optimized C++ kernels. Concurrently, the XTR framework from Google DeepMind simplified the scoring process by removing the need for a separate document-gathering stage. However, these models still had efficiency bottlenecks, mainly in token retrieval and document scoring, which kept latency and resource usage high.

A research team from ETH Zurich, UC Berkeley, and Stanford University introduced WARP, a search engine designed to optimize XTR-based ColBERT retrieval. WARP integrates advancements from ColBERTv2 and PLAID while incorporating unique optimizations to improve retrieval efficiency. The key innovations of WARP include WARPSELECT, a method for dynamic similarity imputation that eliminates unnecessary computations, an implicit decompression mechanism that reduces memory operations, and a two-stage reduction process for faster scoring. These enhancements allow WARP to deliver significant speed improvements without compromising retrieval quality.
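
The paper's algorithms are more involved than can be shown here, but the following is a minimal sketch of the WARPSELECT idea as described above: per-query-token cluster selection with an imputed similarity standing in for clusters that are never scored. The function name, array layout, and the specific choice of imputation value (the weakest selected centroid score) are illustrative assumptions, not WARP's actual implementation.

```python
import numpy as np

def warpselect_like(query_token_embs: np.ndarray,
                    centroids: np.ndarray,
                    n_probe: int = 4):
    """For each query token, keep only the n_probe highest-scoring centroid
    clusters and return an imputed similarity for everything that is skipped."""
    sims = query_token_embs @ centroids.T      # (num_query_tokens, num_clusters)
    order = np.argsort(-sims, axis=1)          # clusters sorted best-first per token
    selected = order[:, :n_probe]              # clusters that will actually be searched
    # Use the weakest selected centroid score as the stand-in ("imputed")
    # similarity for all clusters that are never decompressed or scored.
    imputed = np.take_along_axis(sims, selected[:, -1:], axis=1).squeeze(1)
    return selected, imputed
```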

The WARP retrieval engine uses a structured optimization approach to improve retrieval efficiency. First, it encodes queries and documents with a fine-tuned T5 transformer to produce token-level embeddings. WARPSELECT then decides which document clusters are most relevant to the query, avoiding redundant similarity calculations. Instead of explicitly decompressing embeddings during retrieval, WARP performs implicit decompression, which significantly reduces memory operations. Finally, a two-stage reduction computes document scores efficiently: token-level scores are aggregated first, then summed into document-level scores, with missing similarity estimates handled dynamically. This pipeline makes WARP far more efficient than comparable retrieval engines.
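
Below is a rough sketch of that two-stage reduction, under the simplifying assumption that stage one has already produced the best token-level similarity for each (query token, document) pair found in the probed clusters. The names and the dictionary-based layout are illustrative, not WARP's optimized kernels.

```python
import numpy as np

def two_stage_reduce(token_hits: dict, imputed: np.ndarray, num_docs: int) -> np.ndarray:
    """token_hits maps (query_token_idx, doc_id) -> best token-level similarity
    found in the probed clusters (stage one). Stage two sums over query tokens
    per document, falling back to the per-token imputed estimate whenever a
    pair was never scored."""
    # Start every document from the fully-imputed baseline...
    scores = np.full(num_docs, imputed.sum())
    # ...then correct it wherever a real token-level similarity was observed.
    for (q, doc), sim in token_hits.items():
        scores[doc] += sim - imputed[q]
    return scores
```

Starting from the fully imputed baseline and correcting only the observed hits illustrates why missing-similarity estimates keep the reduction cheap: the work scales with the number of scored token-document pairs rather than with query tokens times documents.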

WARP substantially improves retrieval performance while cutting query processing time. Experimental results show that WARP reduces end-to-end query latency by 41x compared with the XTR reference implementation on LoTTE Pooled, bringing single-threaded query response times down from over 6 seconds to 171 milliseconds. WARP also achieves a threefold speedup over ColBERTv2/PLAID, and its index requires 2x-4x less storage than the baseline methods. At the same time, WARP outperforms previous retrieval models while maintaining high quality across benchmark datasets.

The development of WARP marks a significant step forward in multi-vector retrieval optimization. The research team has successfully improved both speed and efficiency by integrating novel computational techniques with established retrieval frameworks. The study highlights the importance of reducing computational bottlenecks while maintaining retrieval quality. The introduction of WARP paves the way for future improvements in multi-vector search systems, offering a scalable solution for high-speed and accurate information retrieval.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

