MarkTechPost@AI October 9, 2024
Enhancing Text Retrieval: Overcoming the Limitations with Contextual Document Embeddings

This article examines the challenges that text retrieval faces in machine learning and surveys earlier attempts at improvement, such as the development of many text embedding models and new adaptation methods. Researchers from Cornell University propose an approach to address the limitations of current text retrieval models, including a new contrastive learning objective and a new contextual architecture; the article also covers the experimental setup and results.

🎯 Traditional text retrieval methods have limitations: sparse lexical matching approaches such as BM25 struggle to capture semantic relationships and context, and while the dual-encoder architecture encodes documents and queries into a dense latent space, its ability to exploit prior corpus statistics remains limited.

💪 To improve retrieval performance, a variety of biencoder text embedding models have been developed, along with multiple efforts to adapt them to new corpora, such as unsupervised span sampling and training on the test corpus.

🌟 Researchers from Cornell University propose a way to address the limitations of current models, developing two complementary methods for creating contextual document embeddings: a new contrastive learning objective and a new contextual architecture, trained with a two-phase approach.

🎉 The method's contextual batching shows a strong correlation between batch difficulty and downstream performance; the contextual architecture improves performance on all downstream datasets, and the model achieves best-in-class results across multiple domains.

Text retrieval in machine learning faces significant challenges in developing effective methods for indexing and retrieving documents. Traditional approaches relied on sparse lexical matching methods like BM25, which use n-gram frequencies. However, these statistical models have limitations in capturing semantic relationships and context. The primary neural method, a dual encoder architecture, encodes documents and queries into a dense latent space for retrieval. However, it cannot easily make use of corpus-level statistics such as inverse document frequency (IDF). This makes neural models less adaptable to specific retrieval domains, because their embeddings depend only on the individual document rather than on the surrounding corpus, whereas statistical models incorporate that corpus context by construction.
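As a rough illustration of this gap, the toy sketch below (not the paper's code; the "encoder" is just a stand-in random projection over bag-of-words counts) scores a query against a tiny corpus in two ways: an IDF-weighted lexical score that draws on corpus-wide statistics, and a dual-encoder-style dense score computed from each document in isolation.

```python
# Toy comparison: corpus-aware lexical scoring vs. context-free dense scoring.
# Illustrative only; names, corpus, and the random-projection "encoder" are assumptions.
import math
from collections import Counter

import numpy as np

corpus = [
    "neural networks encode documents into dense vectors",
    "bm25 ranks documents by weighted term frequency",
    "contextual embeddings use neighboring documents",
]
query = "dense document embeddings"

def tokenize(text):
    return text.lower().split()

# Corpus statistics: document frequency -> inverse document frequency (IDF).
df = Counter(tok for doc in corpus for tok in set(tokenize(doc)))
idf = {tok: math.log(len(corpus) / c) for tok, c in df.items()}

def lexical_score(query, doc):
    """Simplified BM25-style score: sum IDF of query terms that appear in the doc."""
    doc_toks = set(tokenize(doc))
    return sum(idf.get(tok, 0.0) for tok in tokenize(query) if tok in doc_toks)

# Dual encoder: map query and documents into a shared dense space and rank by
# inner product. A real system would use a trained transformer encoder; note
# that each text is encoded on its own, with no access to corpus statistics.
vocab = {tok: i for i, tok in enumerate(sorted(df))}
rng = np.random.default_rng(0)
projection = rng.normal(size=(len(vocab), 16))  # stand-in "encoder" weights

def encode(text):
    counts = np.zeros(len(vocab))
    for tok in tokenize(text):
        if tok in vocab:
            counts[vocab[tok]] += 1.0
    dense = counts @ projection
    return dense / (np.linalg.norm(dense) + 1e-9)

print("lexical:", [round(lexical_score(query, d), 3) for d in corpus])
print("dense:  ", [round(float(encode(query) @ encode(d)), 3) for d in corpus])
```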

Researchers have made various attempts to address the challenges in text retrieval. Biencoder text embedding models like DPR, GTR, Contriever, LaPraDoR, Instructor, Nomic-Embed, E5, and GTE have been developed to improve retrieval performance. Some efforts have focused on adapting these models to new corpora at test time, proposing solutions such as unsupervised span-sampling, training on test corpora, and distillation from re-rankers. Moreover, other approaches include query clustering before training and considering contrastive batch sampling as a global optimization problem. Test-time adaptation techniques like pseudo-relevance feedback have also been explored, where relevant documents are used to enhance query representation.
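One of the adaptation ideas referenced above, treating contrastive batch composition as an optimization problem, can be sketched as cluster-then-batch sampling. The snippet below is an assumed illustration (placeholder embeddings, hypothetical sizes), not the procedure from any specific paper: it groups similar documents into the same batch so that in-batch negatives become harder.

```python
# Sketch of clustering-based batch construction for contrastive training.
# Placeholder data; a real pipeline would cluster learned document embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 64))  # placeholder document embeddings
batch_size = 8

# 1. Partition the corpus into clusters of roughly similar documents.
n_clusters = len(doc_embeddings) // batch_size
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)

# 2. Build each training batch from a single cluster, so in-batch negatives
#    are near-neighbors rather than random documents.
batches = []
for cluster_id in range(n_clusters):
    members = np.flatnonzero(labels == cluster_id)
    rng.shuffle(members)
    for start in range(0, len(members), batch_size):
        batch = members[start:start + batch_size]
        if len(batch) == batch_size:
            batches.append(batch)

print(f"built {len(batches)} hard batches of size {batch_size}")
```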

Researchers from Cornell University have proposed an approach to address the limitations of current text retrieval models. They argue that existing document embeddings lack context for targeted retrieval use cases, and that a document embedding should take into account both the document itself and its neighboring documents. Two complementary methods are developed to create such contextualized document embeddings. The first introduces an alternative contrastive learning objective that explicitly folds document neighbors into the intra-batch contextual loss. The second presents a new contextual architecture that directly encodes neighboring document information into the representation.
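The article does not spell out the architecture at code level, so the following is only a schematic sketch of the corpus-aware idea: a first-stage encoder condenses neighboring documents into context vectors, and a second-stage encoder embeds the target document with those context vectors prepended to its token sequence. All module names, dimensions, and pooling choices here are illustrative assumptions, not the paper's implementation.

```python
# Schematic two-stage contextual document encoder (illustrative, assumed design).
import torch
import torch.nn as nn

class ContextualDocEncoder(nn.Module):
    def __init__(self, vocab_size=30000, dim=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.first_stage = nn.TransformerEncoder(layer, n_layers)   # encodes neighbor documents
        self.second_stage = nn.TransformerEncoder(layer, n_layers)  # encodes the target document

    def encode_neighbors(self, neighbor_tokens):
        # neighbor_tokens: (num_neighbors, seq_len) -> one context vector per neighbor
        hidden = self.first_stage(self.embed(neighbor_tokens))
        return hidden.mean(dim=1)                                   # (num_neighbors, dim)

    def forward(self, doc_tokens, neighbor_tokens):
        # doc_tokens: (batch, seq_len); neighbor_tokens: (num_neighbors, seq_len)
        context = self.encode_neighbors(neighbor_tokens)            # (num_neighbors, dim)
        context = context.unsqueeze(0).expand(doc_tokens.size(0), -1, -1)
        doc = self.embed(doc_tokens)                                 # (batch, seq_len, dim)
        hidden = self.second_stage(torch.cat([context, doc], dim=1)) # prepend context vectors
        return hidden[:, context.size(1):].mean(dim=1)               # pool document positions only

encoder = ContextualDocEncoder()
docs = torch.randint(0, 30000, (4, 64))        # 4 documents, 64 tokens each
neighbors = torch.randint(0, 30000, (16, 64))  # 16 neighboring corpus documents
print(encoder(docs, neighbors).shape)           # torch.Size([4, 256])
```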

The proposed method uses a two-phase training approach: a large weakly supervised pre-training phase followed by a short supervised phase. Experiments are first run in a small setting with a six-layer transformer, a maximum sequence length of 64, and up to 64 additional contextual tokens, evaluated on a truncated version of the BEIR benchmark with various batch and cluster sizes. In the large setting, a single model is trained on sequences of length 512 with 512 contextual documents and evaluated on the full MTEB benchmark. The training data includes 200M weakly supervised data points from internet sources and 1.8M human-written query-document pairs from retrieval datasets. The model uses NomicBERT as its backbone, with 137M parameters.
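For reference, the two experimental settings described above can be summarized as simple configuration objects. The field names below are mine; the numbers and backbone are the ones reported in the article.

```python
# Summary of the reported small and large experimental settings (field names assumed).
from dataclasses import dataclass

@dataclass
class RetrievalTrainingSetting:
    encoder: str
    max_seq_len: int
    n_context: int          # contextual tokens (small setting) / contextual documents (large setting)
    eval_benchmark: str

small = RetrievalTrainingSetting(
    encoder="6-layer transformer",
    max_seq_len=64,
    n_context=64,
    eval_benchmark="truncated BEIR",
)
large = RetrievalTrainingSetting(
    encoder="NomicBERT backbone (137M parameters)",
    max_seq_len=512,
    n_context=512,
    eval_benchmark="full MTEB",
)
print(small, large, sep="\n")
```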

The contextual batching approach demonstrated a strong correlation between batch difficulty and downstream performance: harder batches in contrastive learning lead to better gradient approximation and more effective learning. The contextual architecture improved performance across all downstream datasets, with notable gains on smaller, out-of-domain datasets such as ArguAna and SciFact. The model reaches its best performance when trained at full scale for four epochs on the BGE meta-datasets. The resulting model, "cde-small-v1", obtained state-of-the-art results on the MTEB benchmark compared to same-size models, with improved embedding performance across multiple task types such as clustering, classification, and semantic similarity.

In this paper, researchers from Cornell University have proposed a method to address the limitations of current text retrieval models. The paper makes two significant improvements to traditional "biencoder" models for generating embeddings. The first is an algorithm for reordering training data points to create more challenging batches, which improves on vanilla training with minimal modifications. The second is a corpus-aware retrieval architecture that enables training of a state-of-the-art text embedding model. This contextual architecture effectively incorporates neighboring document information, addressing the limitations of context-independent embeddings.


Check out the Paper. All credit for this research goes to the researchers of this project.

The post Enhancing Text Retrieval: Overcoming the Limitations with Contextual Document Embeddings appeared first on MarkTechPost.

