MarkTechPost@AI, November 4, 2024
pEBR: A Novel Probabilistic Embedding based Retrieval Model to Address the Challenges of Insufficient Retrieval for Head Queries and Irrelevant Retrieval for Tail Queries

This article introduces embedding-based retrieval, whose goal is to create a shared semantic space in which queries and items are represented as dense vectors, enabling effective matching based on semantic similarity. It also discusses the limitations of existing retrieval systems and the probabilistic Embedding-Based Retrieval (pEBR) model proposed to overcome them; pEBR dynamically adjusts the retrieval process according to each query's distribution of relevant items, improving both recall and precision. The article closes with a summary of the research's main contributions.

🎯 Embedding-based retrieval aims to create a shared semantic space in which queries and items are represented as dense vectors and matched effectively by semantic similarity; approximate nearest neighbor methods improve the speed and efficiency of finding relevant items in large datasets.

💡 In most industrial applications, existing retrieval systems return a fixed number of items per query, which has limitations: popular head queries may need a broader result set, while specific tail queries may return too many irrelevant results. This is partly because the commonly used frequentist techniques produce loss functions that ignore differences between query types.

🚀 To overcome these limitations, the research team proposes the pEBR model, a probabilistic approach that replaces the frequentist one. It dynamically adjusts the retrieval process according to each query's distribution of relevant items, using a probabilistic cumulative distribution function to set a dynamic cosine-similarity threshold for each query, improving both recall and precision.

📋 The team summarizes its main contributions: introducing the standard approach to embedding-based retrieval, describing the fundamental techniques used in retrieval systems, proposing loss functions that improve retrieval performance, demonstrating the method's effectiveness through experiments, and using ablation studies to understand how each model component affects overall performance.

The main goal of embedding-based retrieval is to create a common semantic space in which queries and items are represented as dense vectors. Instead of depending on exact keyword matches, this method enables effective matching based on semantic similarity: because queries and items are embedded in the same space, semantically related entries are positioned close to one another. This representation also makes it possible to use Approximate Nearest Neighbor (ANN) methods, which greatly improve the speed and efficiency of locating relevant items within large datasets.
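As a minimal sketch of this idea, the snippet below builds a toy shared space of unit-norm vectors and retrieves the most similar items by cosine similarity. The embeddings are random stand-ins for what a trained encoder would produce; a real system would replace the brute-force dot product with an ANN index.

```python
import numpy as np

# Toy shared semantic space: rows are L2-normalized item embeddings.
# In practice these would come from a trained encoder; random here.
rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 64))
items /= np.linalg.norm(items, axis=1, keepdims=True)

query = rng.normal(size=64)
query /= np.linalg.norm(query)

# With unit vectors, cosine similarity reduces to a dot product,
# so exact retrieval is a single matrix-vector multiply.
scores = items @ query
top_k = np.argsort(-scores)[:10]  # indices of the 10 most similar items
```

An ANN library would trade the exact `argsort` for an approximate index lookup, which is what makes this tractable at the scale of millions of items.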

In most industrial applications, retrieval systems are designed to return a fixed number of items per query. This uniform strategy has limitations. Popular, or head, queries, such as those about well-known products, may need a broader set of results to capture the full range of relevant items; a fixed cutoff can leave some of them out, lowering recall. Conversely, for more focused tail queries, which usually have fewer relevant items, the system may return too many irrelevant results, lowering precision. This difficulty is partly due to the common use of frequentist techniques for constructing loss functions, which often fail to account for the variation among query types.
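The head/tail tension can be made concrete with a small simulation. The numbers below (50 relevant items for the head query, 3 for the tail query, a shared cutoff of k=20) are hypothetical, chosen only to illustrate the arithmetic: with one fixed k, head-query recall is capped at k divided by the number of relevant items, and tail-query precision is capped at the number of relevant items divided by k.

```python
import numpy as np

def recall_precision_at_k(scores, relevant, k):
    """Recall and precision when exactly k items are returned."""
    retrieved = set(np.argsort(-scores)[:k])
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / k

rng = np.random.default_rng(1)
n = 1000

def make_scores(relevant_ids):
    # Relevant items score clearly higher than the rest (synthetic data).
    s = rng.normal(0.2, 0.1, size=n)
    s[list(relevant_ids)] = rng.normal(0.8, 0.05, size=len(relevant_ids))
    return s

head_relevant = set(range(50))  # head query: many relevant items
tail_relevant = set(range(3))   # tail query: few relevant items
k = 20                          # one fixed cutoff for both

head_r, head_p = recall_precision_at_k(make_scores(head_relevant), head_relevant, k)
tail_r, tail_p = recall_precision_at_k(make_scores(tail_relevant), tail_relevant, k)
# Head query: recall can never exceed 20/50 = 0.4, however good the scores.
# Tail query: precision can never exceed 3/20 = 0.15.
```

No ranking model, however accurate, can escape these caps while k stays fixed; only a per-query cutoff can.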

To overcome these limitations, a team of researchers has introduced probabilistic Embedding-Based Retrieval (pEBR), which replaces the frequentist approach with a probabilistic one. Instead of handling every query in the same way, pEBR dynamically adjusts the retrieval procedure according to the distribution of relevant items underlying each query. In particular, pEBR uses a probabilistic cumulative distribution function (CDF) to determine a dynamic cosine-similarity threshold customized for each query. By modeling the likelihood of relevant items for each query, the retrieval system can set adaptive thresholds that better meet each query's requirements, capturing more relevant items for head queries and filtering out irrelevant ones for tail queries.

According to the team's experimental findings, this probabilistic method improves both recall, i.e., the comprehensiveness of results, and precision, i.e., the relevance of results. Furthermore, ablation tests, which systematically remove model components to assess their effects, have shown that pEBR's effectiveness depends largely on its capacity to adaptively differentiate between head and tail queries. By capturing the distinct distribution of relevant items for each query, pEBR overcomes the drawbacks of fixed cutoffs and offers a more accurate, adaptable retrieval experience across diverse query patterns.

The team has summarized their primary contributions as follows. 

    The two-tower paradigm, in which queries and items are represented in the same semantic space, is introduced as the conventional method for embedding-based retrieval.
    Popular point-wise and pair-wise loss functions in retrieval systems are characterized as fundamental techniques.
    Loss functions based on contrastive learning and maximum likelihood estimation are proposed to improve retrieval performance.
    Experiments demonstrate the usefulness of the proposed approach, revealing notable gains in retrieval accuracy.
    Ablation studies examine the model's components to understand how each affects overall performance.
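To ground the third contribution, here is a generic in-batch contrastive (InfoNCE-style) loss of the kind commonly used to train two-tower retrieval models. This is a standard formulation, not the authors' exact loss, whose details the article does not specify: the relevant item is the positive, the other items in the batch serve as negatives, and the loss is the negative log-softmax score of the positive.

```python
import numpy as np

def info_nce_loss(q, items, pos_idx, temperature=0.05):
    """In-batch softmax contrastive loss for one query.
    q: (d,) query embedding; items: (n, d) item embeddings, L2-normalized;
    pos_idx: index of the relevant item; remaining rows act as negatives."""
    logits = items @ q / temperature   # scaled cosine similarities
    logits -= logits.max()             # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos_idx]
```

Minimizing this pulls the query embedding toward its relevant item and pushes it away from the in-batch negatives, which is what places related queries and items near each other in the shared space.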

Check out the Paper. All credit for this research goes to the researchers of this project.


