MarkTechPost@AI — September 7, 2024
LongBench-Cite and LongCite-45k: Leveraging CoF (Coarse to Fine) Pipeline to Enhance Long-Context LLMs with Fine-Grained Sentence-Level Citations for Improved QA Accuracy and Trustworthiness


Large language models (LLMs) have become fundamental tools for tasks such as question-answering (QA) and text summarization. These models excel at processing long and complex texts, with capacities reaching over 100,000 tokens. As LLMs grow more popular for handling long-context tasks, ensuring their reliability and accuracy becomes more pressing. Users rely on LLMs to sift through vast information and provide concise, correct answers. However, many models suffer from the problem of “hallucination,” where they generate information that is unsupported by the provided text. This limitation significantly affects user trust in these models, as the absence of specific, verifiable citations makes it difficult to confirm the correctness of the answers.

A significant challenge in long-context LLMs is their inability to provide fine-grained citations directly linked to specific text parts. Users often find it difficult to trust LLM-generated answers because the models either fail to provide citations altogether or offer citations that refer broadly to entire text sections rather than pinpointing the exact pieces of information supporting the response. This lack of specificity means that even if the answer is accurate, the user must manually search through large chunks of text to verify its correctness. A system that offers precise, sentence-level citations is therefore crucial for improving the verifiability and trustworthiness of long-context LLMs.

Existing citation methods, though somewhat effective, still have limitations. Some models employ chunk-level citation techniques, where broad text sections are referenced. While useful for reducing the amount of searching required by users, these chunk-based methods do not go far enough in providing the level of detail needed for accurate verification. Other methods include retrieval-augmented generation (RAG) and post-processing systems, where citations are added after the response is generated. However, due to their multi-step processes, these techniques often degrade answer quality and slow response times. Moreover, the citations provided by these systems are frequently too broad, making them ineffective for users seeking to locate specific supporting information within large documents.

Tsinghua University and Zhipu AI researchers introduced a novel approach to address these limitations through a method called CoF (Coarse to Fine). CoF is designed to generate highly detailed, sentence-level citations, improving the precision and usability of LLM-generated answers. The research team proposed this system as a solution to the problem of broad, imprecise citations, offering a refined approach that provides users with citations linked to specific sentences rather than large text sections. To assess the performance of LLMs in long-context question answering (LQAC), they also developed LongBench-Cite. This automatic benchmark evaluates LLMs’ performance when generating citations from large text corpora. LongBench-Cite revealed significant room for improvement in current models, as many of the citations generated by LLMs were irrelevant or too broadly applied. To test the effectiveness of the new approach, the team built LongCite-45k, a dataset consisting of 44,600 QA pairs with detailed, fine-grained citations. This dataset allows LLMs to train on tasks that require accurate and precise citations, addressing a critical gap in current long-context QA models.
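To make the dataset's role concrete, a single training example in a fine-grained-citation QA corpus might pair a question, an answer, and sentence-level citations into the source document. The sketch below is a hypothetical record layout; the field names and structure are illustrative assumptions, not LongCite-45k's actual schema.

```python
import json

# Hypothetical layout for one fine-grained-citation QA example; the field
# names are illustrative assumptions, not LongCite-45k's actual schema.
example = {
    "context": "The system was first released in 2023. It supports long inputs.",
    "query": "When was the system released?",
    "answer": "It was released in 2023 [1].",
    "citations": [
        {"id": 1, "chunk_id": 0, "sentence": "The system was first released in 2023."}
    ],
}

print(json.dumps(example, indent=2))
```

Training on records like this teaches the model to emit citation markers (e.g., `[1]`) that resolve to specific supporting sentences rather than whole sections.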

The CoF system functions through steps designed to refine citation accuracy. The process begins with the LLM generating the query and the corresponding answer based on the provided long text. This initial step ensures that the model works with a fully contextualized understanding of the document. Next, the CoF system retrieves relevant chunks of text from the original document, each consisting of 128 tokens. These chunks are then linked to the model’s answer through coarse-grained citations. Finally, the system refines these citations by identifying and extracting the specific sentences within the chunks that directly support the answer. Any answers that lack sufficient citation support are filtered out. This multi-stage approach allows the CoF system to produce responses with precise, sentence-level citations, significantly improving user trust and citation accuracy.

This research demonstrates that CoF-trained models, LongCite-8B and LongCite-9B, outperform existing proprietary models, such as GPT-4, regarding citation quality and granularity. Specifically, LongCite-8B and LongCite-9B achieved a 6.4% and 3.6% improvement over GPT-4 in terms of citation F1 score, a metric used to measure citation accuracy. The average citation length for the LongCite models was also notably shorter than that of proprietary models, further highlighting the precision of the CoF approach. LongCite-8B, for example, generated citations with an average length of 86 tokens, compared to GPT-4’s average of 169 tokens. This level of granularity allows users to locate the specific text supporting the model’s answers more easily. The CoF system reduces the occurrence of hallucinations, as it enables models to more uniformly use all the context available, ensuring that responses are more grounded in the original text.
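The citation F1 score mentioned above balances precision (how many generated citations are correct) against recall (how many needed citations were produced). The function below is a simple set-overlap variant for illustration; the paper's exact scoring of citation F1 may differ.

```python
def citation_f1(predicted: set, gold: set) -> float:
    """Harmonic mean of citation precision and recall.

    A simple set-overlap sketch; the paper's exact citation F1
    definition may differ.
    """
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)       # predicted citations that match the gold set
    precision = tp / len(predicted)  # fraction of predicted citations that are correct
    recall = tp / len(gold)          # fraction of gold citations that were produced
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under a metric of this shape, citing an entire section inflates the denominator of precision, which is why the shorter, sentence-level citations of the LongCite models score higher than broad chunk references.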

In conclusion, this research provides a critical advancement in the field of long-context LLMs by addressing a long-standing issue with citation precision. The introduction of LongBench-Cite to assess LLMs’ citation performance, combined with the CoF system and the LongCite-45k dataset, represents a significant step forward in improving the trustworthiness and verifiability of LLM-generated responses. The researchers have enabled LLMs to produce more accurate, reliable answers by focusing on sentence-level citations rather than broad text chunks. The improvements seen in the LongCite-8B and LongCite-9B models demonstrate the effectiveness of this approach, with these models surpassing even the most advanced proprietary systems in citation accuracy. This advancement enhances the performance of long-context QA systems and contributes to the broader goal of making LLMs more dependable tools for information retrieval and question-answering tasks.


