MarkTechPost@AI — July 3, 2024
Researchers at the University of Wisconsin-Madison Propose a Finetuning Approach Utilizing a Carefully Designed Synthetic Dataset Comprising Numerical Key-Value Retrieval Tasks

 

Researchers at the University of Wisconsin-Madison have proposed a new finetuning approach that uses a carefully designed synthetic dataset to address the challenges large language models (LLMs) face with long-context inputs. The dataset consists of numerical key-value retrieval tasks intended to improve LLMs' ability to handle long texts. By using synthetic data, the researchers aim to improve LLMs' information retrieval and reasoning capabilities without introducing hallucinations.

😁 **The long-context problem:** The researchers observe that LLMs often struggle to retrieve relevant information from long input contexts, exhibiting "lost-in-the-middle" behavior. This limits their effectiveness on tasks that require processing and reasoning over long passages, such as multi-document question answering (MDQA) and flexible length question answering (FLenQA). The conventional remedy is to finetune on real-world datasets, but these often contain outdated or irrelevant information, which can cause hallucinations and other inaccuracies.

🤔 **Why synthetic data:** The team proposes a new finetuning approach built on a carefully designed synthetic dataset of numerical key-value retrieval tasks, intended to strengthen LLMs' handling of long contexts. Because the data is synthetic, it avoids the pitfalls of real-world datasets, such as outdated or irrelevant information, improving retrieval and reasoning while reducing hallucinations. The dataset consists of simple dictionary key-value retrieval tasks, each involving multiple dictionaries with a few keys apiece. For instance, the dataset for Mistral 7B contains 350 samples, each with 85 dictionaries, yielding prompts of roughly 3,900 tokens. Finetuning is applied only to the answer portion of each task, with the other elements masked out to focus the model's learning.

🚀 **Strong results:** Experiments show that this approach significantly improves LLM performance on long-context tasks. For example, finetuning GPT-3.5 Turbo on the synthetic data yielded a 10.5% improvement at the tenth position on the 20-document MDQA benchmark. The method also mitigates the "lost-in-the-middle" phenomenon and reduces primacy bias, producing more accurate information retrieval across the entire input context. Compared with models finetuned on real-world datasets, the synthetic approach maintained more consistent accuracy across context positions. The study thus introduces an innovative way to finetune LLMs with synthetic data, highlighting the potential of synthetic datasets to overcome the limitations of real-world data and paving the way for LLMs that are more effective and reliable when handling large volumes of text.

The research paper addresses a critical issue in the performance of large language models (LLMs) on long-context inputs: LLMs often struggle to retrieve relevant information from the middle of long input contexts, exhibiting a "lost-in-the-middle" behavior. Specifically, models like GPT-3.5 Turbo and Mistral 7B often fail to retrieve information accurately or to maintain reasoning capabilities across extensive textual data. This limitation hampers their effectiveness in tasks that require processing and reasoning over long passages, such as multi-document question answering (MDQA) and flexible length question answering (FLenQA).

Current methods to enhance the performance of LLMs in long-context settings typically involve finetuning on real-world datasets. However, these datasets often include outdated or irrelevant information, which can lead to hallucinations and other inaccuracies. Benchmarks such as MDQA and FLenQA have shown that LLMs tend to exhibit this "lost-in-the-middle" behavior, where performance is strongest for information at the beginning or end of the input context but deteriorates for information in the middle.

A team of researchers from the University of Wisconsin-Madison proposes a novel finetuning approach utilizing a carefully designed synthetic dataset to address these challenges. This dataset comprises numerical key-value retrieval tasks designed to enhance the LLMs’ ability to handle long contexts more effectively. By using synthetic data that avoids the pitfalls of outdated or irrelevant information, the researchers aim to improve LLMs’ information retrieval and reasoning capabilities without introducing hallucinations.

The proposed synthetic dataset consists of simple dictionary key-value retrieval tasks, where each task involves multiple dictionaries with a few keys each. For instance, the dataset for Mistral 7B includes 350 samples, each containing 85 dictionaries, resulting in prompts with roughly 3900 tokens. Finetuning is conducted on the answer part of these tasks, masking out other elements to focus the model’s learning process.
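A minimal sketch of how one such synthetic sample might be constructed, and how the answer-only loss masking could look in practice. Function names, key ranges, and the prompt wording are illustrative assumptions; only the overall shape (many small dictionaries, a single queried key, loss restricted to the answer tokens) comes from the paper's description:

```python
import random

def make_sample(num_dicts=85, keys_per_dict=4, seed=0):
    """Build one synthetic key-value retrieval sample.

    Each dictionary maps random integer keys to random integer values.
    Keys are unique across all dictionaries so the queried key has
    exactly one answer.
    """
    rng = random.Random(seed)
    dicts, used = [], set()
    for _ in range(num_dicts):
        d = {}
        while len(d) < keys_per_dict:
            k = rng.randrange(10**5)
            if k not in used:
                used.add(k)
                d[k] = rng.randrange(10**5)
        dicts.append(d)

    target_dict = rng.randrange(num_dicts)
    target_key = rng.choice(list(dicts[target_dict].keys()))
    answer = str(dicts[target_dict][target_key])

    lines = [f"dict_{i} = {d}" for i, d in enumerate(dicts)]
    prompt = "\n".join(lines) + f"\nWhat is the value of key {target_key}?"
    return prompt, answer

def masked_labels(prompt_ids, answer_ids, ignore_index=-100):
    """Labels for finetuning: only answer tokens contribute to the loss.

    In a typical causal-LM setup the prompt positions are set to the
    loss-ignore index (-100 by convention) so gradients come only from
    the answer span.
    """
    return [ignore_index] * len(prompt_ids) + list(answer_ids)

prompt, answer = make_sample()
```

Masking everything but the answer is what concentrates the model's learning on the retrieval step itself, rather than on reproducing the dictionary text.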

Experiments demonstrate that this approach significantly enhances the performance of LLMs in long-context tasks. For example, finetuning GPT-3.5 Turbo on the synthetic data resulted in a 10.5% improvement on the 20-document MDQA benchmark at the tenth position. Moreover, the method mitigates the "lost-in-the-middle" phenomenon and reduces primacy bias, leading to more accurate information retrieval across the entire input context. The performance of models finetuned on the synthetic data was compared against that of models finetuned on real-world datasets, with the synthetic approach showing superior results in maintaining consistent accuracy across different context positions.
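The positional evaluation behind figures like "10.5% improvement at the tenth position" can be pictured as sweeping the single gold document through the distractor stack and measuring accuracy at each slot. A sketch of the prompt construction, with document contents and wording as placeholders rather than the paper's exact format:

```python
def build_mdqa_prompt(gold_doc, distractors, position, question):
    """Insert the gold document at a 0-based position among distractors,
    mimicking the 20-document MDQA setup where accuracy is reported as a
    function of where the gold document sits in the context."""
    docs = list(distractors)
    docs.insert(position, gold_doc)
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}"

# Gold document at the tenth position among 20 documents total.
prompt = build_mdqa_prompt(
    "GOLD", [f"distractor {i}" for i in range(19)], 9, "Which doc is gold?"
)
```

Plotting accuracy against `position` is what exposes the U-shaped "lost-in-the-middle" curve; a flatter curve after finetuning indicates the mitigation described above.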

The study introduces an innovative approach to finetuning LLMs using synthetic data, significantly enhancing their performance in long-context settings. The proposed method demonstrates substantial improvements over traditional finetuning techniques by addressing the “lost-in-the-middle” phenomenon and reducing primacy bias. This research highlights the potential of synthetic datasets in overcoming the limitations of real-world data, paving the way for more effective and reliable LLMs in handling extensive textual information.


Check out the Paper. All credit for this research goes to the researchers of this project.


