MarkTechPost@AI July 5, 2024
Rethinking QA Dataset Design: How Popular Knowledge Enhances LLM Accuracy?


Large language models (LLMs) have gained significant attention for their ability to store vast amounts of factual knowledge within their weights during pretraining. This capability has led to promising results in knowledge-intensive tasks, particularly factual question-answering. However, a critical challenge persists: LLMs often generate plausible but incorrect responses to queries, undermining their reliability. This inconsistency in factual accuracy poses a significant hurdle in the widespread adoption and trust of LLMs for knowledge-based applications. Researchers are grappling with the challenge of improving the factuality of LLM outputs while maintaining their versatility and generative capabilities. The problem is further complicated by the observation that even when LLMs possess the correct information, they may still produce inaccurate answers, suggesting underlying issues in knowledge retrieval and application.

Researchers have attempted various approaches to improving factuality in LLMs. Some studies focus on the impact of unfamiliar examples during fine-tuning, revealing that these can worsen factuality through overfitting. Others examine the reliability of factual knowledge, showing that LLMs often underperform on obscure information. Techniques to enhance factuality include manipulating attention mechanisms, using unsupervised internal probes, and developing methods for LLMs to abstain from answering uncertain questions. Some researchers have introduced fine-tuning techniques that encourage LLMs to refuse questions outside their knowledge boundaries. Studies have also investigated LLM mechanisms and training dynamics, examining how facts are stored and extracted and analyzing pretraining dynamics of syntax acquisition and attention patterns. Despite these efforts, consistent factual accuracy remains elusive.

In this study, researchers from the Department of Machine Learning at Carnegie Mellon University and the Department of Computer Science at Stanford University found that the impact of fine-tuning examples on LLMs depends critically on how well the facts are encoded in the pre-trained model. Fine-tuning on well-encoded facts significantly improves factuality, while using less well-encoded facts can harm performance. This phenomenon occurs because LLMs can either use memorized knowledge or rely on general “shortcuts” to answer questions. The composition of fine-tuning data determines which mechanism is amplified. Well-known facts reinforce the use of memorized knowledge, while less familiar facts encourage shortcut usage. This insight provides a new perspective on improving LLM factuality through strategic selection of fine-tuning data.
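
The paper's notion of how well a fact is encoded suggests a simple proxy: the pretrained model's own likelihood of producing the correct answer. The sketch below (Python, using the Hugging Face transformers API) scores candidate QA facts by the log-probability the base model assigns to the gold answer; the checkpoint name, prompt format, and use of a mean log-probability are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch: score how confidently a pretrained LM "knows" a fact via the
# log-probability it assigns to the gold answer. This is an illustrative proxy
# for well- vs. poorly-encoded facts, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def answer_logprob(question: str, answer: str) -> float:
    """Mean log-prob of the answer tokens given the question prompt."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, add_special_tokens=False,
                           return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    logits = model(input_ids).logits
    # Logits at position t predict token t + 1, so slice out the answer span.
    start = prompt_ids.shape[1] - 1
    answer_logits = logits[0, start:start + answer_ids.shape[1]]
    logprobs = torch.log_softmax(answer_logits.float(), dim=-1)
    token_lp = logprobs.gather(1, answer_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()

# Facts the base model already encodes well get high scores and, per the paper,
# make better fine-tuning examples than low-scoring ("unfamiliar") ones.
score = answer_logprob("Q: What is the capital of France? A:", "Paris")
```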

The method utilizes a synthetic setup to study the impact of fine-tuning data on LLM factuality. This setup simulates a simplified token space for subjects, relations, and answers, with different formatting between pretraining and downstream tasks. Pretraining samples are drawn from a Zipf distribution for subjects and a uniform distribution for relations. Key findings reveal that fine-tuning popular facts significantly improves factuality, with effects amplified for less popular entities. The study examines the influence of the Zipf distribution parameter and pretraining steps on this phenomenon. These observations lead to the concept of “fact salience,” representing how well a model knows a fact, which influences fine-tuning behavior and downstream performance. This synthetic approach allows for a controlled investigation of pretraining processes that would be impractical with real large language models.
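
As a rough illustration of that synthetic setup, the sketch below builds a fact corpus in which subjects follow a Zipf law, relations are uniform, and each (subject, relation) pair is bound to a single answer token; the vocabulary sizes and Zipf exponent are placeholder values, not the paper's settings.

```python
# Sketch of a synthetic fact corpus in the spirit of the paper's setup:
# subjects drawn from a Zipf distribution, relations drawn uniformly,
# each (subject, relation) pair bound to one answer token.
import numpy as np

rng = np.random.default_rng(0)
num_subjects, num_relations, num_answers = 1000, 20, 500
zipf_alpha = 1.5  # assumed exponent; the paper studies how this parameter matters

# Zipf weights over a finite subject vocabulary: p(rank i) proportional to 1 / i^alpha.
ranks = np.arange(1, num_subjects + 1, dtype=float)
subject_probs = ranks ** (-zipf_alpha)
subject_probs /= subject_probs.sum()

# Ground-truth mapping: every (subject, relation) has one fixed answer.
answer_table = rng.integers(0, num_answers, size=(num_subjects, num_relations))

def sample_pretraining_facts(n: int) -> np.ndarray:
    """Draw n (subject, relation, answer) triples for synthetic pretraining."""
    subjects = rng.choice(num_subjects, size=n, p=subject_probs)
    relations = rng.integers(0, num_relations, size=n)
    answers = answer_table[subjects, relations]
    return np.stack([subjects, relations, answers], axis=1)

corpus = sample_pretraining_facts(100_000)
# Popular subjects (low ranks) appear many times, so their facts become salient;
# rare subjects may be seen once or never, mirroring the contrast between
# well- and poorly-encoded facts.
```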

Experimental results across multiple datasets (PopQA, Entity-Questions, and MMLU) and models (Llama-7B and Mistral) consistently show that fine-tuning on less popular or less confident examples underperforms compared to using popular knowledge. This performance gap widens for less popular test points, supporting the hypothesis that less popular facts are more sensitive to fine-tuning choices. Surprisingly, even randomly selected subsets outperform fine-tuning on the least popular knowledge, suggesting that including some popular facts can mitigate the negative impact of less popular ones. Also, training on a smaller subset of the most popular facts often performs comparably or better than using the entire dataset. These findings indicate that careful selection of fine-tuning data, focusing on well-known facts, can lead to improved factual accuracy in LLMs, potentially allowing for more efficient and effective training processes.
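
The selection rule these results point to is straightforward to operationalize: rank candidate QA pairs by a popularity signal (for example, page views of the subject entity, or the salience proxy sketched earlier) and fine-tune on the top of that ranking. The field names and the 10% cutoff below are assumptions for illustration, not the paper's protocol.

```python
# Sketch of popularity-based selection of fine-tuning data, reflecting the
# finding that a small, popular subset can match or beat the full dataset.
# The "popularity" field and the 10% cutoff are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class QAExample:
    question: str
    answer: str
    popularity: float  # e.g., subject-entity page views or a model-based salience score

def select_finetuning_subset(examples: List[QAExample],
                             top_fraction: float = 0.10) -> List[QAExample]:
    """Keep the most popular fraction of examples for fine-tuning."""
    ranked = sorted(examples, key=lambda ex: ex.popularity, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[:k]

# Usage: fine-tune on `subset` instead of the full list; per the paper's
# experiments, this tends to help factuality, especially on rare test entities.
examples = [
    QAExample("Who wrote Hamlet?", "William Shakespeare", popularity=9.8e5),
    QAExample("Who wrote The Obscure Chronicle?", "An Obscure Author", popularity=120.0),
]
subset = select_finetuning_subset(examples, top_fraction=0.5)
```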

The study provides significant insights into improving language model factuality through strategic QA dataset composition. Contrary to intuitive assumptions, fine-tuning on well-known facts consistently enhances overall factuality. This finding, observed across various settings and supported by a conceptual model, challenges conventional approaches to QA dataset design. The research opens new avenues for improving language model performance, suggesting potential benefits in regularization techniques to overcome attention imbalance, curriculum learning strategies, and the development of synthetic data for efficient knowledge extraction. These findings provide a foundation for future work aimed at enhancing the factual accuracy and reliability of language models in diverse applications.


Check out the Paper. All credit for this research goes to the researchers of this project.

