Catch me if you can! How to beat GPT-4 with a 13B model

This article examines the contamination problem in language models, points out the shortcomings of existing decontamination methods, proposes the LLM decontaminator, and demonstrates its effectiveness through experiments. It also emphasizes that real-world datasets contain a large number of rephrased samples and recommends that the community adopt stronger decontamination methods.

🦘 The crux of the contamination problem: when test-set information leaks into the training set, model performance is overestimated. Common detection methods such as n-gram overlap and embedding similarity search have limitations, and simple variations of the test data (e.g., rephrasing, translation) can easily bypass them.

💡 A new detection method: the LLM decontaminator uses embedding similarity search to identify similar training items and generate candidate rephrased pairs, then evaluates each pair with an advanced LLM. It removes rephrased samples significantly more effectively than existing methods.

📊 Application to real-world datasets: applying the LLM decontaminator to widely used real-world datasets reveals a large number of rephrased samples; contamination is found in datasets such as CodeAlpaca, MATH, and StarCoder-Data.

📣 Recommendations and call to action: the community should adopt stronger decontamination methods when using public benchmarks; the LLM decontaminator is open-sourced; and the authors call for developing fresh one-time exams to evaluate LLMs accurately.

by: Shuo Yang, Wei-Lin Chiang, Lianmin Zheng*, Joseph E. Gonzalez, Ion Stoica, Nov 14, 2023

Announcing Llama-rephraser: 13B models reaching GPT-4 performance in major benchmarks (MMLU/GSM-8K/HumanEval)! To ensure result validity, we followed OpenAI's decontamination method and found no evidence of data contamination.

What's the trick behind it? Well, rephrasing the test set is all you need! We simply paraphrase a test sample or translate it into a different language. It turns out a 13B LLM is smart enough to "generalize" beyond such variations and reaches drastically high benchmark performance. So, did we just make a big breakthrough? Apparently, there is something wrong with our understanding of contamination.

In this blog post, we point out why contamination is still poorly understood and how existing decontamination measures fail to capture such nuances. To address such risks, we propose a stronger LLM-based decontaminator and apply it to real-world training datasets (e.g., the Stack, RedPajama), revealing significant test overlap with widely used benchmarks. For more technical details, please refer to our paper.

What's Wrong with Existing Decontamination Measures?

Contamination occurs when test-set information is leaked into the training set, resulting in an overly optimistic estimate of the model's performance. Despite being recognized as a crucial issue, understanding and detecting contamination remains an open and challenging problem.

The most commonly used approaches are n-gram overlap and embedding similarity search. N-gram overlap relies on string matching to detect contamination and is widely used by leading developments such as GPT-4, PaLM, and Llama-2. Embedding similarity search uses the embeddings of pre-trained models (e.g., BERT) to find similar and potentially contaminated examples.

However, we show that simple variations of the test data (e.g., paraphrasing, translation) can easily bypass these detection methods. We refer to such variations of test cases as rephrased samples.

Below we demonstrate a rephrased sample from the MMLU benchmark. We show that if such samples are included in the training set, a 13B model can reach drastically high performance (MMLU 85.9). Unfortunately, existing detection methods (e.g., n-gram overlap, embedding similarity) fail to detect such contamination. The embedding similarity approach struggles to distinguish the rephrased question from other questions in the same subject (high school US history).

With similar rephrasing techniques, we observe consistent results in widely used coding and math benchmarks such as HumanEval and GSM-8K (shown in the cover figure). Therefore, being able to detect such rephrased samples becomes critical.

Stronger Detection Method: LLM Decontaminator

To address the risk of possible contamination, we propose a new contamination detection method, the "LLM decontaminator". It involves two steps:

1. For each test case, the LLM decontaminator identifies the top-k training items with the highest similarity using embedding similarity search.
2. From these items, the LLM decontaminator generates k potential rephrased pairs. Each pair is evaluated for rephrasing using an advanced LLM, such as GPT-4.

Results show that our proposed LLM method works significantly better than existing methods at removing rephrased samples. A minimal sketch of this two-step pipeline is shown below.
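As a rough illustration (not the authors' released implementation), the sketch below shows the two-step idea in Python. It assumes the sentence-transformers and openai packages; the embedding model name, the judging prompt, the function name, and the `train_texts`/`test_texts` variables are all placeholders, while the overall structure (embedding top-k retrieval followed by a GPT-4 rephrasing check) follows the description above.

```python
# Minimal sketch of the two-step LLM-decontaminator idea (not the official tool).
# Assumptions: train_texts and test_texts are lists of strings; a
# sentence-transformers model provides embeddings; GPT-4 judges rephrasing.
from sentence_transformers import SentenceTransformer, util
from openai import OpenAI

embedder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # placeholder embedding model
client = OpenAI()  # requires OPENAI_API_KEY in the environment

def find_rephrased_pairs(test_texts, train_texts, top_k=4):
    train_emb = embedder.encode(train_texts, convert_to_tensor=True)
    test_emb = embedder.encode(test_texts, convert_to_tensor=True)
    flagged = []
    for i, test_case in enumerate(test_texts):
        # Step 1: retrieve the top-k most similar training items by embedding similarity.
        hits = util.semantic_search(test_emb[i:i+1], train_emb, top_k=top_k)[0]
        for hit in hits:
            candidate = train_texts[hit["corpus_id"]]
            # Step 2: ask an advanced LLM whether the pair is a rephrasing.
            prompt = (
                "Decide whether the two texts are rephrasings of each other "
                "(the same problem in different words). Answer Yes or No.\n\n"
                f"Text 1:\n{test_case}\n\nText 2:\n{candidate}"
            )
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{"role": "user", "content": prompt}],
                temperature=0,
            )
            if resp.choices[0].message.content.strip().lower().startswith("yes"):
                flagged.append((i, hit["corpus_id"]))
    return flagged
```

The released tool applies the same structure at dataset scale; see the GitHub repository for its exact interface.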
Evaluating Different Detection Methods

To compare different detection methods, we use the MMLU benchmark to construct 200 prompt pairs from the original and rephrased test sets: 100 random pairs and 100 rephrased pairs. The F1 score on these pairs measures each detection method's ability to detect contamination, with higher values indicating more precise detection.

As shown in the following table, except for the LLM decontaminator, all other detection methods introduce some false positives. Both rephrased and translated samples successfully evade n-gram overlap detection. With multi-qa BERT, the embedding similarity search proves ineffective against translated samples. Our proposed LLM decontaminator is more robust in all cases, with the highest F1 scores.

Contamination in Real-World Datasets

We apply the LLM decontaminator to widely used real-world datasets (e.g., the Stack, RedPajama) and identify a substantial amount of rephrased samples. The table below displays the contamination percentage of different benchmarks in each training dataset. Below we show some detected samples.

CodeAlpaca contains 20K synthetic instruction-following examples generated by GPT and is widely used for instruction fine-tuning (e.g., Tulu). A rephrased example in CodeAlpaca is shown below. This suggests contamination may be subtly present in synthetic data generated by LLMs. The Phi-1 report also discovers such semantically similar test samples that are undetectable by n-gram overlap.

MATH is a widely recognized math training dataset that spans various mathematical domains, including algebra, geometry, and number theory. Surprisingly, we even find contamination between the train and test splits of the MATH benchmark, as shown below.

StarCoder-Data is used for training StarCoder and StarCoderBase, and it contains 783GB of code in 86 programming languages. In the StarCoder paper, the code training data was decontaminated by removing files that contained docstrings or solutions from HumanEval. However, there are still some samples detected by the LLM decontaminator.

Use LLM Decontaminator to Scan Your Data

Based on the above study, we suggest the community adopt a stronger decontamination method when using any public benchmarks. Our proposed LLM decontaminator is open-sourced on GitHub. Here we show how to remove rephrased samples from training data using the LLM decontaminator tool; the full example can be found in the repository.

Pre-process training data and test data. The LLM decontaminator accepts the dataset in jsonl format, with each line corresponding to a {"text": data} entry, as in the sketch below.
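As an illustration of that format, here is a hypothetical pre-processing script (not the tool's bundled one); the input file name and its "instruction"/"output" fields are assumptions, and only the {"text": data} jsonl layout comes from the post.

```python
# Hypothetical pre-processing sketch: write one {"text": ...} entry per line.
import json

with open("train_raw.json") as f:          # assumed source file
    records = json.load(f)

with open("train.jsonl", "w") as out:
    for rec in records:
        # Assumed field names; adapt to your dataset's schema.
        text = rec["instruction"] + "\n" + rec.get("output", "")
        out.write(json.dumps({"text": text}) + "\n")
```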
Run end-to-end detection. This step builds a top-k similarity database based on Sentence-BERT and uses GPT-4 to check, one by one, whether the retrieved candidates are rephrased samples. You can select your embedding model and detection model by modifying the tool's parameters.

Conclusion

In this blog, we show that contamination is still poorly understood. With our proposed decontamination method, we reveal significant, previously unknown test overlap in real-world datasets. We encourage the community to rethink benchmarks and contamination in the context of LLMs, and to adopt stronger decontamination tools when evaluating LLMs on public benchmarks. Moreover, we call for the community to actively develop fresh one-time exams to accurately evaluate LLMs. Learn more about our ongoing effort on live LLM eval at Chatbot Arena!

Acknowledgment

We would like to express our gratitude to Ying Sheng for the early discussion on rephrased samples. We also extend our thanks to Dacheng Li, Erran Li, Hao Liu, Jacob Steinhardt, Hao Zhang, and Siyuan Zhuang for providing insightful feedback.

Citation

@misc{yang2023rethinking,
  title={Rethinking Benchmark and Contamination for Language Models with Rephrased Samples},
  author={Shuo Yang and Wei-Lin Chiang and Lianmin Zheng and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2311.04850},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
