Large Language Models Are Memorizing the Datasets Meant to Test Them

 

A new study suggests that if AI relies on memorization rather than genuine learning, its recommendations may be skewed. The researchers found that some AI models 'memorized' the datasets used to evaluate them during training, leading to overestimated performance and recommendations that may be outdated or mismatched to user needs. The study focuses on the MovieLens-1M dataset, finding that several AI models have partially memorized it, which casts doubt on the accuracy of recommendations based on those models, since seemingly intelligent suggestions may simply be the product of data recall.

🧠 The study finds that some AI recommender systems may rely heavily on memorization rather than genuinely learning how to make useful recommendations.

🎬 The researchers focused on the MovieLens-1M dataset and found that several AI models had 'memorized' information from it during training, including movie titles, user attributes, and interaction histories.

🔍 In testing, the researchers found that models such as GPT-4o could recall nearly 80% of the movie titles in MovieLens-1M from a simple prompt.

📈 Memorizing the dataset leads to overestimated performance for AI recommender systems: they score well in tests by effectively 'cheating', recalling training data directly.

⚠️ This means seemingly useful recommendations may be no more than statistical 'echoes' of data the model has already seen, rather than a genuine understanding of user preferences.

If you rely on AI to recommend what to watch, read, or buy, new research indicates that some systems may be basing those results on memory rather than skill: instead of learning to make useful suggestions, the models often recall items from the datasets used to evaluate them, leading to overestimated performance and recommendations that may be outdated or poorly matched to the user.

 

In machine learning, a test split is used to see whether a trained model has learned to solve problems that are similar to, but not identical to, the material it was trained on.

So if a new AI ‘dog-breed recognition' model is trained on a dataset of 100,000 pictures of dogs, it will usually feature an 80/20 split – 80,000 pictures supplied to train the model; and 20,000 pictures held back and used as material for testing the finished model.

Needless to say, if the AI's training data inadvertently includes the ‘secret' 20% test split, the model will ace these tests, because it already knows the answers (it has already seen 100% of the domain data). Of course, this does not accurately reflect how the model will perform later, on new ‘live' data, in a production context.
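As a minimal sketch of the kind of hold-out just described (the picture filenames and breed labels below are placeholders, not a real corpus):

```python
# Minimal sketch of an 80/20 hold-out; the 'pictures' and breed labels are
# stand-ins for a real dataset.
from sklearn.model_selection import train_test_split

pictures = [f"dog_{i}.jpg" for i in range(100_000)]   # stand-in for 100,000 images
breeds = [i % 120 for i in range(100_000)]            # stand-in breed labels

# 80% supplied for training; 20% held back to test the finished model.
train_x, test_x, train_y, test_y = train_test_split(
    pictures, breeds, test_size=0.20, random_state=42
)
print(len(train_x), len(test_x))  # 80000 20000

# If any of test_x leaks into the training corpus, test scores stop
# reflecting performance on genuinely unseen data.
```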

Movie Spoilers

The problem of AI cheating on its exams has grown in step with the scale of the models themselves. Because today's systems are trained on vast, indiscriminate web-scraped corpora such as Common Crawl, the possibility that benchmark datasets (i.e., the held-back 20%) slip into the training mix is no longer an edge case, but the default – a syndrome known as data contamination; and at this scale, the manual curation that could catch such errors is logistically impossible.

This case is explored in a new paper from Italy's Politecnico di Bari, where the researchers focus on the outsized role of a single movie recommendation dataset, MovieLens-1M, which they argue has been partially memorized by several leading AI models during training.

Because this particular dataset is so widely used in the testing of recommender systems, its presence in the models’ memory potentially makes those tests meaningless: what appears to be intelligence may in fact be simple recall, and what looks like an intuitive recommendation skill may just be a statistical echo reflecting earlier exposure.

The authors state:

‘Our findings demonstrate that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories. Notably, a simple prompt enables GPT-4o to recover nearly 80% of [the movie titles in the dataset].

‘None of the examined models are free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets. We observed similar trends in retrieving user attributes and interaction histories.'

The brief new paper is titled Do LLMs Memorize Recommendation Datasets? A Preliminary Study on MovieLens-1M, and comes from six Politecnico researchers. The pipeline to reproduce their work has been made available at GitHub.

Method

To understand whether the models in question were truly learning or simply recalling, the researchers began by defining what memorization means in this context, then tested whether a model could retrieve specific pieces of information from the MovieLens-1M dataset when prompted in just the right way.

If a model was shown a movie’s ID number and could produce its title and genre, that counted as memorizing an item; if it could generate details about a user (such as age, occupation, or zip code) from a user ID, that also counted as user memorization; and if it could reproduce a user’s next movie rating from a known sequence of prior ones, it was taken as evidence that the model may be recalling specific interaction data, rather than learning general patterns.

Each of these forms of recall was tested using carefully written prompts, crafted to nudge the model without giving it new information. The more accurate the response, the more likely it was that the model had already encountered that data during training:

Zero-shot prompting for the evaluation protocol used in the new paper. Source: https://arxiv.org/pdf/2505.10212

Data and Tests

To curate a suitable dataset, the authors surveyed recent papers from two of the field’s major conferences, ACM RecSys 2024 and ACM SIGIR 2024. MovieLens-1M appeared most often, cited in just over one in five submissions. Since earlier studies had reached similar conclusions, this was not a surprising result, but rather a confirmation of the dataset’s dominance.

MovieLens-1M consists of three files: Movies.dat, which lists movies by ID, title, and genre; Users.dat, which maps user IDs to basic biographical fields; and Ratings.dat, which records who rated what, and when.
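As a rough illustration, the three files are plain text with `::`-separated fields, so reading a local copy might look something like the sketch below (the `ml-1m/` paths and encoding are assumptions about a standard download, not the authors' code):

```python
# Rough sketch of reading the three MovieLens-1M files. The '::' separator and
# field order follow the dataset's documented format; paths and the latin-1
# encoding are assumptions about a standard local copy.
def read_dat(path: str) -> list[list[str]]:
    with open(path, encoding="latin-1") as f:
        return [line.rstrip("\n").split("::") for line in f]

movies  = read_dat("ml-1m/movies.dat")   # [MovieID, Title, Genres]
users   = read_dat("ml-1m/users.dat")    # [UserID, Gender, Age, Occupation, Zip-code]
ratings = read_dat("ml-1m/ratings.dat")  # [UserID, MovieID, Rating, Timestamp]
```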

To find out whether this data had been memorized by large language models, the researchers turned to prompting techniques first introduced in the paper Extracting Training Data from Large Language Models, and later adapted in the subsequent work Bag of Tricks for Training Data Extraction from Language Models.

The method is direct: pose a question that mirrors the dataset format and see if the model answers correctly. Zero-shot, Chain-of-Thought, and few-shot prompting were tested, and it was found that the last method, in which the model is shown a few examples, was the most effective; even if more elaborate approaches might yield higher recall, this was considered sufficient to reveal what had been remembered.

Few-shot prompt used to test whether a model can reproduce specific MovieLens-1M values when queried with minimal context.
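By way of illustration, a few-shot probe in this spirit might look like the sketch below; the wording and the in-context examples are illustrative, not the prompt used in the paper:

```python
# Illustrative few-shot probe for item recall (not the paper's exact prompt).
# The model sees a couple of records in the Movies.dat format, then is asked
# to complete the next entry from its MovieID alone.
few_shot_prompt = """Complete the record in the format MovieID::Title::Genres.

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::"""
```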

To measure memorization, the researchers defined three forms of recall: item, user, and interaction. These tests examined whether a model could retrieve a movie title from its ID, generate user details from a UserID, or predict a user's next rating based on earlier ones. Each was scored using a coverage metric* that reflected how much of the dataset could be reconstructed through prompting.
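A hedged sketch of how such a coverage score can be computed follows; the function and variable names are hypothetical rather than taken from the authors' pipeline:

```python
# Sketch of a coverage-style score: the share of dataset entries the model
# reproduces correctly when prompted. Names are illustrative only.
def coverage(ground_truth: dict[str, str], model_answers: dict[str, str]) -> float:
    hits = sum(
        1 for key, expected in ground_truth.items()
        if model_answers.get(key, "").strip() == expected
    )
    return hits / len(ground_truth)

# e.g. 800 correct recalls out of 1,000 entries -> coverage of 0.8 (80%).
```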

The models tested were GPT-4o; GPT-4o mini; GPT-3.5 turbo; Llama-3.3 70B; Llama-3.2 3B; Llama-3.2 1B; Llama-3.1 405B; Llama-3.1 70B; and Llama-3.1 8B. All were run with temperature set to zero, top_p set to one, and both frequency and presence penalties disabled. A fixed random seed ensured consistent output across runs.
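For the GPT models, that decoding configuration might be expressed roughly as follows via the OpenAI Python client; the model name, seed value, and prompt are placeholders, and the Llama models would be queried analogously through whichever serving stack hosts them:

```python
# Rough sketch of the deterministic decoding settings described above, via the
# OpenAI Python client; model name, seed, and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Complete: 1::Toy Story (1995)::"}],
    temperature=0,          # greedy decoding
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    seed=42,                # fixed seed for (best-effort) reproducible output
)
print(response.choices[0].message.content)
```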

Proportion of MovieLens-1M entries retrieved from movies.dat, users.dat, and ratings.dat, with models grouped by version and sorted by parameter count.

To probe how deeply MovieLens-1M had been absorbed, the researchers prompted each model for exact entries from the dataset’s three (aforementioned) files: Movies.dat, Users.dat, and Ratings.dat.

Results from the initial tests, shown above, reveal sharp differences not only between GPT and Llama families, but also across model sizes. While GPT-4o and GPT-3.5 turbo recover large portions of the dataset with ease, most open-source models recall only a fraction of the same material, suggesting uneven exposure to this benchmark in pretraining.

These are not small margins. Across all three files, the strongest models did not simply outperform weaker ones, but recalled entire portions of MovieLens-1M.

In the case of GPT-4o, the coverage was high enough to suggest that a nontrivial share of the dataset had been directly memorized.

The authors state:

‘Our findings demonstrate that LLMs possess extensive knowledge of the MovieLens-1M dataset, covering items, user attributes, and interaction histories.

‘Notably, a simple prompt enables GPT-4o to recover nearly 80% of MovieID::Title records. None of the examined models are free of this knowledge, suggesting that MovieLens-1M data is likely included in their training sets.

‘We observed similar trends in retrieving user attributes and interaction histories.'

Next, the authors tested for the impact of memorization on recommendation tasks by prompting each model to act as a recommender system. To benchmark performance, they compared the output against seven standard methods: UserKNN; ItemKNN; BPRMF; EASER; LightGCN; MostPop; and Random.

The MovieLens-1M dataset was split 80/20 into training and test sets, using a leave-one-out sampling strategy to simulate real-world usage. The metrics used were Hit Rate (HR@[n]) and nDCG@[n]:

Recommendation accuracy on standard baselines and LLM-based methods. Models are grouped by family and ordered by parameter count, with bold values indicating the highest score within each group.
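For reference, under leave-one-out evaluation each user has a single held-out item, so the two metrics reduce to simple per-user scores, roughly as sketched below (illustrative names, not the authors' code):

```python
# Sketch of Hit Rate and nDCG at cutoff n for leave-one-out evaluation:
# each user has exactly one held-out item and the model returns a ranked list.
import math

def hit_rate_at_n(held_out: str, ranked: list[str], n: int) -> float:
    return 1.0 if held_out in ranked[:n] else 0.0

def ndcg_at_n(held_out: str, ranked: list[str], n: int) -> float:
    # With a single relevant item, ideal DCG is 1, so nDCG reduces to the
    # discounted gain at the item's rank (0 if it falls outside the top n).
    if held_out in ranked[:n]:
        rank = ranked[:n].index(held_out)   # 0-based position
        return 1.0 / math.log2(rank + 2)
    return 0.0

# Averaging these per-user scores over all test users gives HR@n and nDCG@n.
```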

Here several large language models outperformed traditional baselines across all metrics, with GPT-4o establishing a wide lead in every column, and even mid-sized models such as GPT-3.5 turbo and Llama-3.1 405B consistently surpassing benchmark methods such as BPRMF and LightGCN.

Among smaller Llama variants, performance varied sharply, but Llama-3.2 3B stands out, with the highest HR@1 in its group.

The results, the authors suggest, indicate that memorized data can translate into measurable advantages in recommender-style prompting, particularly for the strongest models.

In an additional observation, the researchers continue:

‘Although the recommendation performance appears outstanding, comparing Table 2 with Table 1 reveals an interesting pattern. Within each group, the model with higher memorization also demonstrates superior performance in the recommendation task.

‘For example, GPT-4o outperforms GPT-4o mini, and Llama-3.1 405B surpasses Llama-3.1 70B and 8B.

‘These results highlight that evaluating LLMs on datasets leaked in their training data may lead to overoptimistic performance, driven by memorization rather than generalization.'

Regarding the impact of model scale on this issue, the authors observed a clear correlation between size, memorization, and recommendation performance, with larger models not only retaining more of the MovieLens-1M dataset, but also performing more strongly in downstream tasks.

Llama-3.1 405B, for example, showed an average memorization rate of 12.9%, while Llama-3.1 8B retained only 5.82%. This nearly 55% reduction in recall corresponded to a 54.23% drop in nDCG and a 47.36% drop in HR across evaluation cutoffs.

The pattern held throughout – where memorization decreased, so did apparent performance:

‘These findings suggest that increasing the model scale leads to greater memorization of the dataset, resulting in improved performance.

‘Consequently, while larger models exhibit better recommendation performance, they also pose risks related to potential leakage of training data.'

The final test examined whether memorization reflects the popularity bias baked into MovieLens-1M. Items were grouped by frequency of interaction, and the chart below shows that larger models consistently favored the most popular entries:

Item coverage by model across three popularity tiers: top 20% most popular; middle 20% moderately popular; and the bottom 20% least interacted items.

GPT-4o retrieved 89.06% of top-ranked items but only 63.97% of the least popular. GPT-4o mini and smaller Llama models showed much lower coverage across all bands. The researchers state that this trend suggests that memorization not only scales with model size, but also amplifies preexisting imbalances in the training data.

They continue:

‘Our findings reveal a pronounced popularity bias in LLMs, with the top 20% of popular items being significantly easier to retrieve than the bottom 20%.

‘This trend highlights the influence of the training data distribution, where popular movies are overrepresented, leading to their disproportionate memorization by the models.'
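As a rough sketch of how items can be binned into such popularity tiers before measuring per-tier coverage (assumed variable names, not the authors' pipeline):

```python
# Sketch: bin items into popularity tiers by interaction count, then measure
# per-tier recall coverage. Variable names are illustrative only.
from collections import Counter

def popularity_tiers(interactions: list[tuple[int, int]]) -> dict[str, set[int]]:
    """interactions: (user_id, item_id) pairs, e.g. parsed from Ratings.dat."""
    counts = Counter(item for _, item in interactions)
    ranked = [item for item, _ in counts.most_common()]
    k = len(ranked) // 5                      # 20% of the catalogue
    mid = (len(ranked) - k) // 2
    return {
        "top_20pct": set(ranked[:k]),
        "middle_20pct": set(ranked[mid:mid + k]),
        "bottom_20pct": set(ranked[-k:]),
    }

def tier_coverage(recalled: set[int], tier: set[int]) -> float:
    return len(recalled & tier) / len(tier) if tier else 0.0
```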

Conclusion

The dilemma is no longer novel: as training sets grow, the prospect of curating them diminishes in inverse proportion. MovieLens-1M, perhaps among many others, enters these vast corpora without oversight, anonymous amidst the sheer volume of data.

The problem repeats at every scale and resists automation. Any solution demands not just effort but human judgment – the slow, fallible kind that machines cannot supply. In this respect, the new paper offers no way forward.

 

* A coverage metric in this context is a percentage that shows how much of the original dataset a language model is able to reproduce when asked the right kind of question. If a model is prompted with a movie ID and responds with the correct title and genre, that counts as a successful recall. The total number of successful recalls is then divided by the total number of entries in the dataset to produce a coverage score. For example, if a model correctly returns information for 800 out of 1,000 items, its coverage would be 80 percent.

First published Friday, May 16, 2025

The post Large Language Models Are Memorizing the Datasets Meant to Test Them appeared first on Unite.AI.
