Communications of the ACM - Artificial Intelligence
The Collapse of GPT

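This article examines the problem of “model collapse” in large language models (LLMs): the gradual decline in performance that occurs when a model is repeatedly trained on text it generated itself. It analyzes the causes and effects of model collapse and possible remedies, including controlling and improving data quality. Through a close look at model training data, it describes the challenges and opportunities facing LLM development.

🤖 Model collapse is the drop in output quality that occurs when LLMs are trained on their own generated text and the training data no longer matches the real-world data distribution, a result akin to “garbage in, garbage out.”

📉 The main cause of model collapse is that LLM-generated text is a finite sample. Reusing it makes the model “forget” low-frequency but important information, so the model drifts toward higher-probability events.

💡 One remedy for model collapse is to “curate” synthetic data, for example through human screening, evaluation by other LLMs, and preset rules, raising its quality so that it more closely matches the original data distribution.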

Ever since ChatGPT was released to the public in November 2022, people have been using it to generate text, from emails to blog posts to bad poetry, much of which they post online. Since that release, the companies that build the large language models (LLMs) on which such chatbots are based, such as OpenAI, whose GPT-3.5 underlies ChatGPT, have also continued to put out newer versions of their models, training them with new text data, some of which they scrape from the Web. That means, inevitably, that some of the training data used to create LLMs did not come from humans, but from the LLMs themselves.

That has led computer scientists to worry about a phenomenon they call model collapse. Basically, model collapse happens when the training data no longer matches real-world data, leading the new LLM to produce gibberish, in a 21st-century version of the classic computer aphorism “garbage in, garbage out.”

LLMs work by learning the statistical distribution of so-called tokens—words or parts of words—within a language by examining billions of sentences garnered from sources including book databases, Wikipedia, and the Common Crawl dataset, a collection of material gathered from the Internet. An LLM, for instance, will figure out how often the word “president” is associated with the word “Obama” versus “Trump” versus “Hair Club for Men.” Then, when prompted by a request, it will produce words that it reasons have the highest probability of meeting that request and of following from previous words. The results bear a credible resemblance to human-written text.
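To make the idea of learning a token distribution concrete, here is a minimal Python sketch. It is a toy bigram counter, not how production LLMs are built (those use neural networks over subword tokens and much longer contexts). The three-sentence corpus and the function names `train_bigram` and `next_token_probs` are assumptions invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn the raw follow-up counts for `prev` into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

# Hypothetical toy corpus, invented for this example.
corpus = [
    "president obama spoke today",
    "president obama signed the bill",
    "president trump spoke today",
]
model = train_bigram(corpus)
probs = next_token_probs(model, "president")

# Greedy decoding: pick the follow-up token with the highest estimated probability.
most_likely = max(probs, key=probs.get)
print(probs, most_likely)
```

In this toy corpus, “obama” follows “president” in two of three sentences, so the greedy choice after “president” is “obama”. A real model makes the same kind of choice, but over a vocabulary of tens of thousands of tokens and conditioned on everything that came before.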

Model collapse is basically a statistical problem, said Sanmi Koyejo, an assistant professor of computer science at Stanford University. When machine-generated text replaces human-generated text, the distribution of tokens no longer matches the natural distribution produced by humans. As a result, the training data for a new round of modeling does not match the real world, and the new model’s output gets worse. “The thing we’re worried about is that the distribution of your data that you end up with, if you’re trying to fit your model, ends up really far from the actual distribution that generated the data,” he said.

The problem arises because whatever text the LLM generates would be, at most, a subsample of the sentences on which it was trained. “Because you generate a finite sample, you have some probability of not sampling them,” said Yarin Gal, an associate professor of machine learning at Oxford University, referring to rare events. “Once you don’t sample, then they disappear. They will never appear again. So every time you generate data, you basically start forgetting more and more of the tail events and therefore that leads to the concentration of the higher probability events.” Gal and his colleagues published a study in Nature in July 2024 showing that indiscriminate use of what they called “recursively generated data” caused the models to fail.
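The loss of tail events described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not a reproduction of the Nature study: the “model” is just a table of token frequencies, and each generation is refit by counting a finite sample drawn from the previous one. The token names, probabilities, and sample size are made up.

```python
import random
from collections import Counter

def resample_generation(probs, n_samples, rng):
    """Draw a finite sample from `probs`, then refit by counting frequencies."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # Tokens that were never sampled get no entry, so they are gone for good.
    return {t: c / n_samples for t, c in counts.items()}

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "human" distribution: 5 common tokens and 50 rare ones.
probs = {f"common{i}": 0.18 for i in range(5)}
probs.update({f"rare{i}": 0.002 for i in range(50)})

support_sizes = [len(probs)]  # number of distinct tokens the model can still emit
for _ in range(10):
    probs = resample_generation(probs, n_samples=200, rng=rng)
    support_sizes.append(len(probs))

print(support_sizes)
```

The number of distinct tokens can only stay the same or shrink from one generation to the next, because a token that is missing from one sample has zero probability in every later generation. The rare tokens vanish first, and the probability mass they held shifts to the common ones.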

The problem is not limited to LLMs. Any generative model that is iteratively trained can suffer the same fate if it starts ingesting machine-produced data, Gal said. That includes diffusion models that create images, such as Stable Diffusion and DALL-E. The issue also can affect variational autoencoders, which create new data samples by producing variations of their original data. It can apply to Gaussian mixture models, a form of unsupervised machine learning that sorts subpopulations of data into clusters; they are used to analyze customer preferences, predict stock prices, and analyze gene expression.

Collapse is not a danger for models that incorporate synthetic data but only do so once, such as neural networks used to identify cancer in medical images, where synthetic data was used to augment rare or expensive real data. “The main distinction is that model collapse happens when you have multiple steps, where each step depends on the output from the previous step,” Gal said.

The theory that replacing training data with synthetic data will quickly lead to the demise of LLMs is sound, Koyejo said. In practice, however, not all human data gets replaced immediately. Instead, when the generated text is scraped from the Internet, it gets mixed in with human text. “You create synthetic data, you add that to real data, so you now have more data, which is real data plus synthetic data,” he said. What is actually happening, he said, is not data replacement, but data accumulation. That slows the degradation of the dataset.
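The difference between replacing data and accumulating it can be sketched with a one-dimensional Gaussian standing in for a language model. This is an illustrative toy, not the experimental setup of any research group: the function names, sample size, and generation count are assumptions chosen so the effect is easy to see.

```python
import math
import random

def fit_gaussian(data):
    """Maximum-likelihood mean and standard deviation of a list of numbers."""
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, math.sqrt(var)

def simulate(strategy, n=10, generations=300, seed=0):
    """Refit and resample a Gaussian repeatedly; return the final fitted std."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for human data
    train = list(real)
    for _ in range(generations):
        mean, std = fit_gaussian(train)
        synthetic = [rng.gauss(mean, std) for _ in range(n)]
        if strategy == "replace":
            train = synthetic          # only the newest model output survives
        else:
            train = train + synthetic  # "accumulate": keep everything so far
    return fit_gaussian(train)[1]

replace_std = simulate("replace")
accumulate_std = simulate("accumulate")
print(f"replace: {replace_std:.3g}, accumulate: {accumulate_std:.3g}")
```

Under replacement, each generation sees only the previous generation's output, so estimation errors compound and the fitted spread shrinks toward zero. Under accumulation, the original data never leaves the training set and each new batch is a smaller fraction of the whole, so the fitted spread changes less and less per generation.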

Simply accumulating data may stop model collapse but can cause other problems if done without thought, said Yunzhen Feng, a Ph.D. student at the Center for Data Science at New York University. As a rule, the performance of neural networks improves as their size increases. Naively mixing real and synthetic data together, however, can slow that improvement. “You can still obtain similar performance, but you need much more data. That means you’re using much more compute and much more money to achieve that,” he said.

One challenge is that there is no easy way to tell whether text found on the Internet is synthetic or human-generated. Though there have been attempts to automatically identify text from LLMs, none have been entirely successful. Research into this problem is ongoing, Gal said.

Solving with curation

There are ways, however, to make the addition of synthetic data less of a problem.

One approach is to curate the synthetic data to make sure it is of good quality. Some curation happens naturally, Gal said; people do not post everything their chatbot creates to the Internet, weeding out the material that contains false information or simply does not make sense, so that improves the training set.

Curation can also be a deliberate process to make sure high-quality data goes into a training set. Feng, for instance, has experimented with asking the LLM to assess the quality of its own output. LLMs select the words they estimate have the highest probability of fitting into a context. In doing so, they internally generate a score rating how confident they are that they are pairing the best words together. That same mechanism can be used to assess already generated text to rate its quality, with low-scoring results removed or the highest-scoring result of several attempts selected as the best. The idea is similar to a method used to fine-tune LLMs called reinforcement learning from human feedback (RLHF), in which people rate or rank model outputs and those judgments are used to push the models toward producing results people prefer. In this case, though, the LLM is generating its own feedback.
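The two selection strategies described above, dropping low scorers and keeping the best of several attempts, can be sketched as follows. This is a minimal illustration, not Feng's implementation: a hypothetical table of per-token probabilities (`TOKEN_PROBS`) stands in for a real model's confidence scores, and the candidate strings and threshold are invented.

```python
import math

# Hypothetical per-token probabilities standing in for a real model's output
# distribution; a real LLM would condition each token on its context.
TOKEN_PROBS = {"the": 0.30, "cat": 0.20, "sat": 0.20, "mat": 0.15, "on": 0.10}
UNKNOWN_PROB = 0.001  # assumed floor for tokens the toy model has never seen

def mean_logprob(text):
    """Average log-probability per token; higher means more confident."""
    tokens = text.lower().split()
    return sum(math.log(TOKEN_PROBS.get(t, UNKNOWN_PROB)) for t in tokens) / len(tokens)

def best_of_n(candidates):
    """Keep the single highest-scoring candidate out of several attempts."""
    return max(candidates, key=mean_logprob)

def filter_by_score(candidates, threshold):
    """Drop every candidate whose score falls below `threshold`."""
    return [c for c in candidates if mean_logprob(c) >= threshold]

candidates = [
    "the cat sat on the mat",
    "the cat zxq on the mat",
    "mat blorp sat qqq",
]
best = best_of_n(candidates)
kept = filter_by_score(candidates, threshold=-2.0)
print(best, kept)
```

Averaging the log-probability per token, rather than summing it, keeps longer texts from being penalized simply for having more tokens. Where to set the threshold is a judgment call that depends on the model and the data.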

How well that works varies by case, Feng said. The feedback can be improved by having other LLMs assess the same text and combining the results from different models. Including human assessments also improves the outcomes, as does applying some pre-written rules about what the output should look like. Eliminating lower-quality results from the synthetic data makes the generated data more closely resemble original data, he said. “It’s like you have a distribution of the synthetic data, you have a distribution of the real data, and you want to close the gap between them as much as possible,” he said.
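Combining several assessments with pre-written rules might look like the sketch below. Everything in it is an assumption made for illustration: the two “judges” are simple heuristics standing in for calls to different LLMs or for human ratings, and the rules and the `min_score` cutoff are invented.

```python
from statistics import mean

def passes_rules(text):
    """Hypothetical hand-written rules: enough words, sentence-final punctuation."""
    words = text.split()
    return len(words) >= 3 and text.rstrip().endswith((".", "!", "?"))

def curate(samples, judges, min_score=0.6):
    """Keep samples that pass the rules and whose average judge score is high enough."""
    kept = []
    for text in samples:
        if not passes_rules(text):
            continue
        avg = mean(judge(text) for judge in judges)
        if avg >= min_score:
            kept.append(text)
    return kept

# Stand-in judges returning scores in [0, 1]. In practice these would be
# different LLMs or human raters; these heuristics are invented.
def judge_length(text):
    return min(len(text.split()) / 8, 1.0)

def judge_vocabulary(text):
    words = text.lower().split()
    return len(set(words)) / len(words)  # penalizes heavy repetition

samples = [
    "Model collapse is a statistical problem.",
    "data data data data data data.",
    "too short",
]
curated = curate(samples, [judge_length, judge_vocabulary])
print(curated)
```

Here the repetitive sample passes the rules but is rejected by the averaged judge score, while the two-word sample is rejected by the rules before any judge sees it. Averaging is only one way to combine judges; taking the minimum score, or requiring a majority vote, would be stricter alternatives.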

Improving the quality of synthetic data could also help with another challenge LLMs are facing as they try to improve: a dearth of new data on which to train. Scientists from Epoch AI, a research institute that focuses on trends in AI, have predicted the world will run out of new text to train on sometime between 2026 and 2032. With no new data on which to train future generations of LLMs, progress could stagnate. “The interesting question is, can synthetic data lead to not just stagnation but actual improvement in the model?” asked Pablo Villalobos, a staff researcher at Epoch.

With curation of high-quality synthetic data, he said, the question becomes “whether this can be done iteratively so that each model generates better data that is used to train another model in basically the opposite of model collapse, in some virtuous circle.” He is not yet sure whether such improvement is possible, but sees some signs it could be.

Other issues arise from training new models on generated data that do not quite reach the level of model collapse. For instance, Koyejo said, synthetic data could increase the likelihood that LLMs will discriminate against people in minority groups. Because any minority is by definition a smaller part of the data distribution, losing the tails of the distribution could make minorities disappear entirely. “Data tends to anchor on majority subgroups,” he said. “It tends to be good at capturing the most popular themes and less good at capturing tails. So less represented demographics can get erased in various ways.”

While such erasure is something that could happen, he added, the issue has not been well studied. His colleague Diyi Yang, an assistant professor in the natural language processing group at Stanford, said there has been very little research into the question of how model collapse affects diversity issues. “Part of the reason is that, if you think about any existing big models, a lot of the training dynamics or checkpoints of those models actually are not really transparent or publicly available,” she said.

In the end, Gal argued, model collapse is an important consideration, but not the matter of imminent disaster that some news coverage has made it out to be. “It’s a matter for the tech companies who build these models to be aware of how the models are being used and how the models are being trained, in order to avoid training on synthetic data that they themselves generated.”
