MarkTechPost@AI | July 29, 2024
This AI Paper from Stanford Provides New Insights on AI Model Collapse and Data Accumulation

Researchers at Stanford University conducted a study examining how data accumulation affects model collapse when training generative AI models. Unlike prior work that focused on data replacement, this study simulates the continuous accumulation of synthetic data in internet-scale datasets. Experiments with transformers, diffusion models, and variational autoencoders across a range of data types show that accumulating synthetic data alongside real data prevents model collapse, in contrast to the performance degradation observed under data-replacement scenarios.

🤔 By simulating the continuous accumulation of synthetic data in internet-scale datasets, the researchers studied how data accumulation affects model collapse in generative AI models.

📊 Experiments with transformers, diffusion models, and variational autoencoders across various data types show that accumulating synthetic data alongside real data prevents model collapse.

💡 The researchers extend an existing analysis of sequential linear models to prove that data accumulation yields a finite, well-controlled upper bound on test error, in contrast to the linear error growth observed under data replacement.

🧪 The researchers experimentally studied model collapse in generative AI using causal transformers, diffusion models, and variational autoencoders on text, molecular, and image datasets.

🧬 The researchers tested GeoDiff diffusion models on GEOM-Drugs molecular conformation data, comparing replacement and accumulation strategies. Test loss increased when data was replaced, but performance remained stable when data was accumulated.

🖼️ Using VAEs on CelebA face images, the researchers compared replacement and accumulation strategies. Replacing data caused rapid model collapse, with rising test error and declining image quality and diversity; accumulating data slowed the collapse substantially, preserving the major modes of variation across iterations while losing some fine detail.

🧠 The study shows that although earlier work found that training on model outputs degrades performance, model collapse can be prevented by training on a mixture of real and synthetic data.

📝 The findings suggest that the "curse of recursion" may be less severe than previously believed, as long as synthetic data is accumulated alongside real data rather than replacing it entirely.

📡 The work highlights the need for further research into the long-term effects of training models on ever-expanding mixed datasets of real and synthetic data, reflecting the dynamic nature of web-scale information.

Large-scale generative models like GPT-4, DALL-E, and Stable Diffusion have transformed artificial intelligence, demonstrating remarkable capabilities in generating text, images, and other media. However, as these models become more prevalent, a critical challenge emerges: the consequences of training generative models on datasets that contain their own outputs. This issue, known as model collapse, poses a significant threat to the future development of AI. As generative models are trained on web-scale datasets that increasingly include AI-generated content, researchers are grappling with the potential degradation of model performance over successive iterations, which could render newer models ineffective and compromise the quality of training data for future AI systems.

Prior research has investigated model collapse through various methods, including replacing real data with generated data, augmenting fixed datasets, and mixing real and synthetic data. Most studies maintained constant dataset sizes and mixing proportions. Theoretical work has focused on understanding model behavior when synthetic data is integrated, analyzing high-dimensional regression, self-distillation effects, and the tails of language model outputs. Some researchers identified phase transitions in error scaling laws and proposed mitigation strategies. However, these studies primarily considered a fixed amount of training data per iteration; few explored the effects of accumulating data over time, a setting that closely resembles how internet-based datasets actually evolve. This gap highlights the need to investigate the long-term consequences of training models on continuously expanding datasets that include both real and synthetic data, reflecting the dynamic nature of web-scale information.

Researchers from Stanford University present a study that explores the impact of accumulating data on model collapse in generative AI models. Unlike previous research focusing on data replacement, this approach simulates the continuous accumulation of synthetic data in internet-based datasets. Experiments with transformers, diffusion models, and variational autoencoders across various data types reveal that accumulating synthetic data alongside real data prevents model collapse, in contrast to the performance degradation observed when data is replaced. The researchers extend an existing analysis of sequential linear models to prove that data accumulation results in a finite, well-controlled upper bound on test error, independent of the number of model-fitting iterations. This finding contrasts with the linear error growth seen in data-replacement scenarios.
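
To make the replace-versus-accumulate contrast concrete, here is a minimal, hypothetical numpy sketch (not code from the paper) that mimics the sequential linear-regression setting described above. At each iteration a linear model is fit, synthetic labels are sampled from it, and the next model is trained either on the newest synthetic batch alone ("replace") or on all data gathered so far, including the original real data ("accumulate"). Tracking test error typically shows steady growth in the first case and a plateau in the second, in line with the study's theoretical claim.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_per_iter, n_iters, sigma = 20, 200, 30, 1.0

w_true = rng.normal(size=d)
X_test = rng.normal(size=(2000, d))
y_test = X_test @ w_true  # noiseless targets for measuring test error

def fit(X, y):
    # Ordinary least squares via the pseudo-inverse.
    return np.linalg.pinv(X) @ y

def run(strategy):
    # Iteration 0: fit on real (noisy) data.
    X = rng.normal(size=(n_per_iter, d))
    y = X @ w_true + sigma * rng.normal(size=n_per_iter)
    X_all, y_all = X.copy(), y.copy()
    w_hat = fit(X_all, y_all)
    errors = []
    for _ in range(n_iters):
        errors.append(np.mean((X_test @ w_hat - y_test) ** 2))
        # Generate a fresh synthetic batch from the current fitted model.
        X_new = rng.normal(size=(n_per_iter, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=n_per_iter)
        if strategy == "replace":
            X_all, y_all = X_new, y_new              # train only on the newest synthetic batch
        else:  # "accumulate"
            X_all = np.vstack([X_all, X_new])        # keep real data plus all synthetic batches
            y_all = np.concatenate([y_all, y_new])
        w_hat = fit(X_all, y_all)
    return errors

print("replace   :", [round(e, 3) for e in run("replace")[::10]])
print("accumulate:", [round(e, 3) for e in run("accumulate")[::10]])
```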

The researchers experimentally investigated model collapse in generative AI using causal transformers, diffusion models, and variational autoencoders across text, molecular, and image datasets, as detailed below.

    Transformer-Based Causal Language Modeling:

To test model collapse in transformer-based language models, the researchers used GPT-2 and Llama2 architectures of various sizes, pre-trained on TinyStories. They compared data replacement and data accumulation strategies over multiple iterations. Results consistently showed that replacing data increased test cross-entropy (i.e., worsened performance) across all model configurations and sampling temperatures, whereas accumulating data maintained or improved performance over iterations. Lower sampling temperatures accelerated the error increase when replacing data, but the overall trend remained consistent. These findings strongly support the hypothesis that data accumulation prevents model collapse in language modeling tasks, while data replacement leads to progressive performance degradation.
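
The iterative protocol itself is simple to express. Below is a minimal, hypothetical Python skeleton (not the authors' code) of the replace-versus-accumulate loop; `train_model` and `sample_from` are placeholder callables standing in for pre-training a GPT-2/Llama2-style model on the current corpus and sampling new documents from it.

```python
from typing import Callable, List

def iterate_generations(
    real_data: List[str],
    train_model: Callable[[List[str]], object],
    sample_from: Callable[[object, int], List[str]],
    n_iterations: int,
    strategy: str = "accumulate",
) -> List[object]:
    """Train a model, sample synthetic data from it, and repeat,
    either replacing the corpus or accumulating it each round."""
    models = []
    corpus = list(real_data)
    for _ in range(n_iterations):
        model = train_model(corpus)                      # e.g. fine-tune a language model on the current corpus
        models.append(model)
        synthetic = sample_from(model, len(real_data))   # e.g. sample documents at some temperature
        if strategy == "replace":
            corpus = synthetic                           # next round sees only the newest synthetic batch
        else:
            corpus = corpus + synthetic                  # next round sees real data plus all synthetic so far
    return models

# Toy usage with trivial stand-ins (a real run would plug in actual training and sampling):
models = iterate_generations(
    real_data=["once upon a time"],
    train_model=lambda corpus: {"corpus_size": len(corpus)},
    sample_from=lambda model, k: ["synthetic story"] * k,
    n_iterations=3,
)
print([m["corpus_size"] for m in models])  # accumulation grows the training set each round: [1, 2, 3]
```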

    Diffusion Models on Molecular Conformation Data: 

Researchers tested GeoDiff diffusion models on GEOM-Drugs molecular conformation data, comparing data replacement and accumulation strategies. Results showed increasing test loss when data was replaced but stable performance when data was accumulated. Unlike in the language-model experiments, significant degradation occurred mainly in the first iteration that used synthetic data. These findings further support data accumulation as a method to prevent model collapse across different AI domains.

    Variational Autoencoders (VAEs) on Image Data:

Researchers used VAEs on CelebA face images, again comparing data replacement and accumulation strategies. Replacing data led to rapid model collapse, with increasing test error and decreasing image quality and diversity. Accumulating data significantly slowed the collapse, preserving the major modes of variation but losing minor details over iterations. Unlike in the language-model experiments, accumulation still showed slight performance degradation. These findings support data accumulation's benefits in mitigating model collapse across AI domains while highlighting that its effectiveness varies with model type and dataset.

This research investigates model collapse in AI, a concern as AI-generated content increasingly appears in training datasets. While previous studies showed that training on model outputs can degrade performance, this work demonstrates that model collapse can be prevented by training on a mixture of real and synthetic data. The findings, supported by experiments across various AI domains and theoretical analysis for linear regression, suggest that the “curse of recursion” may be less severe than previously thought, as long as synthetic data is accumulated alongside real data rather than replacing it entirely.


Check out the Paper. All credit for this research goes to the researchers of this project.

