Elon Musk agrees that we’ve exhausted AI training data

TechCrunch News, January 9


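AI experts broadly agree that the real-world data available for training AI models is largely exhausted. Musk says the cumulative sum of human knowledge has essentially been used up in AI training, and he predicts that AI will shift to self-learning on synthetic data. Several tech giants, including Microsoft and Meta, have already begun training their AI models on synthetic data. Synthetic data can lower training costs and speed up model development. It also carries risks: it can make models less “creative,” introduce bias, and ultimately compromise their functionality. Capturing the benefits of synthetic data while avoiding those risks is therefore a pressing problem for the AI field.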

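⚠️ AI training faces data exhaustion: Musk and other experts believe AI model training has essentially used up the sum of human knowledge, and real-world data is running out.

🚀 Synthetic data is the new trend: AI models will shift to self-learning on synthetic data, and several tech giants already train their models with it.

💰 Synthetic data has clear advantages: it can significantly reduce training costs. For example, Writer says its Palmyra X 004 model cost just $700,000 to develop.

⚠️ Synthetic data also carries risks: it can make models less “creative,” introduce bias, and even compromise model functionality, so it needs to be used with care.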

Elon Musk concurs with other AI experts that there’s little real-world data left to train AI models on.

“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk said during a conversation with Stagwell chairman Mark Penn that was live-streamed on X late Wednesday. “That happened basically last year.”

Musk, who owns AI company xAI, echoed themes former OpenAI chief scientist Ilya Sutskever touched on at NeurIPS, the machine learning conference, during an address in December. Sutskever, who said the AI industry had reached what he called “peak data,” predicted a lack of training data will force a shift away from the way models are developed today.

Indeed, Musk suggested that synthetic data — data generated by AI models themselves — is the path forward. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning,” he said.

Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated.

Microsoft’s Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google’s Gemma models. Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet. And Meta fine-tuned its most recent Llama series of models using AI-generated data.

Training on synthetic data has other advantages, like cost savings. AI startup Writer claims its Palmyra X 004 model, which was developed using almost entirely synthetic sources, cost just $700,000 to develop, compared to estimates of $4.6 million for a comparably sized OpenAI model.

But there are disadvantages as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Because models create synthetic data, if the data used to train these models has biases and limitations, their outputs will be similarly tainted.
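
The collapse dynamic can be illustrated with a toy simulation. The sketch below is not drawn from the research the article alludes to. It assumes a deliberately simple "model" (a Gaussian fitted by maximum likelihood) that is retrained each generation only on samples drawn from its own previous fit. The function name `simulate_collapse` and its parameters are hypothetical, chosen purely for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation at each generation, starting
    with the assumed "real" distribution (mean 0, standard deviation 1).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for real-world data
    history = [std]
    for _ in range(generations):
        # "Synthetic data": samples produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Training": refit the model on its own output only.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)  # maximum-likelihood estimate
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 200):
        print(f"generation {generation:3d}: std = {history[generation]:.3e}")
```

With only a handful of samples per generation, the fitted standard deviation tends to shrink toward zero as generations accumulate, a loose analogue of outputs becoming less varied. Real language models are far more complex, so this conveys the intuition only, not the actual behavior of any production system.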
