TechCrunch News, January 9
Elon Musk agrees that we’ve exhausted AI training data

 

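AI experts broadly agree that the real-world data available for training AI models is largely exhausted. Musk says the cumulative sum of human knowledge has essentially been used up in AI training, and he predicts that AI will turn to synthetic data for self-learning. Several tech giants, including Microsoft and Meta, already use synthetic data to train their AI models. Synthetic data can lower training costs and speed up model development. It also carries risks: it may make models less "creative," introduce bias, and eventually compromise their functionality. How to capture the benefits of synthetic data while avoiding its risks remains a pressing problem for the AI field.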

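⚠️ AI training faces data exhaustion: Musk and other experts say AI model training has essentially used up the sum of human knowledge, and real-world data is running out.

🚀 Synthetic data is the new trend: AI models will shift to synthetic data for self-learning, and several tech giants already train models on it.

💰 Synthetic data has clear cost advantages: it can significantly cut training costs. For example, Writer says its Palmyra X 004 model cost just $700,000 to develop.

⚠️ Synthetic data carries risks too: it may make models less "creative," introduce bias, and even damage model functionality, so it should be used with care.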

Elon Musk concurs with other AI experts that there’s little real-world data left to train AI models on.

“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk said during a conversation with Stagwell chairman Mark Penn that was live-streamed on X late Wednesday. “That happened basically last year.”

Musk, who owns AI company xAI, echoed themes former OpenAI chief scientist Ilya Sutskever touched on at NeurIPS, the machine learning conference, during an address in December. Sutskever, who said the AI industry had reached what he called “peak data,” predicted a lack of training data will force a shift away from the way models are developed today.

Indeed, Musk suggested that synthetic data — data generated by AI models themselves — is the path forward. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning,” he said.

Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates 60% of the data used for AI and analytics projects in 2024 were synthetically generated.

Microsoft’s Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google’s Gemma models. Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet. And Meta fine-tuned its most recent Llama series of models using AI-generated data.

Training on synthetic data has other advantages, like cost savings. AI startup Writer claims its Palmyra X 004 model, which was developed using almost entirely synthetic sources, cost just $700,000 to develop, compared to estimates of $4.6 million for a comparably sized OpenAI model.

But there are disadvantages as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Because models create synthetic data, if the data used to train these models has biases and limitations, their outputs will be similarly tainted.
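
The collapse dynamic can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions and is not taken from the research mentioned above: the "model" is just a normal distribution fitted to its training data, each generation trains only on samples produced by the previous generation, and the function name `simulate_collapse`, the sample size, and the generation count are all hypothetical choices made for illustration. Real language models are far more complex, so this only shows the general mechanism of diversity shrinking when a model learns from its own outputs.

```python
import random
import statistics


def simulate_collapse(n_samples=10, generations=200, seed=0):
    """Return the spread (standard deviation) of the data at each generation.

    Toy setup (an assumption for illustration): every "model" is a normal
    distribution fitted to the previous generation's data.
    """
    rng = random.Random(seed)
    # Generation 0: "real" data drawn from a standard normal distribution.
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    spreads = [statistics.pstdev(data)]
    for _ in range(generations):
        # "Train" a model: estimate mean and spread from the current data.
        mu = statistics.mean(data)
        sigma = statistics.pstdev(data)
        # The next generation sees only samples produced by that model.
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        spreads.append(statistics.pstdev(data))
    return spreads


if __name__ == "__main__":
    spreads = simulate_collapse()
    print(f"spread at generation 0: {spreads[0]:.4f}")
    print(f"spread at generation {len(spreads) - 1}: {spreads[-1]:.4g}")
```

In this setup, each refit from a small sample tends to underestimate the spread slightly, and the errors compound across generations, so the data typically becomes far less varied than the original "real" sample. Larger sample sizes slow the effect but do not remove it in this toy model.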
