The next wave of model improvements will be due to data quality

LessWrong, June 28, 22:35

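This article discusses the importance of data quality for improving AI model performance, presenting it as a key factor alongside algorithms and hardware. It argues that high-quality data, such as DeepResearch analysis reports, is valuable for training AI models. Large companies, with their large user bases and accumulated data, have an advantage in model training that is hard to replicate. In addition, user feedback and task-completion data, such as that collected by OpenAI Operator and OpenAI Codex, give models valuable training signals. The article argues that high-quality real-world feedback data is key to improving AI model performance.

💡 Data quality is a key factor in improving AI model performance. The article treats it as a driver of improvement alongside algorithms and hardware.

📊 A source of high-quality data: DeepResearch analysis reports. GPT5 will almost certainly be trained on DeepResearch requests, so even where no human has written an analysis, the model may still get facts right from past reports.

📈 The importance of user feedback: the feedback data collected by OpenAI Operator and OpenAI Codex gives models valuable training signals and can improve them even without algorithmic improvements.

💪 Competitive advantage: large companies with many users generate large amounts of data, which gives them an advantage in model training that is hard to replicate.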

Published on June 28, 2025 2:34 PM GMT

When talking about why models improve, people frequently focus on algorithmic improvements and hardware improvements. This leaves out improvements in data quality.

DeepResearch provides a huge quantity of high-quality analysis reports. GPT5 will almost certainly be trained on the DeepResearch requests of users who haven't opted out of their data being used for training, at least on those requests where there is no reason, such as bad user feedback, to believe the report is wrong. This means that when users ask GPT5 questions for which no human has written an analysis, GPT5 might still get facts right because of past DeepResearch reports.

It also means that the big companies, whose massive user bases produce a massive number of DeepResearch requests, will have a leg up that is hard to copy for smaller players who don't have as many DeepResearch reports.

OpenAI Operator asks the user after every task whether or not it completed the task successfully. This is very valuable training data that will improve OpenAI Operator even without additional algorithmic improvements.

OpenAI Codex currently has a mode in which the user can let Codex run a task four times in parallel. Whether the user then picks one of the results for a PR, or picks none of them, is a very valuable training signal. Even for accepted PRs, it's possible to gain more information about the quality of the PR by looking at whether the new code is long-lived or is soon deleted for being low quality.
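To make the idea of a "training signal" concrete, here is a minimal sketch of how a user's pick among parallel results could be turned into preference pairs. It is purely illustrative: the names `ParallelRun`, `preference_pairs` and `all_rejected` are hypothetical, and nothing here reflects how OpenAI actually processes Codex data.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ParallelRun:
    """One task that was attempted several times in parallel (hypothetical schema)."""
    task: str
    candidates: List[str]   # the parallel results for this task
    chosen: Optional[int]   # index of the result picked for a PR, or None if none was picked


def preference_pairs(run: ParallelRun) -> List[Tuple[str, str, str]]:
    """Return (task, preferred, rejected) triples implied by the user's pick."""
    if run.chosen is None:
        # No pick means there is no "preferred" result to pair against.
        return []
    preferred = run.candidates[run.chosen]
    return [
        (run.task, preferred, other)
        for i, other in enumerate(run.candidates)
        if i != run.chosen
    ]


def all_rejected(run: ParallelRun) -> List[Tuple[str, str]]:
    """Return (task, rejected) pairs when the user picked none of the results."""
    if run.chosen is not None:
        return []
    return [(run.task, candidate) for candidate in run.candidates]


run = ParallelRun(task="fix flaky test", candidates=["a", "b", "c", "d"], chosen=2)
pairs = preference_pairs(run)
```

Under these assumptions, one pick among four results yields three preference pairs, and a run where nothing is picked still yields four negative examples. Whether accepted code later survives in the repository would be a further label on top of this.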

Present models are mostly not trained on real-world feedback data of the quality that OpenAI Operator and OpenAI Codex are now able to provide.

Of course, Google and Anthropic have their own versions of these features, which will provide them with data as well.
