MarkTechPost@AI, February 7
Prime Intellect Releases SYNTHETIC-1: An Open-Source Dataset Consisting of 1.4M Curated Tasks Spanning Math, Coding, Software Engineering, STEM, and Synthetic Code Understanding

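Prime Intellect has released SYNTHETIC-1, an open-source dataset of 1.4 million structured tasks and verifiers intended to improve the reasoning ability of AI models. The dataset spans mathematics, coding, and science, and addresses the shortcomings of existing resources for training models on complex reasoning tasks by providing well-organized, reliable data. SYNTHETIC-1 includes several task types: high-school math problems with symbolic verifiers, coding problems with unit tests, open-ended STEM questions with LLM evaluation, and real-world software engineering tasks. Together they give researchers and developers a structured foundation for improving AI's structured problem-solving.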

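🧮 SYNTHETIC-1 contains 777,000 high-school competition-level math problems with symbolic verifiers. The problems were filtered with an LLM to remove unverifiable ones, and multiple-choice questions were rewritten into direct-answer form.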

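💻 The dataset also contains 144,000 coding problems with unit tests, drawn from datasets such as Apps, Codecontests, Codeforces, and TACO. It initially contained only Python problems and was later expanded to include JavaScript, Rust, and C++, adding variety and depth to the challenges.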

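🧪 SYNTHETIC-1 also contains 313,000 open-ended STEM questions from the StackExchange dataset, covering a wide range of technical and scientific topics. Selection favored questions that require reasoning rather than simple information retrieval, and an LLM evaluates how closely answers align with the top-voted community answers.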

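⚙️ The dataset contains 70,000 real-world software engineering tasks drawn from GitHub commits in the CommitPack dataset, each involving modifying a code file according to a commit instruction. An LLM evaluates a solution by comparing it with the actual post-commit state of the code.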
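To make the idea of a verifier for direct-answer math problems more concrete, here is a minimal sketch. It is not the verifier SYNTHETIC-1 ships: it only handles answers that are exact rational numbers, and the function names `parse_number` and `answers_match` are hypothetical. A real symbolic verifier would also need to handle algebraic expressions, which this stand-in does not attempt.

```python
from fractions import Fraction


def parse_number(text: str) -> Fraction:
    """Parse an integer, decimal, or a/b string into an exact Fraction.

    Assumption: answers are plain rational numbers, not expressions.
    """
    cleaned = text.strip().replace(" ", "")
    return Fraction(cleaned)


def answers_match(model_answer: str, reference: str) -> bool:
    """Return True when both strings denote the same rational number."""
    try:
        return parse_number(model_answer) == parse_number(reference)
    except (ValueError, ZeroDivisionError):
        # Unparseable or ill-formed answers count as incorrect.
        return False
```

Comparing exact values rather than strings means that "0.5", "1/2", and "2/4" are all accepted as the same answer, while a rounded "0.33" is rejected against "1/3". This is the kind of equivalence check that makes direct-answer problems easier to verify than multiple-choice or free-text ones.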

In artificial intelligence and machine learning, high-quality datasets play a crucial role in developing accurate and reliable models. However, collecting extensive, verified data—particularly in specialized domains like mathematics, coding, and science—remains a challenge. Traditional data-gathering methods often fail to produce datasets that effectively train models for complex reasoning tasks. This gap highlights the need for new approaches to dataset creation and verification.

Prime Intellect has introduced SYNTHETIC-1, an open-source dataset designed to provide verified reasoning traces in math, coding, and science. Built with the support of DeepSeek-R1, this dataset consists of 1.4 million structured tasks and verifiers. The objective of SYNTHETIC-1 is to improve reasoning models by supplying them with well-organized, reliable data, addressing the shortcomings of existing resources.

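SYNTHETIC-1 includes a range of task types, each designed to ensure quality and relevance: math problems with symbolic verifiers, coding problems with unit tests, open-ended STEM questions judged by an LLM, and real-world software engineering tasks.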

The structured nature of SYNTHETIC-1 makes it a valuable resource for training models in structured reasoning. By including programmatically verifiable problems, such as coding tasks with unit tests, the dataset ensures clear correctness criteria. Additionally, open-ended reasoning questions verified by LLM judges provide challenges that push the limits of current AI capabilities. The dataset’s collaborative framework also allows for continuous improvement and expansion, fostering a shared effort to refine AI training resources.
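As an illustration of what "programmatically verifiable" can mean for a coding task, the sketch below runs a candidate Python solution against input/output test cases. It is an assumption about how such a check might look, not the harness used for SYNTHETIC-1: the function name `passes_unit_tests`, the stdin/stdout test format, and the timeout value are all hypothetical, and the code does no sandboxing, which any real harness running untrusted code would need.

```python
import subprocess
import sys


def passes_unit_tests(solution_code: str,
                      test_cases: list[tuple[str, str]],
                      timeout_s: float = 5.0) -> bool:
    """Run the candidate program once per test case and compare stdout.

    Each test case is a (stdin_text, expected_stdout) pair. This format
    is an assumption made for the sketch.
    """
    for stdin_text, expected_stdout in test_cases:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A solution that hangs fails the task.
            return False
        if result.returncode != 0:
            # Crashes and non-zero exits fail the task.
            return False
        if result.stdout.strip() != expected_stdout.strip():
            return False
    return True
```

For example, calling `passes_unit_tests` with a solution that reads two integers and prints their sum, together with cases such as `("1 2\n", "3")`, yields a pass/fail signal with no human judgment involved. That binary outcome is the clear correctness criterion the paragraph above refers to.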

SYNTHETIC-1 represents a step forward in creating high-quality datasets for reasoning-based AI models. By addressing gaps in existing datasets, it provides a structured foundation for improving machine reasoning in math, coding, and science. The project also encourages ongoing contributions, making it an evolving resource for researchers and developers working to advance AI’s capabilities in structured problem-solving.


Check out the Details and Dataset on Hugging Face. All credit for this research goes to the researchers of this project.


The post Prime Intellect Releases SYNTHETIC-1: An Open-Source Dataset Consisting of 1.4M Curated Tasks Spanning Math, Coding, Software Engineering, STEM, and Synthetic Code Understanding appeared first on MarkTechPost.
