MarkTechPost@AI · the day before yesterday, 11:50
OpenThoughts: A Scalable Supervised Fine-Tuning (SFT) Data Curation Pipeline for Reasoning Models

The OpenThoughts project proposes a new, scalable supervised fine-tuning (SFT) data curation recipe aimed at improving the performance of reasoning models. Developed by researchers from Stanford University, the University of Washington, and other institutions, it produced the OpenThoughts3-1.2M dataset over three iterations and achieved notable results in mathematics, coding, and science. Using systematic experimentation, the team explored key stages including question sourcing, mixing, filtering, answer filtering, and teacher model choice, ultimately building the OpenThinker3-7B model, which performs strongly among open-data models at the 7B scale. The article also discusses future research directions, including cross-domain transfer learning and the scaling dynamics as student models approach teacher capabilities.

🧠 The OpenThoughts project built its datasets over three iterations: OpenThoughts-114K used automated verification, OpenThoughts2-1M increased the data scale, and OpenThoughts3-1.2M optimized the data curation pipeline through more than 1,000 ablation experiments.

💡 The team ran in-depth experiments on question sourcing, mixing, filtering, and other pipeline stages. The experiments show that CodeGolf and competitive programming questions perform best on coding tasks, LLM-generated and human-written questions excel in mathematics, and physics StackExchange questions combined with chemistry textbook extractions work best in science.

👨‍🏫 For teacher model selection, QwQ-32B outperforms DeepSeek-R1 for knowledge distillation, delivering a 1.9%-2.6% accuracy improvement.

🔬 The OpenThoughts3-1.2M dataset was evaluated on eight reasoning benchmarks spanning mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench), with a rigorous decontamination process and a held-out benchmark set to ensure reliable results.

The Growing Complexity of Reasoning Data Curation

Recent reasoning models, such as DeepSeek-R1 and o3, have shown outstanding performance in mathematical, coding, and scientific areas by utilizing post-training techniques like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the complete methodologies behind these frontier reasoning models are not public, which makes it difficult to conduct research on building such models. While SFT data curation has become a powerful approach for developing strong reasoning capabilities, most existing efforts explore only limited design choices, such as relying solely on human-written questions or single teacher models. Moreover, exploring the extensive design space of techniques for generating question-answer pairs incurs high costs for teacher inference and model training.

Reasoning traces provided by models such as Gemini, QwQ, and DeepSeek-R1 have enabled knowledge distillation techniques to train smaller reasoning models. Projects like OpenR1, OpenMathReasoning, and OpenCodeReasoning collect questions from public forums and competition sites, while Natural Reasoning utilizes pre-training corpora as seed data. Some efforts, such as S1 and LIMO, focus on manually curating small, high-quality datasets of challenging prompts. Other methods, such as DeepMath-103K and Nvidia Nemotron, introduce innovations across data sourcing, filtering, and scaling stages. RL methods, including AceReason and Skywork-OR1, have enhanced reasoning capabilities beyond traditional SFT methods.

OpenThoughts: A Scalable Framework for SFT Dataset Development

Researchers from Stanford University, the University of Washington, BespokeLabs.ai, Toyota Research Institute, UC Berkeley, and 12 additional organizations have proposed OpenThoughts, a new SOTA open reasoning data recipe. OpenThoughts uses a progressive approach across three iterations: OpenThoughts-114K scales the Sky-T1 pipeline with automated verification, OpenThoughts2-1M enhances data scale through augmented question diversity and synthetic generation strategies, and OpenThoughts3-1.2M incorporates findings from over 1,000 ablation experiments to develop a simple, scalable, and high-performing data curation pipeline. Moreover, the model OpenThinker3-7B achieves state-of-the-art performance among open-data models at the 7B scale.
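
As a rough illustration of what such a data recipe involves, the sketch below strings together the stages the recipe names: question sourcing, question filtering, teacher answer generation, and answer filtering. All function names, data structures, and filter criteria here are hypothetical placeholders, not the project's actual code.

```python
# Minimal sketch of an SFT data-curation pipeline in the spirit of OpenThoughts.
# Every function name and filter rule is an illustrative placeholder.
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    answer: str | None = None
    source: str = "unknown"

def source_questions(sources: dict[str, list[str]]) -> list[Sample]:
    """Collect candidate questions from several named sources (forums, textbooks, LLM-generated)."""
    return [Sample(question=q, source=name) for name, qs in sources.items() for q in qs]

def filter_questions(samples: list[Sample], min_len: int = 20) -> list[Sample]:
    """Drop trivially short questions; real pipelines apply difficulty and quality filters."""
    return [s for s in samples if len(s.question.strip()) >= min_len]

def annotate_with_teacher(samples: list[Sample], teacher_generate) -> list[Sample]:
    """Ask a teacher reasoning model (e.g., QwQ-32B) for a reasoning trace plus final answer."""
    return [Sample(s.question, teacher_generate(s.question), s.source) for s in samples]

def filter_answers(samples: list[Sample], verify) -> list[Sample]:
    """Keep only answers that pass a verifier (exact-match grading, unit tests, etc.)."""
    return [s for s in samples if s.answer and verify(s)]

def build_sft_dataset(sources, teacher_generate, verify) -> list[dict]:
    """Chain the stages and emit chat-style records ready for supervised fine-tuning."""
    samples = filter_answers(
        annotate_with_teacher(filter_questions(source_questions(sources)), teacher_generate),
        verify,
    )
    return [{"messages": [{"role": "user", "content": s.question},
                          {"role": "assistant", "content": s.answer}]} for s in samples]
```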

The OpenThoughts3-1.2M dataset is built by ablating each pipeline component independently while holding all other stages constant, generating 31,600 data points per strategy and fine-tuning Qwen2.5-7B-Instruct on each resulting dataset. The goal is to identify the best-performing set of question-response pairs for SFT reasoning. Evaluation covers eight reasoning benchmarks spanning mathematics (AIME24, AMC23, MATH500), coding (CodeElo, CodeForces, LiveCodeBench), and science (GPQA Diamond, JEEBench). The experimental design includes a rigorous decontamination process to remove samples with high similarity to benchmark questions and maintains a held-out benchmark set for generalization testing. Evalchemy serves as the primary evaluation tool, ensuring consistent evaluation protocols.
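
The decontamination step is essentially a similarity filter between candidate training questions and benchmark questions. A minimal n-gram-overlap version might look like the following; the n-gram size and overlap threshold are illustrative assumptions rather than the paper's exact procedure.

```python
# Illustrative decontamination filter: drop training questions that overlap heavily
# with any benchmark question. The n-gram size and threshold are assumptions.
import re

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Lowercase, tokenize on word characters, and collect all n-grams as a set."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, benchmark_questions: list[str],
                    n: int = 8, threshold: float = 0.5) -> bool:
    """True if the question shares a large fraction of its n-grams with any benchmark item."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return False
    return any(
        len(q_grams & ngrams(bench, n)) / len(q_grams) >= threshold
        for bench in benchmark_questions
    )

def decontaminate(train_questions: list[str], benchmark_questions: list[str]) -> list[str]:
    """Keep only training questions that do not match any benchmark question."""
    return [q for q in train_questions if not is_contaminated(q, benchmark_questions)]
```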

Evaluation Insights and Benchmark Performance

The OpenThoughts pipeline evaluation reveals key insights across question sourcing, mixing, filtering, answer filtering, and teacher model choice. Question-sourcing experiments show that CodeGolf and competitive programming questions achieve the highest performance on code tasks (average scores of 25.3-27.5), LLM-generated and human-written questions excel in mathematics (scores of 58.5-58.8), and physics StackExchange questions combined with chemistry textbook extractions perform best in science (scores of 43.2-45.3). Question-mixing experiments show that combining many question sources degrades performance, with the best configurations achieving roughly 5% accuracy improvements over diverse mixing strategies. For the teacher model, QwQ-32B outperforms DeepSeek-R1 in knowledge distillation, achieving an accuracy improvement of 1.9-2.6%.
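
To make the teacher-distillation step concrete, the snippet below sketches how reasoning traces could be collected from a teacher such as QwQ-32B through an OpenAI-compatible endpoint (for example, one served locally by vLLM). The endpoint URL, model identifier, and sampling settings are assumptions for illustration, not the project's configuration.

```python
# Sketch of collecting teacher reasoning traces for distillation via an
# OpenAI-compatible server (e.g., vLLM). The base_url, model name, and
# sampling parameters are illustrative assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # hypothetical local server

def distill_trace(question: str, model: str = "Qwen/QwQ-32B") -> str:
    """Return the teacher's reasoning trace and final answer for one question."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0.7,
        max_tokens=8192,
    )
    return response.choices[0].message.content

# Usage: pair each curated question with a teacher trace to form one SFT record.
question = "What is the sum of the first 100 positive integers?"
sft_record = {
    "messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": distill_trace(question)},
    ]
}
```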

In conclusion, the researchers present the OpenThoughts project, showing that systematic experimentation can significantly advance SFT data curation for reasoning models. They developed OpenThoughts3-1.2M, a state-of-the-art open-data reasoning dataset spanning science, mathematics, and coding, and the resulting OpenThinker3-7B model achieves superior performance among open-data reasoning models at its scale. However, several directions remain unexplored, including RL approaches, staged fine-tuning, and curriculum learning strategies. Future research includes investigating cross-domain transfer effects when optimizing individual domains versus overall performance, and understanding the scaling dynamics as student models approach teacher capabilities.


Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project.


