MarkTechPost@AI, January 31
Open Thoughts: An Open Source Initiative Advancing AI Reasoning with High-Quality Datasets and Models Like OpenThoughts-114k and OpenThinker-7B

The Open Thoughts project aims to address the scarcity of high-quality reasoning datasets by open-sourcing the OpenThoughts-114k dataset and the OpenThinker-7B model, advancing AI reasoning capabilities. Led by Bespoke Labs and the DataComp community from Stanford and other universities, the project is committed to providing public, high-quality reasoning datasets and data-generation strategies. The OpenThoughts-114k dataset contains 114,000 reasoning examples spanning challenges such as mathematical problem-solving and logical reasoning. The OpenThinker-7B model, trained on OpenThoughts-114k, outperforms several comparable models and provides a strong tool for open-source AI reasoning.

🌟 The Open Thoughts project addresses the limited availability of high-quality reasoning datasets by open-sourcing datasets and models to advance the field of AI reasoning.

📚 The OpenThoughts-114k dataset scales up earlier efforts to 114,000 high-quality reasoning examples spanning mathematical and logical challenges, generated with DeepSeek-R1-style reasoning distillation.

🧠 The OpenThinker-7B model, fine-tuned from Qwen-2.5-7B-Instruct, performs strongly across multiple reasoning tasks, even surpassing proprietary models such as GPT-4o, and offers a new open-source alternative.

🔓 The Open Thoughts project is fully open-source, including model weights, datasets, and code, ensuring transparency and reproducibility and inviting broader participation in AI reasoning research.

Restricted access to high-quality reasoning datasets has limited open-source progress in AI-driven logical and mathematical reasoning. While proprietary models have leveraged structured reasoning demonstrations to enhance performance, those datasets and methodologies remain closed, restricting independent research and innovation. The lack of open, scalable reasoning datasets has created a bottleneck for AI development.

Over recent years, models such as SkyT1, STILL-2, and DeepSeek-R1 have demonstrated that a relatively small set of high-quality reasoning demonstrations, on the order of hundreds of thousands of examples or fewer, can substantially enhance a model's ability to perform complex logical and mathematical reasoning tasks. Still, most reasoning datasets and the methodologies behind their creation remain proprietary, limiting access to resources necessary for further exploration in the field.

The Open Thoughts initiative, led by Bespoke Labs and the DataComp community from Stanford, UC Berkeley, UT Austin, UW, UCLA, UNC, TRI, and LAION, is an ambitious open-source project that curates and develops high-quality reasoning datasets to address this scarcity. The project seeks to build the best open reasoning datasets for enhancing language models' cognitive capabilities, providing publicly available, state-of-the-art reasoning data and data-generation strategies. To this end, the team has released the OpenThoughts-114k reasoning dataset and the associated OpenThinker-7B model. Let's look at each in turn.

The OpenThoughts-114k Dataset: A New Standard in Open Reasoning Data

This dataset was designed to provide a large-scale, high-quality corpus of reasoning demonstrations to improve language models’ reasoning abilities. OpenThoughts-114k is an extension of previous datasets like Bespoke-Stratos-17k, which only contained 17,000 examples. By scaling up to 114,000 reasoning examples, this dataset has improved performance on various reasoning benchmarks. OpenThoughts-114k was generated using reasoning distillation techniques inspired by DeepSeek-R1, which showed that synthetic reasoning demonstrations could be produced efficiently and at scale. This dataset incorporates diverse reasoning challenges, ranging from mathematical problem-solving to logical deduction, thereby serving as a valuable resource for improving model robustness across multiple reasoning domains.
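To make the distillation idea concrete, here is a minimal sketch of what one distilled reasoning record might look like, together with the kind of cheap answer-consistency filter such pipelines commonly apply. The field names (`problem`, `reasoning`, `answer`) and the filter are illustrative assumptions, not the actual OpenThoughts-114k schema:

```python
# One hypothetical distilled reasoning record: a problem, the teacher
# model's chain of thought, and the extracted final answer.
example = {
    "problem": "What is 17 * 23?",
    "reasoning": "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51 = 391.",
    "answer": "391",
}

def keep(record: dict) -> bool:
    """Keep only traces whose chain of thought actually states the
    final answer -- a simple consistency check used to filter out
    incoherent synthetic demonstrations."""
    return record["answer"] in record["reasoning"]

print(keep(example))  # True
```

Filters like this are one reason distillation scales: low-quality traces can be discarded automatically rather than by hand.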

OpenThinker-7B: A Model for Advanced Reasoning

Alongside the release of OpenThoughts-114k, the Open Thoughts team also introduced OpenThinker-7B, a fine-tuned version of Qwen-2.5-7B-Instruct. The model was trained specifically on OpenThoughts-114k and substantially improves on its predecessors. Training took roughly 20 hours on four 8xH100 nodes, using the Transformers 4.46.1 library and PyTorch 2.3.0 to ensure compatibility with widely used ML frameworks.
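Supervised fine-tuning of this kind serializes each reasoning record into a single training string under a chat template. The sketch below shows one plausible serialization; the special tokens and the `<think>` wrapper are illustrative assumptions, not the actual OpenThinker-7B template:

```python
# Hypothetical serialization of one reasoning record into a supervised
# fine-tuning string: user turn carries the problem, assistant turn
# carries the chain of thought followed by the final answer.
def to_sft_text(problem: str, reasoning: str, answer: str) -> str:
    return (
        f"<|user|>\n{problem}\n"
        f"<|assistant|>\n<think>\n{reasoning}\n</think>\n{answer}"
    )

text = to_sft_text("What is 17 * 23?", "17 * 20 + 17 * 3 = 391.", "391")
print(text)
```

Exposing the chain of thought inside the assistant turn, rather than only the final answer, is what lets the student model learn the reasoning behavior rather than just input-output pairs.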

In some reasoning tasks, OpenThinker-7B outperforms comparable models such as Bespoke-Stratos-7B, DeepSeek-R1-Distill-Qwen-7B, and even GPT-4o. Benchmarked using Evalchemy, it demonstrated impressive results on datasets such as AIME24: 43.3%, MATH500: 83.0%, GPQA-D: 42.4%, LCB Easy: 75.3%, and LCB Medium: 28.6%. These results indicate that OpenThinker-7B is a formidable open-source alternative to proprietary reasoning models.
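For a single headline number, the benchmark scores reported above can be averaged with a short script (an unweighted mean across the five benchmarks, which is our own aggregation, not one reported by the project):

```python
# Per-benchmark scores for OpenThinker-7B as reported above (percent).
scores = {
    "AIME24": 43.3,
    "MATH500": 83.0,
    "GPQA-D": 42.4,
    "LCB Easy": 75.3,
    "LCB Medium": 28.6,
}

# Unweighted mean across benchmarks.
average = sum(scores.values()) / len(scores)
print(f"{average:.1f}")  # prints 54.5
```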

Fully Open-Source: Weights, Data, and Code

A defining feature of the Open Thoughts project is its commitment to full transparency. Unlike proprietary models such as GPT-4o and o1-mini, which keep their datasets and training methodologies closed, OpenThinker-7B and OpenThoughts-114k are entirely open-source. This means:

- Open Model Weights: The OpenThinker-7B model weights are publicly accessible, allowing researchers and developers to fine-tune and build upon the model.
- Open Data: The OpenThoughts-114k dataset is freely available for anyone to use, modify, and expand.
- Open Code: The data generation, evaluation, and training code for OpenThinker-7B are all hosted on GitHub, ensuring complete transparency and reproducibility.

The Open Thoughts project is only in its early stages, with plans for further expansion of its datasets and models.

In conclusion, Open Thoughts represents a transformative effort to democratize AI reasoning. By launching OpenThoughts-114k and OpenThinker-7B as open-source resources, the project empowers the AI community with high-quality data and models to advance reasoning research. With continued collaboration and expansion, Open Thoughts has the potential to redefine how AI approaches logical, mathematical, and cognitive reasoning tasks.

