MarkTechPost@AI April 17
Model Performance Begins with Data: Researchers from Ai2 Release DataDecide—A Benchmark Suite to Understand Pretraining Data Impact Across 30K LLM Checkpoints

This article introduces DataDecide, a comprehensive pretraining experiment suite released by the Allen Institute for AI (AI2) together with the University of Washington and the University of Pennsylvania. The suite spans 25 distinct corpora and 14 model sizes ranging from 4 million to 1 billion parameters, and targets the challenge of data selection in large language model (LLM) pretraining. DataDecide provides more than 1,050 models and over 30,000 checkpoints, evaluated on ten downstream tasks, giving researchers a valuable resource for optimizing pretraining data selection and improving model performance.

💡 DataDecide is a comprehensive pretraining experiment suite covering 25 datasets and 14 model sizes, built to tackle the data-selection problem in LLM pretraining.

📚 The suite draws on diverse data sources such as Dolma, DCLM, and RefinedWeb, combined with domain ablation, deduplication, and quality-filtering strategies to probe how different data-processing choices affect models.

📊 DataDecide trains every model at a fixed token-to-parameter ratio and evaluates on 10 downstream tasks, letting researchers reuse checkpoints, explore new prediction methods, and study how sensitive benchmarks are to training data and model scale.

✅ The results show that small-scale evaluation (e.g., 150M parameters) can effectively predict the best dataset for large models (1B parameters), and that continuous likelihood metrics such as CORRECT PROB perform better on code tasks.

The Challenge of Data Selection in LLM Pretraining

Developing large language models entails substantial computational investment, especially when experimenting with alternative pretraining corpora. Comparing datasets at full scale (on the order of billions of parameters and hundreds of billions of tokens) can consume hundreds of thousands of GPU hours per run. Consequently, practitioners resort to smaller-scale experiments as proxies for large-model behavior. Yet these "pilot" studies are rarely published, producing a fragmented landscape in which each laboratory repeats similar small-scale tests without shared benchmarks or methodologies. This opacity impedes reproducibility, underutilizes collective insights, and obscures the true trade-offs between development compute and final model performance.

DataDecide

To address these limitations, the Allen Institute for AI (AI2), in collaboration with the University of Washington and the University of Pennsylvania, today releases DataDecide—a comprehensive suite of controlled pretraining experiments spanning 25 distinct corpora and 14 model sizes from 4 million to 1 billion parameters. DataDecide’s datasets include well‑known sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, alongside variations produced by domain ablation, deduplication, quality filtering, and source mixing. Each model is trained at a fixed token‑to‑parameter ratio of 100 (100 tokens per parameter), reflecting the “overtraining” regime that optimizes inference efficiency. In total, over 1,050 models and more than 30,000 checkpoints—each evaluated across ten downstream tasks—are released to the public.
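To make the training-budget arithmetic concrete, here is a minimal sketch of the token budget implied by the fixed 100-tokens-per-parameter ratio. Only the 4M and 1B endpoints are stated above (the 150M entry is taken from the findings discussed later); the full 14-size ladder is not spelled out in this article, so the intermediate sizes are assumptions.

```python
# Token budgets implied by DataDecide's fixed token-to-parameter ratio.
# Only the 4M and 1B endpoints are given in the article; other sizes
# listed here are illustrative, not the official 14-size ladder.
TOKENS_PER_PARAM = 100  # the "overtraining" regime described above

model_sizes = {
    "4M": 4_000_000,
    "150M": 150_000_000,   # the small proxy scale cited in the findings
    "1B": 1_000_000_000,
}

for name, params in model_sizes.items():
    tokens = params * TOKENS_PER_PARAM
    print(f"{name:>5}: {params:>13,} params -> {tokens:>15,} training tokens")
```

At this ratio a 1B-parameter model sees 100 billion training tokens, which illustrates why comparing many candidate corpora at full scale is so costly.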

Technical Structure and Pragmatic Benefits

DataDecide orchestrates experiments along three axes:

- Data recipes: 25 pretraining corpora, drawn from sources such as Dolma, DCLM, RefinedWeb, C4, and FineWeb, plus variants produced by domain ablation, deduplication, quality filtering, and source mixing.
- Model scale: 14 sizes from 4 million to 1 billion parameters, each trained at the fixed ratio of 100 tokens per parameter.
- Evaluation: ten downstream tasks, scored across the more than 30,000 released checkpoints.

By releasing both pretraining datasets and corresponding models, DataDecide enables researchers to:

- reuse the 30,000+ checkpoints instead of retraining from scratch;
- explore new methods for predicting large-scale performance from small-scale runs;
- study how sensitive individual benchmarks are to training data and model scale.
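Because the checkpoints are released on Hugging Face, reusing them can be as simple as a standard transformers load. The repository id below is a hypothetical placeholder (the article does not state the exact naming scheme), so consult the actual release listing:

```python
# Minimal sketch of pulling a released DataDecide checkpoint from Hugging Face.
# The repository id below is an assumed placeholder, not a verified name.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "allenai/DataDecide-dclm-1B"  # hypothetical; check the AI2 release page

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Pretraining data selection matters because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```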

Key Findings and Quantitative Insights

DataDecide's systematic analysis yields practical guidelines, including:

- Small-scale rankings are predictive: evaluating candidate corpora at roughly 150 million parameters is enough to identify the best-performing dataset at the 1 billion-parameter scale.
- Metric choice matters: continuous likelihood metrics such as CORRECT PROB are better proxies than discrete accuracy on code tasks.
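As a sketch of how the first guideline can be applied, the snippet below ranks candidate data recipes by a small-scale score and checks agreement with the large-scale ranking. The scores are invented for illustration; real values would come from DataDecide's released evaluation results.

```python
# Sketch: does a small-scale (150M) ranking of data recipes pick the same
# winner as the large-scale (1B) ranking? Scores below are invented.
from scipy.stats import spearmanr

scores_150m = {"dolma": 0.41, "dclm": 0.47, "refinedweb": 0.44, "c4": 0.39}
scores_1b   = {"dolma": 0.52, "dclm": 0.58, "refinedweb": 0.55, "c4": 0.49}

recipes = sorted(scores_150m)  # fixed ordering for a paired comparison
rho, _ = spearmanr([scores_150m[r] for r in recipes],
                   [scores_1b[r] for r in recipes])

best_small = max(scores_150m, key=scores_150m.get)
best_large = max(scores_1b, key=scores_1b.get)
print(f"rank correlation: {rho:.2f}; small-scale winner {best_small!r} "
      f"{'matches' if best_small == best_large else 'differs from'} "
      f"large-scale winner {best_large!r}")
```

The second guideline can be made concrete the same way. A minimal sketch of a CORRECT PROB-style continuous proxy versus discrete accuracy follows, assuming CORRECT PROB is read as the mean probability assigned to the correct option (an interpretation for illustration, not a definition quoted from the paper):

```python
# Each example: probabilities the model assigns to each answer option,
# with the correct option listed first. Values invented for illustration.
examples = [
    [0.31, 0.25, 0.24, 0.20],
    [0.28, 0.30, 0.22, 0.20],
    [0.35, 0.25, 0.20, 0.20],
]

correct_prob = sum(e[0] for e in examples) / len(examples)        # continuous signal
accuracy = sum(e[0] == max(e) for e in examples) / len(examples)  # discrete signal

print(f"CORRECT PROB: {correct_prob:.3f}  accuracy: {accuracy:.2f}")
```

One plausible reason the continuous signal helps: at small scales, discrete accuracy on hard tasks can sit near chance and barely move, while the probability mass on correct answers still shifts measurably with data quality.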

Concluding Perspective

DataDecide transforms pretraining data selection from an ad hoc art into a transparent, data-driven science. By open-sourcing all 25 corpora, 1,050 models, 30,000+ checkpoints, and evaluation scripts on Hugging Face and GitHub, AI2 invites the community to reproduce findings, extend evaluations to new benchmarks, and innovate on decision-making methods. As LLM development continues to demand ever-greater compute resources, DataDecide offers a principled framework for minimizing wasted experiments and maximizing insight, paving the way toward more efficient, reproducible, and collaborative AI research.


Check out the Paper, Model on Hugging Face and Technical details.
