MarkTechPost@AI July 22, 2024
Apple AI Released a 7B Open-Source Language Model Trained on 2.5T Tokens on Open Datasets

 

Apple, the University of Washington, and other institutions have released DCLM, a data curation framework for language models, and have open-sourced the DCLM models and datasets. The DCLM framework includes a comprehensive corpus of 240 trillion tokens drawn from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. DCLM gives researchers a standardized approach to dataset curation, enabling consistent and comparable experiments that identify the most effective curation strategies.

🍎 **DCLM framework overview**: Apple, the University of Washington, and other institutions have released DCLM, a data curation framework for language models, aimed at the challenges of training-data quality and curation methodology. The framework includes a 240-trillion-token corpus drawn from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations, providing a standardized approach to dataset curation so researchers can run consistent and comparable experiments.

🤖 **DCLM framework advantages**: DCLM offers a structured workflow in which researchers choose a scale from 412M to 7B parameters and experiment with curation strategies such as deduplication, filtering, and data mixing. Models are trained on the curated datasets with a standardized training recipe and fixed hyperparameters, then evaluated on a range of downstream tasks, giving a clear measure of dataset quality.

📈 **DCLM framework results**: The introduction of DCLM has led to notable improvements in language model training. For example, a baseline dataset built with DCLM was used to train a 7B-parameter language model from scratch that reached 64% 5-shot accuracy on the MMLU benchmark with 2.6 trillion training tokens, a 6.6-percentage-point improvement over MAP-Neo, the previous state-of-the-art open-data language model, while using 40% less compute.

🔍 **DCLM framework outlook**: The framework's effectiveness also shows in its scalability. Using DCLM-Pool, a 240-trillion-token corpus derived from Common Crawl, researchers ran extensive experiments at scales from 400M to over 7B parameters. These experiments highlighted the critical role of model-based filtering in assembling high-quality training sets. The DCLM baseline dataset created through this rigorous process consistently outperformed other open datasets such as RefinedWeb and RedPajama across evaluations.

🧪 **DCLM framework research directions**: The team also explored the impact of various data curation techniques. Comparing text extraction methods such as resiliparse and trafilatura, they found that these approaches significantly improved downstream performance relative to Common Crawl's pre-extracted text. They also examined several model-based quality filtering strategies and concluded that a fastText OH-2.5 + ELI5 classifier was the most effective, delivering a substantial lift in accuracy.

Language models (LMs) have become fundamental in natural language processing (NLP), enabling text generation, translation, and sentiment analysis tasks. These models demand vast amounts of training data to function accurately and efficiently. However, the quality and curation of these datasets are critical to the performance of LMs. This field focuses on refining the data collection and preparation methods to enhance the models’ effectiveness.

A significant challenge in developing effective language models is improving training datasets. High-quality datasets are essential for training models that generalize well across various tasks, but creating such datasets is complex. It involves filtering out irrelevant or harmful content, removing duplicates, and selecting the most useful data sources.

Existing methods for dataset curation typically involve heuristic-based filtering, deduplication, and data sourcing from extensive web crawls. While these methods have had some success, they lack standardized benchmarks, leading to inconsistency in how language model performance is evaluated. This variability makes it difficult to determine the most effective data curation strategies, which hinders progress in the field.

Researchers from Apple, the University of Washington, and many other institutions have introduced DataComp for Language Models (DCLM) to address these issues. They have recently open-sourced the DCLM models and datasets on the Hugging Face platform. The open-source release comprises DCLM-7B, DCLM-1B, dclm-7b-it, DCLM-7B-8k, dclm-baseline-1.0, and dclm-baseline-1.0-parquet. This innovative testbed allows controlled experiments with large datasets to improve language models. The DCLM framework includes a comprehensive corpus of 240 trillion tokens from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. This setup provides a standardized approach to dataset curation, enabling consistent and comparable experiments.
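
For readers who want to try the released checkpoints, here is a minimal loading sketch using Hugging Face's transformers library. The repository id and loader options are assumptions based on the release names above; the actual model card may require extra dependencies (for example, the open_lm package), so check it before running this.

```python
# Hypothetical loading sketch; the repo id below is an assumption based on
# the release names in this article, and the checkpoint may need extra
# dependencies (e.g. open_lm) according to its model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "apple/DCLM-7B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Careful dataset curation matters because"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```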

DCLM offers a structured workflow for researchers. Participants can choose scales ranging from 412M to 7B parameters and experiment with data curation strategies such as deduplication, filtering, and data mixing. Researchers can train models on curated datasets using a standardized training recipe and specific hyperparameters. The performance of these models is then evaluated on a suite of downstream tasks, providing a clear measure of dataset quality. This systematic approach helps identify the most effective data curation strategies.
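
To make that workflow concrete, the toy sketch below walks through the standardized loop: curate a pool of documents, train with a fixed recipe, then score the result downstream. Every function is an invented stand-in for the real DCLM tooling, which operates on trillions of tokens with the OpenLM training stack.

```python
# Toy, self-contained illustration of the standardized DCLM loop:
# curate -> train with a fixed recipe -> evaluate downstream.
# All functions are invented stand-ins, not the actual DCLM API.

def deduplicate(docs):
    """Drop exact duplicates (real pipelines also use fuzzy deduplication)."""
    return list(dict.fromkeys(docs))

def heuristic_filter(docs, min_words=5):
    """Keep documents passing a simple length heuristic."""
    return [d for d in docs if len(d.split()) >= min_words]

def train_with_standard_recipe(docs, scale):
    """Stand-in for OpenLM pretraining at a chosen parameter scale."""
    return {"scale": scale, "tokens_seen": sum(len(d.split()) for d in docs)}

def evaluate_downstream(model):
    """Stand-in for the 53-task evaluation suite; returns a dummy score."""
    return model["tokens_seen"]

raw_pool = [
    "a long informative article about data curation for language models",
    "spam spam",
    "a long informative article about data curation for language models",
]
curated = heuristic_filter(deduplicate(raw_pool))
score = evaluate_downstream(train_with_standard_recipe(curated, scale="412M"))
print(score)  # with the recipe held fixed, only the curation steps vary between runs
```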

The introduction of DCLM has led to notable improvements in language model training. For instance, a baseline dataset created using DCLM enabled the training of a 7B parameter language model from scratch. This model achieved a 64% 5-shot accuracy on the MMLU benchmark with 2.6 trillion training tokens. This performance represents a 6.6 percentage point improvement over the previous state-of-the-art open-data language model, MAP-Neo, while using 40% less compute. The DCLM baseline model also performed comparably to Mistral-7B-v0.3 and Llama 3 8B, which required significantly more computational resources.
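
As a point of reference for what "5-shot accuracy" means here, the toy snippet below builds the kind of prompt used in such an evaluation: five solved examples precede the test question, and the model must produce the correct choice letter. The items are invented placeholders, not actual MMLU data.

```python
# Toy illustration of a 5-shot prompt: five solved examples followed by the
# test question. The questions below are invented placeholders, not MMLU data.

def build_five_shot_prompt(examples, question, choices):
    parts = []
    for q, opts, answer in examples:
        opts_str = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(opts))
        parts.append(f"Question: {q}\nChoices: {opts_str}\nAnswer: {answer}")
    opts_str = " ".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(choices))
    parts.append(f"Question: {question}\nChoices: {opts_str}\nAnswer:")
    return "\n\n".join(parts)

shots = [(f"Placeholder question {i}?", ["w", "x", "y", "z"], "A") for i in range(5)]
print(build_five_shot_prompt(shots, "Placeholder test question?", ["w", "x", "y", "z"]))
```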

The DCLM framework’s effectiveness is further demonstrated by its scalability. Researchers conducted extensive experiments at different scales, from 400M to over 7B parameters, using DCLM-Pool, a corpus of 240 trillion tokens derived from Common Crawl. These experiments highlighted the critical role of model-based filtering in assembling high-quality training sets. The DCLM baseline dataset, created through this rigorous process, consistently outperformed other open-source datasets like RefinedWeb and RedPajama in various evaluations.

The research team also explored the impact of various data curation techniques. They compared text extraction methods, such as resiliparse and trafilatura, and found that these approaches significantly improved downstream performance compared to Common Crawl’s pre-extracted text. The team investigated several model-based quality filtering strategies, ultimately determining that the fastText OH-2.5 + ELI5 classifier was the most effective, providing a substantial lift in accuracy.
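
The snippet below is a hedged sketch of what such model-based quality filtering can look like with the fastText library: a classifier is trained to separate reference examples in the style of the OH-2.5 and ELI5 sets mentioned above from random web text, and documents are kept according to the classifier's score. File names, label names, and the threshold are illustrative assumptions, not the exact DCLM configuration.

```python
# Hedged sketch of fastText-based quality filtering. The training file,
# labels, and threshold are illustrative assumptions, not the exact DCLM
# setup. The training data pairs high-quality reference text (labeled
# __label__hq) against random web text (labeled __label__lq).
import fasttext

# quality_train.txt: one example per line, e.g.
#   __label__hq <high-quality reference text>
#   __label__lq <random web page text>
classifier = fasttext.train_supervised(input="quality_train.txt", epoch=5)

def quality_score(document: str) -> float:
    """Probability that a document resembles the high-quality reference set."""
    labels, probs = classifier.predict(document.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)

web_docs = [
    "A clear explanation of how tokenizers split text into subword units.",
    "BUY CHEAP PILLS NOW click here click here click here",
]
kept = [d for d in web_docs if quality_score(d) > 0.5]  # assumed threshold
```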

In conclusion, the introduction of DCLM enables researchers to conduct controlled experiments and identify the most effective strategies for improving language models by providing a standardized and systematic approach to dataset curation. The DCLM framework sets a new benchmark for dataset quality and demonstrates the potential for significant performance improvements with reduced computational resources.



Tags: Language Models, Open Source, DCLM, Data Curation, Apple