MarkTechPost@AI — October 9, 2024
LLM360 Group Introduces TxT360: A Top-Quality LLM Pre-Training Dataset with 15T Tokens

LLM360 has released TxT360, a pre-training dataset of 15 trillion tokens. The dataset combines diversity, scale, and rigorous data filtering, applies multiple screening stages, and adds high-quality corpora to advance AI and NLP research; the team also plans to open-source its codebase.

🎯 TxT360 is a groundbreaking pre-training dataset of 15 trillion tokens that blends fresh sources such as FreeLaw, PG-19, scientific papers, and Wikipedia into a rich, nuanced dataset designed to strengthen the capabilities of the next generation of LLMs.

🧹 The creation of TxT360 began with Common Crawl. The team applied rigorous filtering: isolating clean, coherent text from WARC files; removing non-English content, redundant or low-value sources, and repeated material; and dropping documents and lines that failed quality benchmarks, ultimately retaining only 2.35% of the text as high quality.

🔍 LLM360 used exact deduplication (with a Bloom filter) and fuzzy deduplication (with the MinHash algorithm) to ensure that TxT360 contains unique content, avoiding the pitfalls of repetitive learning.

📚 After the filtering process, LLM360 added handpicked, high-quality corpora, including scientific papers, legal documents, classic books, and curated Wikipedia content; each source passed through a tailored pipeline to preserve data integrity and quality.

In the ever-evolving world of large language models (LLMs), pre-training datasets form the backbone of how AI systems comprehend and generate human-like text. LLM360 has recently unveiled TxT360, a groundbreaking pre-training dataset comprising 15 trillion tokens. The release combines diversity, scale, and rigorous data filtering, making it one of the most sophisticated open-source datasets to date.

A Dataset Built on New Foundations

TxT360 differentiates itself from previous datasets by including fresh sources such as FreeLaw (legal corpora), PG-19 (a collection of books), scientific papers, and Wikipedia. By blending these sources, TxT360 presents a richer and more nuanced dataset, designed to bolster the capabilities of the next generation of LLMs.

From Common Crawl to Clean Data

The creation of TxT360 began with Common Crawl, a publicly available web scrape that serves as the foundation for many modern language models. However, simply using raw web data would not achieve the high standards LLM360 aimed for. Instead, the team embarked on a rigorous filtering journey to extract the most useful text from the massive collection of WARC (Web ARChive) files:

- Text Extraction: Clean, coherent text was isolated from noisy web data in WARC files.
- Language Filtering: Non-English content was removed to maintain a consistent dataset.
- URL Filtering: Redundant or low-value sources, including spammy or promotional sites, were filtered out.
- Repetition Removal: Extensive efforts targeted repeated lines, paragraphs, and n-grams.
- Document- and Line-Level Filtering: Heuristics removed documents and lines that did not meet quality benchmarks.

In total, 97.65% of the original data was filtered out, retaining only high-quality, meaningful text to ensure robust and nuanced language models.
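
To make the document-level heuristics concrete, here is a minimal, hypothetical sketch in Python. The thresholds, the fastText language-ID model (lid.176.bin), and the helper names are illustrative assumptions, not LLM360's published settings.

```python
from collections import Counter

import fasttext  # pip install fasttext-wheel; lid.176.bin is from fasttext.cc

# Pretrained fastText language-identification model (an assumed choice).
lang_model = fasttext.load_model("lid.176.bin")


def is_english(text: str, min_confidence: float = 0.65) -> bool:
    """Language filtering: keep documents that fastText labels as English."""
    labels, probs = lang_model.predict(text.replace("\n", " "))
    return labels[0] == "__label__en" and probs[0] >= min_confidence


def repetition_ratio(text: str, n: int = 3) -> float:
    """Repetition removal: fraction of word n-grams that occur more than once."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def keep_document(text: str) -> bool:
    """Document-level filter combining a few common quality heuristics."""
    if not is_english(text):
        return False
    if repetition_ratio(text) > 0.2:   # drop heavily repetitive pages
        return False
    if len(text.split()) < 50:         # drop documents too short to be useful
        return False
    alphabetic = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    return alphabetic > 0.6            # drop mostly-markup or symbol-heavy pages
```

At web scale, filters like these typically run as a streaming map step over billions of documents before deduplication begins.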

Global Deduplication

Building a high-quality dataset like TxT360 required effective deduplication. LLM360 tackled this through two approaches: exact deduplication using a Bloom filter and fuzzy deduplication using a MinHash algorithm. These methods ensured that the dataset contained unique content, avoiding the pitfalls of repetitive learning.
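
As an illustration of how the two passes fit together, here is a minimal Python sketch, assuming a whitespace tokenizer, a hand-rolled Bloom filter, and the open-source datasketch library for MinHash LSH; the Jaccard threshold and signature size are assumptions, not LLM360's published parameters.

```python
import hashlib
import math

from datasketch import MinHash, MinHashLSH  # pip install datasketch


class BloomFilter:
    """Simple Bloom filter for exact-duplicate detection of documents."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing formulas: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2.
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray(self.num_bits // 8 + 1)

    def _positions(self, item: bytes):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(seed.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def deduplicate(docs):
    """Drop exact duplicates via the Bloom filter, then near-duplicates via MinHash LSH."""
    bloom = BloomFilter(capacity=1_000_000)
    lsh = MinHashLSH(threshold=0.8, num_perm=128)  # Jaccard threshold is an assumption
    kept = []
    for i, doc in enumerate(docs):
        fingerprint = hashlib.sha256(doc.encode()).digest()
        if fingerprint in bloom:           # exact duplicate seen before
            continue
        bloom.add(fingerprint)
        signature = MinHash(num_perm=128)
        for token in doc.split():
            signature.update(token.encode())
        if lsh.query(signature):           # near-duplicate of a kept document
            continue
        lsh.insert(f"doc-{i}", signature)
        kept.append(doc)
    return kept
```

Running the cheap exact check first avoids computing MinHash signatures for documents that are byte-for-byte duplicates, which is the bulk of the redundancy in web crawls.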

High-Quality Sources

After the filtering process, LLM360 added handpicked, high-quality corpora, including scientific papers, legal documents, classic books, and curated Wikipedia content. Each of these specialized sources went through tailored pipelines to preserve data integrity and quality, ensuring that the resulting language models can handle a wide range of topics.

TxT360: A New Era for Open-Source AI

The release of TxT360 marks a significant leap forward in AI and NLP research. LLM360’s meticulous construction and filtering demonstrate that quality and quantity can coexist. With 15 trillion tokens, TxT360 supports the development of nuanced, capable, and intelligent language models.

Moreover, LLM360’s transparency about their processes sets a new standard in the field. According to the research group, the upcoming release of their codebase will offer insight into the methodologies that underpin the dataset.


Check out the Details and Dataset. All credit for this research goes to the researchers of this project.


