MarkTechPost@AI · December 22, 2024
Meet FineFineWeb: An Open-Sourced Automatic Classification System for Fine-Grained Web Data

FineFineWeb is a large open-source automatic classification system for fine-grained web data, introduced by researchers from Multimodal Art Projection (M-A-P). The project decomposes the deduplicated FineWeb into 67 unique categories and provides extensive seed data. Together with a correlation analysis between vertical categories and common benchmarks and a detailed analysis of URL and content distributions, the system offers specialized test sets for PPL evaluation. Complete training materials for FastText and BERT implementations are also provided, and data-proportioning recommendations based on the RegMix method are planned. Through a multi-step workflow covering deduplication, URL labeling, coarse recall, and fine recall, the system ultimately produces high-quality fine-grained web data.

🗂️ FineFineWeb decomposes the deduplicated FineWeb dataset into 67 unique categories and provides extensive seed data, easing downstream classification and analysis.

🔍 The system follows a multi-stage data construction process, using GPT-4 for URL labeling, Qwen2-7B-Instruct for coarse recall, and Qwen2-72B-Instruct together with a BERT model for fine recall, ensuring high data quality and accuracy.

📊 A similarity analysis between domains and benchmarks found that code-related benchmarks differ significantly from most domains, while general-knowledge benchmarks relate closely to many domains, indicating a broad knowledge distribution in the dataset.

📈 A cross-domain duplication analysis shows low URL overlap between most domains, indicating that FineFineWeb's data is highly unique. A domain-benchmark correlation analysis further reveals strong correlations between STEM domains and reasoning benchmarks, and between knowledge-intensive domains and fact-based benchmarks.

Multimodal Art Projection (M-A-P) researchers have introduced FineFineWeb, a large open-source automatic classification system for fine-grained web data. The project decomposes the deduplicated FineWeb into 67 unique categories with extensive seed data. The researchers also conducted a comprehensive correlation analysis between vertical categories and common benchmarks, along with a detailed analysis of URL and content distributions. The system provides specialized test sets for PPL evaluation, featuring both “small cup” validation and “medium cup” test options. Complete training materials for FastText and BERT implementations accompany the dataset, with suggestions for data proportioning based on the RegMix methodology to follow.
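To make the PPL evaluation concrete, here is a minimal sketch of scoring documents from such a test set with a causal language model. The model name, file name, and one-document-per-line format are illustrative assumptions, not part of the FineFineWeb release.

```python
# A minimal sketch of per-domain perplexity (PPL) scoring, assuming the test
# sets are plain-text files with one document per line. The model name and
# file name below are placeholders, not part of the FineFineWeb release.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works for a PPL comparison
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of one document under the model."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    # With labels == input_ids, the model returns the mean cross-entropy loss.
    loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

with open("finefineweb_math_small_cup.txt") as f:  # hypothetical file name
    ppls = [perplexity(line) for line in f if line.strip()]
print(f"mean PPL over {len(ppls)} documents: {sum(ppls) / len(ppls):.2f}")
```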

The data construction process for FineFineWeb follows a systematic multi-step workflow. The initial deduplication of FineWeb employs exact deduplication and MinHash techniques. URL labeling utilizes GPT-4 to process the top million root URLs, categorizing them into Domain-of-Interest (DoI) and Domain-of-Non-Interest (DoNI) URLs. Further, the coarse recall phase involves domain-specific sampling based on the labeled root URLs, with Qwen2-7B-Instruct handling the labeling of 500K positive and negative data points. FastText models, trained on this labeled data, perform coarse recall operations across FineWeb to generate Coarse DoI Data.
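As a rough illustration of the coarse-recall step, the sketch below trains a fastText classifier on labeled seed data and uses it to filter documents. The training-file name, label names, hyperparameters, and 0.5 threshold are assumptions, not the project's published settings.

```python
# A rough sketch of the coarse-recall step, assuming the Qwen2-7B-Instruct
# labels have been written in fastText's supervised format, one example per
# line ("__label__doi <text>" / "__label__noni <text>").
import fasttext

model = fasttext.train_supervised(
    input="coarse_recall_train.txt",  # hypothetical seed-data file
    lr=0.1,
    epoch=5,
    wordNgrams=2,
)

def is_doi(document: str, threshold: float = 0.5) -> bool:
    """Keep a document if the classifier calls it Domain-of-Interest."""
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__doi" and probs[0] >= threshold

# Stand-ins for FineWeb documents streamed through the filter:
fineweb_documents = [
    "The derivative of x^2 is 2x, a basic result from calculus.",
    "Top ten celebrity gossip stories of the week.",
]
coarse_doi_data = [doc for doc in fineweb_documents if is_doi(doc)]
```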

The fine recall stage advances the data refinement process using Qwen2-72B-Instruct to label the Coarse DoI Data, creating 100K DoI-positive and 100K DoI-negative data points. A BERT model, trained on this labeled data, then performs fine recall to produce the final DoI subset of FineFineWeb. The entire coarse-fine recall iteration is repeated for three rounds, with specific modifications at each round.
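Continuing from the coarse-recall sketch above, the fine-recall filter might look like the following. The checkpoint path and the convention that label index 1 means DoI are assumptions for illustration.

```python
# A sketch of the fine-recall filter: a BERT classifier, fine-tuned on the
# 100K DoI-positive / 100K DoI-negative points labeled by Qwen2-72B-Instruct,
# scores every coarse-recall document.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "path/to/finetuned-bert-doi"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
model.eval()

@torch.no_grad()
def doi_score(document: str) -> float:
    """Probability that a document belongs to the Domain of Interest."""
    enc = tokenizer(document, return_tensors="pt", truncation=True, max_length=512)
    return torch.softmax(model(**enc).logits, dim=-1)[0, 1].item()  # index 1 = DoI (assumed)

fine_doi_data = [doc for doc in coarse_doi_data if doi_score(doc) > 0.5]
```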

The domain-domain similarity analysis uses proportional weighted sampling across the domain subsets, processing one billion tokens in total. The BGE-M3 model is then used to generate two types of embeddings: domain embeddings from domain-subset samples and benchmark embeddings from benchmark samples. The analysis concludes by calculating MMD and Wasserstein distances between the domain and benchmark embeddings to quantify their relationships.
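A minimal sketch of the distance computations follows, with random matrices standing in for the BGE-M3 embeddings. The RBF kernel, the median bandwidth heuristic, and the random projection used to reduce the embeddings to 1-D for SciPy's Wasserstein distance are illustrative choices, not necessarily those used by the authors.

```python
# Compute MMD^2 and a projected Wasserstein distance between two embedding
# sets (one row per sampled document). Random data stands in for BGE-M3 output.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import wasserstein_distance

def mmd_rbf(X: np.ndarray, Y: np.ndarray) -> float:
    """Squared Maximum Mean Discrepancy between samples X and Y, RBF kernel."""
    gamma = 1.0 / np.median(cdist(X, Y, "sqeuclidean"))  # median heuristic
    k = lambda A, B: np.exp(-gamma * cdist(A, B, "sqeuclidean"))
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
domain_emb = rng.normal(size=(200, 1024))     # stand-in: domain embeddings
benchmark_emb = rng.normal(size=(200, 1024))  # stand-in: benchmark embeddings

print("MMD^2:", mmd_rbf(domain_emb, benchmark_emb))
# SciPy's wasserstein_distance is 1-D, so project the embeddings first.
proj = rng.normal(size=1024)
print("W1:", wasserstein_distance(domain_emb @ proj, benchmark_emb @ proj))
```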

The similarity analysis reveals several key patterns in domain-benchmark relationships. Code-related benchmarks (MBPP and HumanEval) show significant distance from most domains except mathematics, indicating limited code representation in the dataset. General knowledge benchmarks (HellaSwag, ARC, MMLU, BoolQ) demonstrate close relationships with multiple domains, suggesting broad knowledge distribution, with gambling as the notable exception. GSM8K and TriviaQA exhibit notable domain-specific variations, particularly in mathematics and factual content. Lastly, the gambling domain stands distinctly apart, showing minimal overlap with other domains and benchmarks.

The domain-domain duplication analysis examines URL uniqueness across domains using TF-IDF values. High TF-IDF scores indicate domain-specific unique URLs, while low values suggest URLs shared across domains. The analysis reveals minimal duplication across most domains, with exceptions in the topicality, pet, and atmospheric science categories. The domain-benchmark correlation study, conducted across 28 models, compares domain-specific bits-per-character (BPC) rankings with benchmark performance rankings using Spearman correlation. STEM-related domains show stronger correlations with reasoning-focused benchmarks (ARC, MMLU, GSM8K, HumanEval, MBPP), while knowledge-intensive domains such as literature and history correlate more strongly with fact-based benchmarks like TriviaQA.
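To illustrate the correlation study, the following toy sketch computes a Spearman correlation between domain BPC scores and benchmark accuracies across models. All numbers are invented for demonstration (five models instead of 28).

```python
# Rank models by domain BPC and by benchmark score, then compute the Spearman
# correlation between the rankings. The values below are made-up placeholders.
from scipy.stats import spearmanr

domain_bpc = [0.92, 0.85, 1.01, 0.78, 0.88]     # bits-per-character, lower is better
benchmark_acc = [0.61, 0.70, 0.52, 0.77, 0.66]  # e.g. GSM8K accuracy, higher is better

# Negate BPC so both metrics run in the same "higher is better" direction.
rho, p = spearmanr([-b for b in domain_bpc], benchmark_acc)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```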


Check out the Dataset and Tweet. All credit for this research goes to the researchers of this project.


