MarkTechPost@AI · 2 days ago, 14:20
ByteDance Introduces QuaDMix: A Unified AI Framework for Data Quality and Diversity in LLM Pretraining

ByteDance has introduced QuaDMix, a unified data selection framework designed to systematically balance data quality and diversity in LLM pretraining. QuaDMix evaluates each data sample against multiple quality criteria and a domain classification, and determines its sampling probability through a parameterized function. The framework uses proxy-model experiments combined with LightGBM regression to predict downstream performance, enabling efficient parameter optimization without large-scale training. Experiments show that, compared with methods that optimize quality and diversity separately, QuaDMix improves average performance by 7.2% across multiple benchmarks, highlighting the effectiveness of the joint approach.

✨ QuaDMix is a unified data selection framework that optimizes large language model (LLM) pretraining by jointly accounting for data quality and domain diversity.

📊 QuaDMix consists of three main stages: feature extraction, quality aggregation, and quality-diversity aware sampling. First, each document is annotated with a domain label and multiple quality scores. These scores are then normalized and merged using domain-specific parameters to compute an aggregated quality score. Finally, documents are sampled according to a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance through parameterized controls.

💡 QuaDMix is optimized by training thousands of proxy models under different parameter settings. A regression model trained on these proxy experiments predicts performance outcomes, allowing the optimal sampling configuration to be identified. This enables a structured exploration of the high-dimensional parameter space and aligns data selection more closely with the intended downstream tasks.

✅ QuaDMix's advantages: it jointly optimizes data quality and domain diversity, adapts to task-specific needs through proxy-based evaluation of targeted selections, gains computational efficiency by avoiding exhaustive full-model retraining, and delivers consistent downstream performance gains without increasing the compute budget.

The pretraining efficiency and generalization of large language models (LLMs) are significantly influenced by the quality and diversity of the underlying training corpus. Traditional data curation pipelines often treat quality and diversity as separate objectives, applying quality filtering followed by domain balancing. This sequential optimization overlooks the complex interdependencies between these factors. High-quality datasets frequently exhibit domain biases, while diversified datasets may compromise quality. In the context of fixed training budgets, there is a critical need to simultaneously optimize for both dimensions to maximize model performance. However, defining and jointly optimizing quality and diversity remain non-trivial challenges.

ByteDance Introduces QuaDMix

ByteDance presents QuaDMix, a unified data selection framework that systematically balances quality and diversity during LLM pretraining. QuaDMix evaluates each data sample based on multiple quality criteria and domain classifications and determines its sampling probability through a parameterized function. The framework employs proxy model experiments combined with LightGBM-based regression to predict downstream performance, enabling efficient parameter optimization without exhaustive large-scale training. Experiments demonstrate that QuaDMix achieves an average performance improvement of 7.2% across multiple benchmarks compared to methods optimizing quality and diversity separately, underscoring the effectiveness of a joint approach.

QuaDMix operates in three principal stages: feature extraction, quality aggregation, and quality-diversity aware sampling. Initially, each document is annotated with domain labels and multiple quality scores. These scores are normalized and merged using domain-specific parameters to compute an aggregated quality score. Documents are subsequently sampled according to a sigmoid-based function that prioritizes higher-quality samples while maintaining domain balance through parameterized controls.
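As a rough sketch of this sampling stage (a minimal illustration under assumed names, not ByteDance's released code), the aggregation and sigmoid-based sampling could look like the following. The WEIGHTS, STEEPNESS, and THRESHOLD tables stand in for QuaDMix's learned domain-specific parameters, and the exact functional form in the paper may differ:

```python
import numpy as np

# Hypothetical per-domain parameters (NOT the paper's learned values):
# criterion weights for quality aggregation, plus a sigmoid steepness and
# threshold that control how aggressively each domain is filtered.
WEIGHTS = {"web": {"fluency": 0.5, "educational": 0.3, "toxicity": 0.2}}
STEEPNESS = {"web": 10.0}
THRESHOLD = {"web": 0.6}

def aggregate_quality(scores: dict, domain: str) -> float:
    """Merge normalized quality scores using domain-specific weights."""
    w = WEIGHTS[domain]
    return sum(w[c] * scores[c] for c in w) / sum(w.values())

def sampling_probability(scores: dict, domain: str) -> float:
    """Sigmoid-shaped keep probability: higher aggregated quality means a
    higher chance of being sampled; the per-domain steepness and threshold
    keep the overall domain mixture balanced."""
    q = aggregate_quality(scores, domain)
    z = STEEPNESS[domain] * (q - THRESHOLD[domain])
    return 1.0 / (1.0 + np.exp(-z))

# Example: a web document with three normalized quality scores.
doc_scores = {"fluency": 0.8, "educational": 0.7, "toxicity": 0.9}
p = sampling_probability(doc_scores, "web")  # approx. 0.87
```

The per-domain threshold shifts the sigmoid, so stricter or looser filtering can be applied domain by domain; this is how quality filtering and domain balancing interact inside a single parameterized function rather than in two separate pipeline steps.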

Optimization is performed by training thousands of proxy models across different parameter settings. A regression model, trained on these proxy experiments, predicts performance outcomes, enabling identification of optimal sampling configurations. This method allows for a structured exploration of a high-dimensional parameter space, aligning data selection more closely with intended downstream tasks.
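Assuming each proxy run is summarized as a (parameter configuration, benchmark score) pair, the regression-and-search step might be sketched as follows. The 12-dimensional parameter vector, the LightGBM hyperparameters, and the placeholder data are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)

# Each row is one proxy-model experiment: a candidate configuration of the
# sampling parameters (a hypothetical 12-dim vector) paired with the
# downstream benchmark score that proxy run achieved. Real scores come from
# training small proxy models; random values here are placeholders.
X = rng.uniform(0.0, 1.0, size=(3000, 12))   # sampled parameter configs
y = rng.normal(0.5, 0.1, size=3000)          # placeholder benchmark scores

reg = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
reg.fit(X, y)

# Screen a large pool of unseen configurations and keep the one predicted
# to perform best; no further large-scale training is required.
candidates = rng.uniform(0.0, 1.0, size=(100_000, 12))
best_config = candidates[np.argmax(reg.predict(candidates))]
```

Because the fitted regressor is cheap to query, very large candidate pools can be screened at negligible cost compared with even a single full pretraining run.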

QuaDMix provides several advantages:

- Unified optimization of data quality and domain diversity within a single framework.
- Adaptation to task-specific requirements via proxy-based evaluation of candidate selections.
- Greater computational efficiency by avoiding exhaustive full-model retraining.
- Consistent downstream performance gains without an increased compute budget.

Experimental Results and Insights

Validation experiments were conducted using the RefinedWeb dataset, training 530M parameter models from scratch. QuaDMix was compared against several baselines, including Random Selection, Fineweb-edu, AskLLM, DCLM, DSIR, and RegMix. QuaDMix consistently outperformed these methods, achieving an average score of 39.5% across nine diverse benchmarks.

Key observations include:

Conclusion

QuaDMix offers a principled approach to data selection for LLM pretraining, addressing the longstanding challenge of simultaneously optimizing data quality and diversity. By integrating quality aggregation and domain-aware sampling within a unified framework and leveraging proxy-based optimization, QuaDMix establishes a scalable methodology for enhancing LLM pretraining efficiency. While there is room for future improvement, such as refining the parameter space and enhancing proxy-model fidelity, QuaDMix represents a significant step towards more systematic and effective data curation for large-scale model development.


Check out the Paper.
