MarkTechPost@AI · 05:15, two days ago
NVIDIA Introduces CLIMB: A Framework for Iterative Data Mixture Optimization in Language Model Pretraining

NVIDIA researchers have proposed CLIMB, a framework for automatically discovering and optimizing data mixtures for large language model pretraining. CLIMB combines unsupervised clustering with iterative optimization to identify data mixtures suited to general or domain-specific objectives. The framework clusters large-scale text data, constructs candidate mixtures, evaluates them with proxy models, and uses a regression predictor to estimate mixture performance, arriving at effective data mixtures under a fixed compute budget. Experiments show that CLIMB outperforms existing methods across multiple general reasoning tasks and domain-specific benchmarks, and the team has released reproducible resources for further research.

💡 CLIMB is a framework for optimizing the pretraining data mixture of large language models; its automated pipeline avoids manual annotation and static heuristics.

🧩 The core pipeline embeds large-scale text data into a semantic space with a pretrained encoder, then organizes it into coherent groups via K-means clustering.

🔍 CLIMB evaluates candidate mixtures with proxy models and fits a regression predictor (e.g., LightGBM) to estimate mixture performance, guiding further sampling and pruning for efficient exploration of the mixture space.

📈 Experimental results show that CLIMB-trained models outperform other methods on both general reasoning tasks and domain-specific benchmarks; for example, a 1B-parameter model reaches an average accuracy of 60.41%, beating baselines such as DoReMi and RegMix.

📚 NVIDIA has released ClimbLab (a 1.2-trillion-token corpus split into 20 semantic clusters) and ClimbMix (an optimized 400-billion-token mixture) to support reproducibility and further research.

Challenges in Constructing Effective Pretraining Data Mixtures

As large language models (LLMs) scale in size and capability, the choice of pretraining data remains a critical determinant of downstream performance. Most LLMs are trained on large, web-scale datasets such as Common Crawl, which provide broad coverage but lack explicit domain labels. This introduces difficulties in curating mixtures that balance general knowledge with domain-specific expertise.

Manual dataset curation, as seen in efforts like The Pile, is labor-intensive and does not scale well. Moreover, the nonlinear relationship between data composition and model performance makes it non-trivial to determine what proportions of domain data are optimal. These constraints motivate the need for automated, scalable, and adaptive data selection methods.

CLIMB: An Iterative Framework for Data Mixture Discovery

To address this, NVIDIA researchers propose CLIMB (CLustering-based Iterative Data Mixture Bootstrapping), a framework that automates the discovery and refinement of data mixtures for language model pretraining. CLIMB combines unsupervised clustering with iterative optimization to identify mixtures that are well-suited for general or domain-specific objectives.

The pipeline begins by embedding large-scale text data into a semantic space using pretrained encoders. K-means clustering is then applied to organize the data into coherent groups, which are pruned and merged based on content quality and redundancy. This forms the basis for constructing candidate mixtures.
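
The sketch below illustrates this clustering step under simplifying assumptions: a small sentence-transformers encoder ("all-MiniLM-L6-v2" is an arbitrary illustrative choice, not necessarily the encoder used in the paper), a toy corpus, and three clusters instead of ClimbLab's twenty.

```python
# Minimal sketch of the embed-then-cluster step (illustrative, not the paper's exact setup).
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "Proof of the quadratic formula by completing the square.",
    "Recipe: slow-cooked lentil soup with cumin and garlic.",
    "The stack and the heap in systems programming.",
    "Integration by parts on polynomial-exponential products.",
    "How to season a cast-iron skillet.",
    "Reference counting versus tracing garbage collection.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")          # any pretrained text encoder
embeddings = encoder.encode(documents, normalize_embeddings=True)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)

# Group documents by cluster; in CLIMB these groups are then pruned and merged
# based on quality and redundancy before candidate mixtures are sampled over them.
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    print(f"cluster {c}: docs {members.tolist()}")
```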

Subsequently, CLIMB uses proxy models to evaluate sampled mixtures and fits a regression-based predictor (e.g., LightGBM) to estimate mixture performance. An iterative bootstrapping procedure progressively refines the sampling space, prioritizing high-performing configurations. This allows CLIMB to converge on an effective data mixture under a fixed compute budget.
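
A minimal sketch of the predictor step follows, assuming each candidate mixture is represented by its weight vector over clusters and scored by a proxy-model evaluation; the data here is synthetic and the hyperparameters are placeholders, not the paper's settings.

```python
# Fit a regression predictor on (mixture weights -> proxy score) and use it to
# rank a large pool of unseen candidate mixtures (illustrative sketch only).
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
k = 20                                                   # number of clusters

observed_mixtures = rng.dirichlet(np.ones(k), size=64)   # mixtures already tried with proxies
observed_scores = rng.uniform(0.4, 0.6, size=64)         # placeholder proxy-model accuracies

predictor = LGBMRegressor(n_estimators=200, min_child_samples=5)
predictor.fit(observed_mixtures, observed_scores)

# Score a cheap pool of unseen candidates and keep only the most promising ones
# for the next round of (more expensive) proxy training.
candidates = rng.dirichlet(np.ones(k), size=10_000)
predicted = predictor.predict(candidates)
top_candidates = candidates[np.argsort(predicted)[-16:]]
print("retained", len(top_candidates), "candidates for the next iteration")
```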

Technical Details and Design Considerations

The optimization process is framed as a bi-level problem: at the lower level, proxy models are trained on candidate mixtures; at the upper level, a predictor is learned to approximate performance outcomes. This predictor guides further sampling and pruning, enabling efficient exploration of the mixture space.
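
The loop below sketches this bi-level structure under stated assumptions: `train_proxy_and_eval` is a hypothetical stand-in for training a small proxy model on a sampled mixture and returning its evaluation score, and the way the sampling distribution is narrowed each round is an illustrative choice rather than CLIMB's exact pruning rule.

```python
# Illustrative bi-level loop: lower level trains proxies on sampled mixtures,
# upper level fits a predictor and narrows the sampling space around the best region.
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
k = 20

def train_proxy_and_eval(weights: np.ndarray) -> float:
    """Placeholder: train a proxy LM on data sampled with `weights`, return its eval score."""
    return float(rng.uniform(0.4, 0.6))

tried_mixtures, tried_scores = [], []
search_center = np.ones(k)                     # Dirichlet concentration defining the search space

for iteration in range(3):
    # Lower level: train proxy models on a handful of sampled mixtures.
    for w in rng.dirichlet(search_center, size=8):
        tried_mixtures.append(w)
        tried_scores.append(train_proxy_and_eval(w))

    # Upper level: fit the predictor on everything observed so far.
    predictor = LGBMRegressor(n_estimators=100, min_child_samples=2)
    predictor.fit(np.array(tried_mixtures), np.array(tried_scores))

    # Prune: concentrate future sampling around the best predicted mixture.
    pool = rng.dirichlet(search_center, size=5_000)
    best = pool[np.argmax(predictor.predict(pool))]
    search_center = 1.0 + 50.0 * best

print("best mixture weights:", best.round(3))
```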

CLIMB supports sparsity in mixture weights, encouraging the discovery of compact, domain-relevant data subsets. The use of clustering over embeddings—rather than token-level features—ensures semantic coherence within clusters. The iterative refinement is structured to balance breadth (search space coverage) with depth (predictive accuracy), and ablation studies confirm that careful compute allocation across iterations improves convergence and final performance.
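
As a small illustration of sparse mixture weights, the snippet below samples from a low-concentration Dirichlet and zeroes out negligible clusters; this is only one way to induce sparsity and may differ from the paper's scheme.

```python
# Low Dirichlet concentration makes a few clusters dominate; thresholding then
# yields a compact, sparse mixture (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
k = 20
w = rng.dirichlet(0.2 * np.ones(k))      # concentration < 1 -> a few clusters dominate
w[w < 0.01] = 0.0                        # zero out negligible clusters
w /= w.sum()                             # renormalize to a valid mixture
print(f"{(w > 0).sum()} of {k} clusters retained")
```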

The framework also exhibits robustness across proxy model sizes and cluster granularities. While larger proxy models yield slightly better predictions, even smaller models preserve key structural trends. Similarly, CLIMB is relatively insensitive to initial cluster count, provided it is within a reasonable range.

Empirical Evaluation and Observations

CLIMB was evaluated on several general reasoning tasks, including PIQA, ARC (Easy and Challenge), HellaSwag, and WinoGrande. A 1B-parameter model trained on CLIMB-discovered mixtures achieved an average accuracy of 60.41%, outperforming comparable baselines such as DoReMi and RegMix.

When extended to 400B-token pretraining, this 1B model outperformed Llama-3.2-1B by 2.0% on a broad suite of benchmarks. Similarly, in the sub-500M model category, CLIMB-based pretraining led to consistent improvements over models like SmolLM and TinyLlama.

Domain specialization further highlights CLIMB’s utility. In targeted MMLU benchmarks across STEM, humanities, and social sciences, CLIMB-trained models outperformed both random selection and exhaustive search baselines. The iterative process showed consistent gains at each stage, indicating effective guidance from the predictive model.

To facilitate reproducibility and further research, NVIDIA has released two resources: ClimbLab, a 1.2-trillion-token corpus organized into 20 semantic clusters, and ClimbMix, an optimized 400-billion-token mixture for pretraining.

Models trained on ClimbMix outperform those trained on datasets like Nemotron-CC and SmolLM under equivalent token budgets, demonstrating improved scaling characteristics.

Conclusion

CLIMB presents a systematic approach for optimizing data mixtures in LLM pretraining. By combining semantic clustering with proxy-based iterative search, it avoids reliance on manual annotations or static heuristics. The method supports both generalist and specialist training goals and adapts to varying compute and data constraints.

This framework contributes to ongoing efforts in data-centric AI by offering a scalable and principled alternative to handcrafted data pipelines. Its empirical performance underscores the importance of data mixture optimization in maximizing model utility, particularly under fixed resource budgets.


Check out the Paper, ClimbLab on HF, and ClimbMix on HF.
