Sub-Scaling Laws: On the Role of Data Density and Training Strategies in LLMs

cs.AI updates on arXiv.org, 21 hours ago

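The article analyzes sub-scaling in natural language processing, the regime in which gains in model performance slow down. It points to the influence of data quality and training strategies on model performance and proposes a sub-optimal scaling law suited to sub-scaling regimes.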

arXiv:2507.10613v1 Announce Type: cross

Abstract: Traditional scaling laws in natural language processing suggest that increasing model size and training data enhances performance. However, recent studies reveal deviations, particularly in large language models, where performance improvements decelerate, which is a phenomenon known as sub-scaling. This paper revisits these scaling laws by examining the impact of data quality and training strategies on model performance. Through extensive empirical analysis of over 400 models, we identify high data density and non-optimal resource allocation as key factors contributing to sub-scaling. High data density leads to diminishing returns due to redundant information, while optimal resource allocation is crucial for sustained performance improvements. We propose a sub-optimal scaling law that better predicts performance in sub-scaling regimes, highlighting the importance of data quality and diversity.
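The abstract does not state the form of the proposed sub-optimal scaling law, so the sketch below illustrates only the traditional baseline it is measured against. It assumes the widely used parametric form L(N, D) = E + A / N^alpha + B / D^beta from Hoffmann et al. (2022), with that paper's published fit as default constants; the function names and the chosen model and token sizes are hypothetical. Even under this traditional law, each doubling of training tokens buys a smaller loss reduction. Sub-scaling, as described above, refers to real models improving more slowly still than such a law predicts.

```python
# Illustrative only: traditional Chinchilla-style parametric loss,
# NOT the sub-optimal scaling law proposed in the paper.
# Default constants are the fit reported by Hoffmann et al. (2022).

def parametric_loss(n_params, n_tokens,
                    e=1.69, a=406.4, b=410.7, alpha=0.34, beta=0.28):
    """Predicted loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return e + a / n_params ** alpha + b / n_tokens ** beta


def doubling_gains(n_params, start_tokens, doublings):
    """Loss reduction from each successive doubling of the token count."""
    gains = []
    tokens = start_tokens
    for _ in range(doublings):
        before = parametric_loss(n_params, tokens)
        after = parametric_loss(n_params, 2 * tokens)
        gains.append(before - after)
        tokens *= 2
    return gains


if __name__ == "__main__":
    # Hypothetical setting: a 1B-parameter model starting from 10B tokens.
    for step, gain in enumerate(doubling_gains(1e9, 1e10, 5), start=1):
        print(f"doubling {step}: loss drops by {gain:.4f}")
```

Each printed gain is smaller than the one before it, which is the diminishing-returns behaviour the traditional law already builds in. Comparing measured losses against such a baseline is one way to see whether a training run falls in a sub-scaling regime.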

Related tags: natural language processing, model performance, sub-scaling, data quality, training strategies
