The Croissant Principle: A Theory of AI Generalization


This post examines data scaling laws and double descent in machine learning and proposes a new explanation based on a latent space and a generative map. It argues that the data in any statistical learning problem arises from a latent space, and derives an upper bound on the true loss from this assumption. Using the notion of Hausdorff dimension, the author explains why data scaling follows a power law and gives the exact form of the exponent. The post also explains double descent: as the number of model parameters grows, the model space expands until it contains models that both fit the data well and have low complexity, so eval loss falls again. The analysis is independent of model architecture, task, and loss function, offering a new lens on these mysterious phenomena in modern machine learning.

💡 The core claim is that all statistical learning data arises from a latent space via a generative map.

📐 Hausdorff dimension is used to explain why data scaling follows a power law, including a concrete form for the exponent, giving data scaling laws a theoretical grounding.

📉 For double descent, the post argues that as parameters increase, the expanding model space comes to include low-complexity models that still fit the data well, which explains why eval loss drops again.

🌐 The analysis does not depend on any particular model architecture, task, or loss function; it applies to all statistical learning methods.

Published on June 22, 2025 5:58 PM GMT

I recently wrote an ML theory paper which proposes explanations for mysterious phenomena in contemporary machine learning like data scaling laws and double descent. Here's the link to the paper and the Twitter thread. It didn't get much attention, and I need an endorser to publish on arXiv, so I thought I'd post it here and get some feedback (and maybe an endorser!).

Essentially what the paper does is propose that all data in a statistical learning problem arises from a latent space via a generative map. From this we derive an upper bound on the true loss in terms of the training (empirical) loss, the latent-space distance to the closest training sample where the model does better than the training loss, and the compressibility of the model (akin to Kolmogorov complexity).
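
Schematically, and in my own placeholder notation rather than the paper's exact statement, the bound has roughly this shape:

```latex
% Placeholder notation (not the paper's exact theorem): \hat{L}_n is the empirical loss,
% d(.,.) the latent-space metric, z_i the latent points behind the training samples,
% K a Lipschitz-type constant, and C(f) a measure of how (in)compressible the model is.
\[
  L_{\mathrm{true}}(f) \;\lesssim\; \hat{L}_{n}(f)
    \;+\; K \cdot \max_{z} \min_{i \le n} d(z, z_i)
    \;+\; \Phi\bigl(C(f), n\bigr)
\]
% The middle term is the worst-case latent distance to the nearest training sample,
% and \Phi is some penalty that increases with the model complexity C(f).
```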

Modulo a reasonable (but as yet unproven) conjecture, we are able to explain why data scaling follows a power law, as well as the exact form of the exponent. The intuition comes from Hausdorff dimension, which measures the dimension of a metric space.

Imagine you are building a model with 1-dimensional inputs, say in the unit interval $[0,1]$, and you have ten training samples spaced evenly. If the loss of your model is Lipschitz (it doesn't change unboundedly fast; for smooth enough functions this just means the derivative is bounded), then the loss at any test point can't exceed the loss at the closest training point plus the distance to that point (at most around 1/10 here) times the Lipschitz constant (the bound on the derivative).
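
As a toy illustration of this Lipschitz argument (my own sketch, not code from the paper, with made-up numbers for the losses and the Lipschitz constant):

```python
import numpy as np

# Ten evenly spaced training points on [0, 1]; assume the map from input to loss
# is K-Lipschitz (its "derivative" is bounded by K). All numbers here are made up
# purely to illustrate the argument.
K = 2.0                                        # assumed Lipschitz constant
train_x = np.linspace(0.05, 0.95, 10)          # evenly spaced training inputs
train_loss = np.full(10, 0.05)                 # stand-in training losses

def loss_upper_bound(x):
    """Bound the loss at test input x by the nearest training point's loss
    plus K times the distance to that point."""
    dist = np.abs(train_x - x)
    i = np.argmin(dist)
    return train_loss[i] + K * dist[i]

# In the interior the nearest training point is at most ~1/(2*10) = 0.05 away,
# so the bound exceeds the nearest training loss by at most ~K * 0.05 = 0.1.
print(loss_upper_bound(0.42))                  # 0.05 + 2.0 * 0.03 = 0.11
```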

If you want to improve generalization, you can sample more data. If the $n$ samples are spaced optimally (evenly), the maximum distance to a training sample decreases like $1/n$, as is easy to see. However, if you were working with 2-dimensional data, it would scale like $n^{-1/2}$! Hausdorff dimension essentially defines the dimension of a metric space as the number $d$ such that this distance scales like $n^{-1/d}$.
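
Here is a quick empirical check of that covering-distance scaling (my own sketch, not from the paper): lay down an evenly spaced grid of training points in $[0,1]^d$, measure the worst-case distance from random test points to the nearest training point, and fit the exponent.

```python
import numpy as np
from scipy.spatial import cKDTree

# With ~n evenly spaced training points in [0,1]^d, the worst-case distance from a
# test point to its nearest training point should shrink roughly like n**(-1/d).
rng = np.random.default_rng(0)

def max_nn_distance(n_per_axis, d, n_test=20000):
    axes = [np.linspace(0, 1, n_per_axis)] * d
    grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, d)  # evenly spaced samples
    test = rng.random((n_test, d))
    dists, _ = cKDTree(grid).query(test)      # distance to nearest training sample
    return grid.shape[0], dists.max()

for d in (1, 2):
    ns, covers = zip(*[max_nn_distance(k, d) for k in (4, 8, 16, 32, 64)])
    slope = np.polyfit(np.log(ns), np.log(covers), 1)[0]
    print(f"d={d}: fitted exponent ≈ {slope:+.2f} (prediction: {-1/d:+.2f})")
```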

If you now put these two facts together, you get that the generalization gap (the gap between true loss and training loss) is $O(n^{-1/d})$, where $n$ is the number of training samples and $d$ is the Hausdorff dimension of the latent space. In other words, we have a concrete explanation of data scaling laws!
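
To read the exponent concretely (my own numerical illustration; the dimension value is made up): if the latent space had Hausdorff dimension 4, the bound says you need 16x more data to halve the generalization gap.

```python
# gap(n) ~ n**(-1/d)  implies  gap(k*n) / gap(n) = k**(-1/d)   (illustration only)
d = 4                  # hypothetical Hausdorff dimension of the latent space
k = 16                 # factor by which the training set grows
print(k ** (-1 / d))   # 0.5 -> the generalization gap is predicted to halve
```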

It's worth noting that this analysis is independent of architecture, task, and (mostly) the loss function, and it doesn't even assume you're using a neural network! It applies to all statistical learning methods. So that's pretty cool!

The second major phenomenon we can explain is double descent; here the explanation in fact builds on an existing framework. Double descent is the phenomenon where, as the number of parameters per data sample grows, eval loss first decreases, then increases (as classical learning theory predicts), but then decreases again! This last part has been quite a mystery in modern ML.

We propose an explanation. The generalization gap has long been known to be bounded by a term depending on the complexity of the model. For small models, adding parameters helps fit the data better, driving down both training and eval loss. Eventually you start to overfit and the complexity skyrockets, causing eval loss to rise. However, as you keep adding parameters, the space of possible models keeps expanding until it contains models that both fit the data well and have low complexity! This drives eval loss down again, provided you can find those models. This fits with empirical observations that enormous models are simple (they have low-rank weight matrices), that sparse subnetworks can do just as well as the full model, and that abnormally important "superweights" exist.
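
For intuition, here is a toy reproduction of the curve itself (my own sketch, not an experiment from the paper): minimum-norm least squares on random Fourier features typically shows test error rising as the feature count approaches the sample count, then falling again past the interpolation threshold, because the minimum-norm solution found there is a low-complexity interpolant.

```python
import numpy as np

# Toy double-descent curve (illustration only): minimum-norm least squares on
# random Fourier features of a 1-D regression problem. Test error tends to peak
# near the interpolation threshold (p ≈ n_train) and drop again beyond it.
rng = np.random.default_rng(0)
n_train, n_test = 40, 500
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, n_test)
f = lambda x: np.sin(2 * np.pi * x)
y_train = f(x_train) + 0.1 * rng.standard_normal(n_train)
y_test = f(x_test)

def features(x, w, b):
    # Random Fourier features: cos(w * x + b) for fixed random w, b.
    return np.cos(np.outer(x, w) + b)

for p in (5, 20, 40, 80, 320, 1280):              # number of random features
    w = rng.normal(0, 10, p)
    b = rng.uniform(0, 2 * np.pi, p)
    Phi_tr, Phi_te = features(x_train, w, b), features(x_test, w, b)
    theta = np.linalg.pinv(Phi_tr) @ y_train      # minimum-norm least-squares fit
    err = np.mean((Phi_te @ theta - y_test) ** 2)
    print(f"p={p:5d}  test MSE={err:.3f}")
```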

So yeah, we provide plausible explanations for two major mysteries of machine learning. I think that's really cool! Unfortunately I don't really have the following or institutional affiliation to get this widely noticed. I'd love your feedback! And if you think it's cool too, I'd really appreciate you sharing it with other people, retweeting the thread, and offering to endorse me so I can publish this on arXiv!

Thanks for your time!


