Memorization-generalization in practice

 


Published on January 30, 2025 2:10 PM GMT

Short post today, which is part II.1 of my series on tempering and SLT (see part one here). In this post I’ll explain in a bit more detail the “in practice” connection that experimentalists should see between the learning coefficient spectrum, tempering, and empirical measurements of the learning coefficient. In future installments of this part I’ll explain a bit of the theory behind this and how it relates to some notions inherent in the generalized “field theory” approach to modeling neural nets.

Practical measurements of the memorization-generalization spectrum 

I’m trying to do less of the thing where I hide experimentally-relevant points behind a wall of theory, so let me try to explain the “upshots” of this part ahead of time, and talk about theory later (in future installments of part II of this series).

- Tempering is implemented in practice by sampling algorithms, usually variants of SGLD (stochastic gradient Langevin dynamics) in an ML context. As with ordinary SGD, there are various optimization protocols that make it more efficient. There is a whole science of checking whether “sampling worked”, and sampling quality/best practice is an active area of research where the SLT crowd is making exciting progress. In my experience, sampling algorithms (at a minimum) work well for toy models, and agree with “expected results” when such expectations are known.
- Tempering works by gradually trading off performance for entropy (as will be explained below), in a way that is mathematically analogous to adding heat to a physical system. In practice, this means that tempering inductively “noises out” the least efficient circuits in a neural net, and it stops noising circuits when the increase in loss (compared to the initial fully-trained model) starts getting significantly higher than the “temperature” parameter.
- Tempering is a stochastic process. Often we’re interested in the “generic behavior” of a randomly selected tempered program (corresponding to running an experiment on a specific system at some fixed temperature). In other cases, we may be interested in expectation values over tempered programs, computed in practice by averaging the programs encountered in one or more “sampling traces”.
- The result of tempering can be read off of the “circuit efficiency” spectrum, and conversely the spectrum of efficiencies (in the language of the “bucket of circuits” post, these are the slopes, not the complete 2-dimensional data) can be read off of tempering measurements. The process of converting a “bucket of circuits” to a tempering prediction is as follows (with various modifications needed in various contexts):
  1. Consider a specific temperature t.
  2. Figure out the “log odds change” ΔL(t) inherent in the loss. Note that this step is a little tricky and context-dependent; “generically” and in the high-data limit, it is roughly the tolerated loss increase above L₀, where L₀ is the loss of the fully-trained model. Note that getting this function exactly right isn’t that important for experimentalists, as it is reasonable to instead manually tune the temperature until it puts you in a regime of interest.
  3. Inductively noise out the lowest-efficiency circuits until the total loss increase from the noised circuits matches the value ΔL(t).
  4. The prediction for the tempered model is now the result of noising out these “inefficient circuits”. In particular, running an interpretability experiment on the tempered model should be expected to fail if it extracted information about the noised-out circuits, and succeed if it extracted information about surviving circuits.
  5. The learning coefficient can now be recovered from sampling the loss for tempered models[1].
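As a concrete illustration, the noising-out step of the recipe above can be sketched as follows. The circuit efficiencies, per-circuit loss contributions, and the budget `delta_L` are all hypothetical toy inputs, not quantities from the post:

```python
# Hypothetical sketch of the "bucket of circuits" -> tempering prediction
# step: noise out the least efficient circuits until the accumulated loss
# increase matches the budget delta_L set by the temperature.

def predict_noised_circuits(efficiencies, loss_contributions, delta_L):
    """Return indices of circuits predicted to be noised out.

    efficiencies: per-circuit efficiency slopes (illustrative units).
    loss_contributions: loss increase incurred if that circuit is noised out.
    delta_L: total loss-increase budget at the chosen temperature.
    """
    # Visit circuits from least to most efficient.
    order = sorted(range(len(efficiencies)), key=lambda i: efficiencies[i])
    noised, total = [], 0.0
    for i in order:
        if total >= delta_L:  # budget exhausted: remaining circuits survive
            break
        noised.append(i)
        total += loss_contributions[i]
    return noised

# Toy spectrum: three circuits with increasing efficiency.
print(predict_noised_circuits([0.1, 0.5, 2.0], [0.3, 0.2, 0.4], delta_L=0.4))
```

With the budget 0.4, the two least efficient circuits are noised out and the most efficient one survives; raising or lowering `delta_L` moves the cutoff along the spectrum, which is the sense in which temperature sweeps trace out the memorization-generalization spectrum.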


This recipe can be reversed to extract the circuit efficiency spectrum from empirical measurements of the tempering process.
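A minimal sketch of this reversal, under the simplifying assumption that each circuit drops out at a sharp characteristic temperature, so that each jump in a measured loss-vs-temperature curve is read as one circuit being noised out. The toy curve, the jump threshold, and the identification “jump temperature ≈ circuit efficiency” are illustrative assumptions, not the post’s procedure:

```python
# Hypothetical sketch of reading a circuit efficiency spectrum off a
# measured loss-vs-temperature tempering curve.

def spectrum_from_tempering(temps, mean_losses, jump_threshold=1e-3):
    """Scan the tempering curve for loss jumps; each jump is interpreted
    as one circuit being noised out, with the temperature at which it
    occurs standing in for that circuit's efficiency.

    temps: increasing temperatures; mean_losses: mean tempered loss at each.
    Returns a list of (efficiency ~ temperature, loss contribution) pairs.
    """
    spectrum = []
    for k in range(1, len(temps)):
        jump = mean_losses[k] - mean_losses[k - 1]
        if jump > jump_threshold:
            spectrum.append((temps[k], jump))
    return spectrum

# Toy curve: two circuits dropping out near t = 0.2 and t = 0.8.
curve_t = [0.1, 0.2, 0.4, 0.8, 1.0]
curve_L = [0.00, 0.30, 0.30, 0.70, 0.70]
print(spectrum_from_tempering(curve_t, curve_L))
```

In real measurements the cutoffs are soft (see the footnote), so the “jumps” are smeared out and one would fit knees in the curve rather than threshold finite differences; the sketch only shows the direction of the inference.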

  1. ^

    Roughly: tempering means we require that the loss “isn’t much worse than t” above its minimum, and the learning coefficient measures the variance. If improving the loss is very entropically expensive, then tempered NNs will be “very resistant” to pushing the loss below their minimal allowed value, and this variance will be small. Note that for the conceptual cartoons I’m blurring out the difference between so-called “microcanonical” and “canonical” quantities, and real tempering has “soft” exponential cutoffs rather than exact “loss bounded by this value”-style effects.
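For concreteness, here is a hedged sketch of one common way such a loss-sampling estimate is formed in the SLT literature: a “mean loss gap” estimator of the form λ̂ = nβ(E[L_tempered] − L₀) with β = 1/log n. The sample losses, sample size, and defaults below are illustrative assumptions, not the author’s protocol:

```python
import math

# Hedged sketch of recovering a learning coefficient estimate from loss
# samples of tempered models, using the mean-loss-gap form
# lambda_hat = n * beta * (mean tempered loss - base loss).

def learning_coefficient_estimate(tempered_losses, base_loss, n, beta=None):
    """Estimate the learning coefficient from sampled tempered losses.

    tempered_losses: losses recorded along the sampling trace(s).
    base_loss: loss of the fully-trained model (L_0).
    n: number of training samples; beta defaults to 1/log(n).
    """
    if beta is None:
        beta = 1.0 / math.log(n)
    mean_loss = sum(tempered_losses) / len(tempered_losses)
    return n * beta * (mean_loss - base_loss)

# Toy numbers: the tempered chain hovers slightly above the base loss.
print(learning_coefficient_estimate([0.52, 0.55, 0.53], base_loss=0.50, n=1000))
```

The footnote’s variance intuition shows up here as sampling error: when improving the loss is entropically expensive, the sampled losses cluster tightly just above `base_loss` and the estimate stabilizes quickly across traces.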



