Published on January 30, 2025 2:10 PM GMT
Short post today, which is part II.1 of my series on tempering and SLT (see part one here). In this post I’ll explain in a bit more detail the “in practice” connection that experiments should see between the learning coefficient spectrum, tempering, and empirical measurements of the learning coefficient. In future installments of this part I’ll explain a bit of the theory behind this and how it relates to some notions inherent in the generalized “field theory” approach to modeling neural nets.
Practical measurements of the memorization-generalization spectrum
I’m trying to do less of the thing where I hide experimentally-relevant points behind a wall of theory, so let me try to explain the “upshots” of this part ahead of time, and talk about theory later (in future installments of part II of this series).
- Tempering is implemented in practice by sampling algorithms, usually variants of “SGLD” (stochastic gradient Langevin dynamics) in an ML context. As with ordinary SGD, there are various optimization protocols that make it more efficient. There is a whole science of how to check whether “sampling worked”, and sampling quality/best practice is an active area of research where the SLT crowd is making exciting progress. In my experience, sampling algorithms (at a minimum) work well for toy models, and agree with “expected results” when such expectations are known.
- Tempering works by gradually trading off performance for entropy (as will be explained below), in a way that is mathematically analogous to adding heat to a physical system. In practice, this means that tempering inductively “noises out” the least efficient circuits in a neural net, and it stops noising circuits when the increase in loss (compared to the initial fully-trained model) starts getting significantly higher than the “temperature” parameter.
- Tempering is a stochastic process. Often we’re interested in the “generic behavior” of a randomly selected tempered program (corresponding to running an experiment on a specific system at some fixed temperature). In other cases, we may be interested in expectation values over tempered programs, obtained in practice by averaging over the programs encountered in one or more “sampling traces”.
- The result of tempering can be read off of the “circuit efficiency” spectrum, and conversely the spectrum of efficiencies (in the language of the “bucket of circuits” post these are the slopes, not the complete 2-dimensional data) can be read off of tempering measurements. The process of converting a “bucket of circuits” to a tempering prediction is as follows (with various modifications needed in various contexts; a toy code sketch follows the sub-steps below):
- Consider a specific temperature t.
- Figure out the “log odds change” inherent in the loss. Note that this step is a little tricky and context-dependent; “generically” and in the high-data limit, it is given by $n(L - L_0)/t$, where $L_0$ is the loss of the fully-trained model. Note that getting this function exactly right isn’t that important for experimentalists, as it is reasonable to instead manually tune the temperature until it puts you in a regime of interest.
- Inductively noise out the lowest-efficiency circuits until the total increase in loss from the noised-out circuits matches the allowed budget (of order the temperature t).
- The prediction for the tempered model is now the result of noising out these “inefficient circuits”. In particular, an interpretability experiment run on the tempered model should be expected to fail if it extracts information about the noised-out circuits, and to succeed if it extracts information about surviving circuits.
- The learning coefficient can now be recovered by sampling the loss of tempered models[1].
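To make the cartoon recipe concrete, here is a minimal sketch of the conversion in Python. Everything in it — the two numbers summarizing each circuit, the efficiency ordering, and the “stop when the loss budget of order t is spent” rule — is an illustrative rendering of the cartoon above, not a claim about the actual sampling dynamics.

```python
# Illustrative sketch of the "bucket of circuits" -> tempering prediction recipe.
# Each circuit is summarized by two cartoon numbers: the loss it saves (delta_loss)
# and its complexity / entropy cost (lam). The circuits, the numbers, and the
# stopping rule are all assumptions made up for this illustration.

from dataclasses import dataclass

@dataclass
class Circuit:
    name: str
    delta_loss: float  # how much the loss increases if this circuit is noised out
    lam: float         # entropy / learning-coefficient cost of keeping the circuit

def temper_prediction(circuits, L0, t):
    """Predict which circuits survive tempering at temperature t (cartoon version).

    Noise out circuits in order of increasing efficiency (delta_loss / lam)
    until the accumulated loss increase would exceed the budget of order t.
    """
    by_efficiency = sorted(circuits, key=lambda c: c.delta_loss / c.lam)
    noised, loss_increase = [], 0.0
    for c in by_efficiency:
        if loss_increase + c.delta_loss > t:  # loss budget spent: stop noising
            break
        noised.append(c)
        loss_increase += c.delta_loss
    surviving = by_efficiency[len(noised):]
    predicted_loss = L0 + loss_increase
    predicted_lambda = sum(c.lam for c in surviving)  # complexity of what survives
    return noised, surviving, predicted_loss, predicted_lambda

# Example with three hypothetical circuits of very different efficiencies.
bucket = [
    Circuit("memorized-example", delta_loss=0.01, lam=5.0),  # saves little loss, costly
    Circuit("bigram-heuristic",  delta_loss=0.10, lam=2.0),
    Circuit("general-algorithm", delta_loss=1.00, lam=3.0),  # saves a lot of loss
]
noised, surviving, L_t, lam_t = temper_prediction(bucket, L0=0.5, t=0.05)
print([c.name for c in noised])     # only the memorized example gets noised out at this t
print([c.name for c in surviving])
print(L_t, lam_t)
```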
This recipe can be reversed to extract the circuit efficiency spectrum from empirical measurements of the tempering process.
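And here is a minimal sketch of the empirical side: sampling the loss of tempered models with (full-batch) Langevin dynamics on a 1-d toy potential whose learning coefficient is known analytically to be 1/4. The potential, step size, chain length, and the value of the n·β knob are all illustrative assumptions; real measurements run SGLD on minibatch losses of actual networks, with the sampling diagnostics and best practices alluded to above.

```python
# Minimal sketch: estimating a learning coefficient by sampling the loss of
# tempered models, on a 1-d toy potential whose learning coefficient is known
# to be 1/4. All numerical choices below are illustrative, not recommendations.

import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    return w**4           # toy "population loss"; minimum 0 at w = 0

def grad_loss(w):
    return 4.0 * w**3

def langevin_loss_trace(nb, steps=200_000, eps=1e-3, w0=0.0):
    """Full-batch Langevin dynamics targeting the tempered density exp(-nb * loss(w))."""
    w = w0
    trace = np.empty(steps)
    for i in range(steps):
        w += -0.5 * eps * nb * grad_loss(w) + np.sqrt(eps) * rng.standard_normal()
        trace[i] = loss(w)
    return trace

nb = 100.0                          # plays the role of n * beta, i.e. an inverse temperature
trace = langevin_loss_trace(nb)
burn = len(trace) // 10             # discard an initial burn-in segment
lam_hat = nb * trace[burn:].mean()  # n*beta * (E[loss] - loss at the optimum); optimal loss is 0 here
print(f"estimated learning coefficient ~ {lam_hat:.2f} (analytic value 0.25)")
```

Repeating this kind of measurement across a range of temperatures, and watching which capabilities drop out as the loss budget grows, is the measurement that runs the recipe above in reverse.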
- ^
Roughly: tempering means we ask that the loss not be much worse than its minimum, to a precision of about t, and the learning coefficient measures the variance. If improving the loss is very entropically expensive, then tempered NNs will be “very resistant” to bringing the loss much below the maximal allowed value, and this variance will be small. Note that for the conceptual cartoons I’m blurring out the difference between so-called “microcanonical” and “canonical” quantities, and real tempering has “soft” exponential cutoffs rather than exact “loss bounded by this value”-style effects.
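To spell that cartoon out in one back-of-envelope formula (my own gloss, using the hard-cutoff picture just flagged as an approximation): if the volume of weights within loss gap $\epsilon$ of the optimum scales as $\epsilon^\lambda$, then sampling uniformly from the allowed window $0 \le \epsilon \le t$ gives

```latex
p(\epsilon) = \frac{\lambda\,\epsilon^{\lambda-1}}{t^{\lambda}} \ \text{on}\ [0,t],
\qquad
\mathbb{E}[\epsilon] = \frac{\lambda}{\lambda+1}\,t,
\qquad
\operatorname{Var}(\epsilon) = \frac{\lambda\,t^{2}}{(\lambda+2)(\lambda+1)^{2}} \approx \frac{t^{2}}{\lambda^{2}} \ \text{for large}\ \lambda,
```

so a large learning coefficient (loss improvements are entropically expensive) pins the loss near the cutoff and makes its fluctuations small.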