The Geometry of LLM Logits (an analytical outer bound)
1 Preliminaries
| Symbol | Meaning |
| --- | --- |
| $d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
| $h_\ell \in \mathbb{R}^d$ | residual-stream vector after block $\ell$ ($h_0$ = embeddings) |
| $\Delta_\ell \in \mathbb{R}^d$ | the update written by block $\ell$ |
| $W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |
**Additive residual stream.** With (pre-/peri-norm) residual connections,

$$h_\ell = h_{\ell-1} + \Delta_\ell, \qquad \ell = 1, \dots, L.$$

Hence the final pre-logit state is the sum of $L+1$ contributions (block 0 = token + positional embeddings, written $\Delta_0 := h_0$):

$$h_L = \sum_{\ell=0}^{L} \Delta_\ell.$$
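For readers who like to see the bookkeeping, here is a minimal numerical sketch of the additive decomposition; the toy blocks (tanh in place of GELU, random weights, arbitrary shapes) are illustrative only, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 8, 4  # toy residual width and block count

def layer_norm(h, gamma, beta, eps=1e-5):
    mu, var = h.mean(), h.var()
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

# Toy blocks: each reads a LayerNormed copy of the stream and
# writes an update back into it (pre-norm residual connection).
params = [
    (rng.normal(size=(d, d)) / np.sqrt(d),   # W_in
     rng.normal(size=(d, d)) / np.sqrt(d),   # W_out
     rng.normal(size=d),                     # gamma
     rng.normal(size=d))                     # beta
    for _ in range(L)
]

def block_update(h, W_in, W_out, gamma, beta):
    a = np.tanh(layer_norm(h, gamma, beta) @ W_in)  # Lipschitz non-linearity
    return a @ W_out

h = rng.normal(size=d)            # Delta_0: token + positional embedding
deltas = [h.copy()]
for p in params:
    delta = block_update(h, *p)   # Delta_ell
    h = h + delta                 # h_ell = h_{ell-1} + Delta_ell
    deltas.append(delta)

# The final pre-logit state equals the sum of all L+1 contributions.
assert np.allclose(h, np.sum(deltas, axis=0))
```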
2 Each update is contained in an ellipsoid
**Why a bound exists.** Every sub-module (attention head or MLP)

- reads a LayerNormed copy of its input, so $\|\tilde h\|_2 \le \sqrt{d}\,\|\gamma_\ell\|_\infty + \|\beta_\ell\|_2$, where $\tilde h = \mathrm{LN}_\ell(h)$ and $\gamma_\ell$ is that block's learned scale (and $\beta_\ell$ its learned shift);
- applies linear maps, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $\mathbb{R}^d$.
Because the composition of linear maps and Lipschitz functions is itself Lipschitz, and the LayerNormed input ranges over a bounded set, there exists a constant $\kappa_\ell < \infty$ such that

$$\|\Delta_\ell\|_2 \le \kappa_\ell \quad \text{for every input and position.}$$

Define the centred ellipsoid (here a ball, the special case of an ellipsoid with equal semi-axes)

$$\mathcal{E}_\ell := \{\, x \in \mathbb{R}^d : \|x\|_2 \le \kappa_\ell \,\}.$$

Then every realisable update lies inside that ellipsoid:

$$\Delta_\ell \in \mathcal{E}_\ell.$$
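One way to obtain a concrete (if loose) $\kappa_\ell$ is to chain operator norms: assuming the block computes $W^{\text{out}}\,\phi(W^{\text{in}}\,\mathrm{LN}(h) + b^{\text{in}}) + b^{\text{out}}$ with a pointwise $L_\phi$-Lipschitz activation fixing $0$, then $\kappa_\ell \le \|W^{\text{out}}\|_2\, L_\phi\, (\|W^{\text{in}}\|_2\, r_\ell + \|b^{\text{in}}\|_2) + \|b^{\text{out}}\|_2$, with $r_\ell$ the LayerNorm bound above. A numpy sketch under those assumptions (the MLP form and GELU's Lipschitz constant ≈ 1.13 are assumptions, not from the post):

```python
import numpy as np

def spectral_norm(W):
    """Largest singular value = operator 2-norm."""
    return np.linalg.norm(W, 2)

def mlp_update_bound(W_in, b_in, W_out, b_out, gamma, beta, lip_act=1.13):
    """Upper bound kappa_ell on ||Delta_ell||_2 for one MLP block.

    Assumes the block computes W_out @ act(W_in @ LN(h) + b_in) + b_out,
    with a pointwise lip_act-Lipschitz activation that fixes 0
    (1.13 is roughly GELU's Lipschitz constant).
    """
    d = gamma.shape[0]
    r = np.sqrt(d) * np.abs(gamma).max() + np.linalg.norm(beta)  # ||LN(h)|| bound
    pre_act = spectral_norm(W_in) * r + np.linalg.norm(b_in)     # pre-activation bound
    return spectral_norm(W_out) * lip_act * pre_act + np.linalg.norm(b_out)

# Toy example with random weights at GPT-2-small shapes (d=768, 4d hidden):
rng = np.random.default_rng(0)
d = 768
W_in = rng.normal(size=(4 * d, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, 4 * d)) / np.sqrt(4 * d)
kappa = mlp_update_bound(W_in, np.zeros(4 * d), W_out, np.zeros(d),
                         gamma=np.ones(d), beta=np.zeros(d))
print(f"kappa_ell <= {kappa:.1f}")
```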
3 Residual stream ⊆ Minkowski sum of ellipsoids
Using additivity and Step 2,

$$h_L = \sum_{\ell=0}^{L} \Delta_\ell \in \mathcal{S} := \mathcal{E}_0 \oplus \mathcal{E}_1 \oplus \cdots \oplus \mathcal{E}_L,$$

where $\mathcal{S}$ is the Minkowski sum $\{x_0 + x_1 + \cdots + x_L : x_\ell \in \mathcal{E}_\ell\}$ of the individual ellipsoids. (Step 2 covers blocks $1, \dots, L$; the embedding contribution $\Delta_0$ ranges over a finite set of vectors, one per token/position pair, so it too lies in some ball $\mathcal{E}_0$.)
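The Minkowski sum never needs to be materialised: its support function is the sum of the summands' support functions, which is what makes the outer set computable. A sketch with the Step-2 balls (the radii $\kappa_\ell$ below are made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
kappas = np.array([3.0, 1.5, 2.2, 0.7])  # per-block bounds from Step 2

def support_ball(u, kappa):
    """Support function h(u) = max_{||x|| <= kappa} <x, u> = kappa * ||u||."""
    return kappa * np.linalg.norm(u)

def sample_ball(kappa):
    """Uniform sample from the radius-kappa ball (direction times radius)."""
    x = rng.normal(size=d)
    return kappa * rng.uniform() ** (1 / d) * x / np.linalg.norm(x)

# Support function of a Minkowski sum = sum of support functions.
u = rng.normal(size=d)
h_sum = sum(support_ball(u, k) for k in kappas)

# Any sum of one point per ball must respect it.
for _ in range(1000):
    x = sum(sample_ball(k) for k in kappas)
    assert x @ u <= h_sum + 1e-9
```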
4 Logit space is an affine image of that sum
Logits are produced by the affine map $z = W_U h_L + b$. For any sets $A, B \subseteq \mathbb{R}^d$,

$$W_U (A \oplus B) = W_U A \oplus W_U B.$$

Hence

$$z \in b + W_U \mathcal{S} = b + \bigl( W_U \mathcal{E}_0 \oplus W_U \mathcal{E}_1 \oplus \cdots \oplus W_U \mathcal{E}_L \bigr).$$

Because linear images of ellipsoids are ellipsoids, each $W_U \mathcal{E}_\ell$ is still an ellipsoid (a possibly degenerate one in $\mathbb{R}^V$, since $\operatorname{rank} W_U \le d < V$).
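Concretely, the image of the radius-$\kappa_\ell$ ball under $W_U$ is the ellipsoid whose semi-axes point along the left singular vectors of $W_U$ with lengths $\kappa_\ell\,\sigma_i(W_U)$. A toy verification (shapes and $\kappa$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
V, d = 50, 16          # toy vocabulary and residual width
W_U = rng.normal(size=(V, d))
kappa = 2.0            # Step-2 bound for this block

# Semi-axes of the ellipsoid W_U * {||x|| <= kappa}:
U, s, _ = np.linalg.svd(W_U, full_matrices=False)
semi_axes = kappa * s            # lengths, along the columns of U

# Empirical check: images of random ball points satisfy the
# ellipsoid inequality sum_i (<y, u_i> / (kappa * s_i))^2 <= 1.
for _ in range(1000):
    x = rng.normal(size=d)
    x *= kappa * rng.uniform() ** (1 / d) / np.linalg.norm(x)
    y = W_U @ x
    coords = U.T @ y             # coordinates along the axes
    assert np.sum((coords / semi_axes) ** 2) <= 1 + 1e-9
```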
5 Ellipsotopes
An **ellipsotope** is an affine shift of a finite Minkowski sum of ellipsoids. The set

$$\mathcal{T} := b + W_U \mathcal{E}_0 \oplus W_U \mathcal{E}_1 \oplus \cdots \oplus W_U \mathcal{E}_L$$

is therefore an ellipsotope.
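With ball-shaped $\mathcal{E}_\ell$, this ellipsotope has the closed-form support function $h_{\mathcal{T}}(u) = \langle b, u\rangle + \bigl(\sum_\ell \kappa_\ell\bigr)\,\|W_U^\top u\|_2$, and exhibiting any direction $u$ with $\langle z, u\rangle > h_{\mathcal{T}}(u)$ certifies that a logit vector $z$ falls outside $\mathcal{T}$. A sketch of that one-sided test (the random-direction search is a heuristic of ours, not part of the post):

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 50, 16
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)
kappas = np.array([3.0, 1.5, 2.2, 0.7])   # Step-2 bounds, incl. embeddings

def support_T(u):
    """Support function of T = b + sum_l W_U * ball(kappa_l)."""
    return b @ u + kappas.sum() * np.linalg.norm(W_U.T @ u)

def certify_outside(z, n_dirs=2000):
    """Return a direction u with <z,u> > h_T(u) if one is found (z is
    provably outside T); else None (test inconclusive)."""
    for _ in range(n_dirs):
        u = rng.normal(size=V)
        if z @ u > support_T(u) + 1e-9:
            return u
    return None

# A point far outside the ellipsotope is easily certified:
z_far = b + 10 * kappas.sum() * np.linalg.norm(W_U, 2) * np.ones(V) / np.sqrt(V)
print(certify_outside(z_far) is not None)   # True
```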
6 Main result (outer bound)
**Theorem.** For any pre-norm or peri-norm Transformer language model whose blocks receive LayerNormed inputs, the set $\mathcal{Z}$ of all logit vectors attainable over every prompt and position satisfies

$$\mathcal{Z} \subseteq \mathcal{T},$$

where $\mathcal{T}$ is the ellipsotope defined above.

*Proof.* Containments in Steps 2–4 compose to give the stated inclusion; Step 5 shows the outer set is an ellipsotope. ∎
7 Remarks & implications
**It is an outer approximation.** Equality would require showing that every point of the ellipsotope can actually be realised by some token context, which the argument does not provide.
**Geometry-aware compression and safety.** Because $\mathcal{T}$ is convex and centrally symmetric about $b$, one can fit a minimum-volume outer ellipsoid to it, yielding tight norm-based regularisers or robustness certificates against weight noise / quantisation.
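As an illustration of what such a fit can look like (this particular construction is standard ellipsoidal calculus, not something the post specifies): a Minkowski sum of centred ellipsoids with shape matrices $P_\ell$ lies inside the centred ellipsoid with shape matrix $(\sum_\ell c_\ell)\sum_\ell P_\ell / c_\ell$, $c_\ell = \sqrt{\operatorname{tr} P_\ell}$, which is trace-optimal among bounds of this family:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 6, 3

def random_shape_matrix(d):
    """Random SPD shape matrix P; the ellipsoid is {x : x^T P^-1 x <= 1}."""
    A = rng.normal(size=(d, d))
    return A @ A.T + 0.1 * np.eye(d)

Ps = [random_shape_matrix(d) for _ in range(n)]

# Trace-optimal outer ellipsoid of the Minkowski sum:
# P_outer = (sum_l c_l) * sum_l P_l / c_l  with  c_l = sqrt(tr P_l).
cs = [np.sqrt(np.trace(P)) for P in Ps]
P_outer = sum(cs) * sum(P / c for P, c in zip(Ps, cs))

# Check via support functions: h_sum(u) = sum_l sqrt(u^T P_l u)
# must never exceed h_outer(u) = sqrt(u^T P_outer u).
for _ in range(1000):
    u = rng.normal(size=d)
    h_sum = sum(np.sqrt(u @ P @ u) for P in Ps)
    assert h_sum <= np.sqrt(u @ P_outer @ u) + 1e-9
```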
**Layer-wise attribution.** The individual sets $W_U \mathcal{E}_\ell$ bound how much any single layer can move the logits, complementing "logit-lens"-style analyses.
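For instance, with the Step-2 balls, layer $\ell$ can move token $i$'s logit by at most $\kappa_\ell\,\|(W_U)_{i,:}\|_2$, so a full layer-by-token influence table costs one pass over the rows of $W_U$ (the numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(5)
V, d = 50, 16
W_U = rng.normal(size=(V, d))
kappas = np.array([3.0, 1.5, 2.2, 0.7])   # Step-2 bounds per layer

# max_{||Delta|| <= kappa_l} |<(W_U)_i, Delta>| = kappa_l * ||(W_U)_i||_2
row_norms = np.linalg.norm(W_U, axis=1)    # shape (V,)
influence = np.outer(kappas, row_norms)    # (layers, tokens)

layer, token = np.unravel_index(influence.argmax(), influence.shape)
print(f"layer {layer} can move token {token}'s logit by at most "
      f"{influence[layer, token]:.2f}")
```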
**Assumptions.** LayerNorm guarantees each block's input is bounded; Lipschitz (but not necessarily bounded) activations (GELU, SiLU) then give finite $\kappa_\ell$. Architectures without such norm control would require separate analysis.