What's the Right Way to think about Information Theoretic quantities in Neural Networks?

The article discusses the challenges that come up when applying information theory to neural networks. Because neural networks are deterministic, and sometimes even reversible, Shannon information measures degenerate, so familiar quantities like mutual information are hard to apply directly. The article argues that the existing fixes are either too ad hoc or rest on questionable assumptions and lack a clear operational interpretation. It also discusses alternatives such as stochastic weights, V-information, and K-complexity, each of which has problems of its own. The author would like a way to keep using plain Shannon information measures to analyze information flow in neural networks.

🤔 Determinism degenerates Shannon measures: the deterministic map a neural network implements makes quantities like mutual information blow up to infinity, which clashes with the intuitive notion of shared information.

🧮 Quantization and binning are flawed: quantization can turn a deterministic function into a stochastic one, but the result depends on the binning scheme, conflates information-theoretic and geometric notions, and is unstable.

⚖️ Effective information via weight perturbation: one line of work defines "robustly" shared information by perturbing the weights, but it depends on the particular perturbation and optimization objective, which feels somewhat ad hoc.

🧐 V-information and K-complexity: V-information extends Shannon measures by restricting the function class, but choosing an appropriate class remains an open question; K-complexity runs into uncomputability and fits poorly with the statistical character of neural networks.

Published on January 19, 2025 8:04 AM GMT

Tl;dr: neural networks are deterministic and sometimes even reversible, which causes Shannon information measures to degenerate. But information theory still seems useful. How can we square this (if it's possible at all)? The attempts in the literature so far are unsatisfying.


Here is a conceptual question: what is the Right Way to think about information theoretic quantities in neural network contexts?

Example: I've recently been thinking about information bottleneck methods: given some data distribution p(X, Y), they try to find features Z, specified by an encoder p(Z|X), that have nice properties like minimality (small I(X;Z)) and sufficiency (big I(Z;Y)).
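For concreteness, the textbook form of this trade-off (standard information bottleneck notation, which may differ from the notation of the specific papers alluded to here) is the Lagrangian

    min over p(z|x) of   I(X;Z) - beta * I(Z;Y),

where beta >= 0 sets how much sufficiency you are willing to trade for minimality.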

But as pointed out in the literature several times, the fact that neural networks implement a deterministic map makes these information theoretic quantities degenerate: if X is continuous and Z = f(X) is deterministic, I(X;Z) is typically infinite, and in the discrete case it just collapses to H(Z), so "minimality" stops tracking anything interesting about the representation.
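As a quick illustration of both the degeneracy and the binning problem mentioned in the summary, here is a toy numpy sketch (my own construction, not taken from any of the referenced papers): it pushes Gaussian inputs through a fixed deterministic tanh "layer" and estimates I(X;Z) by histogram binning. The estimate keeps growing as the bins get finer instead of converging to anything meaningful.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)          # continuous input X
z = np.tanh(2.0 * x + 0.5)      # deterministic toy "network", Z = f(X)

def binned_mutual_information(x, z, bins):
    # Plug-in MI estimate (in bits) from a 2D histogram with `bins` bins per axis.
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    pxz = joint / joint.sum()
    px = pxz.sum(axis=1, keepdims=True)
    pz = pxz.sum(axis=0, keepdims=True)
    mask = pxz > 0
    return float(np.sum(pxz[mask] * np.log2(pxz[mask] / (px @ pz)[mask])))

for bins in [4, 16, 64, 256, 1024]:
    print(bins, binned_mutual_information(x, z, bins))

# The printed estimate grows roughly like log2(bins): with a deterministic map,
# the binned "I(X;Z)" mostly measures the resolution of the discretization.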

There are attempts at solving these problems in the literature, but the solutions so far are unsatisfying: they're either very ad hoc, rely on questionable assumptions, lack a clear operational interpretation, introduce new problems, or seem theoretically intractable.

Treat the weights as stochastic:

This paper (also relevant) defines several notions of information measure relative to an arbitrary choice of distribution over the weights (not a Bayesian posterior).

I like their idea of using Shannon information measures to try to capture a notion of "robustly" shared information, but the attempts above seem pretty ad hoc and reliant on shaky assumptions. I suspect SLT would be helpful here (just reading the paper, you see things like casually inverting the Fisher information matrix).
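To make the generic "stochastic weights" move concrete, here is the same toy setup with Gaussian noise injected into the weight (again my own illustration with a made-up noise scale, not the specific construction of the papers above): Z becomes a genuinely stochastic function of X, so the binned estimate of I(X;Z) roughly stabilizes instead of diverging, but the value it stabilizes at is controlled by the arbitrary choice of noise scale.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)

w, sigma = 2.0, 0.5                        # mean weight and weight-noise scale (arbitrary choices)
w_noisy = w + sigma * rng.normal(size=n)   # fresh weight sample per input: "stochastic weights"
z = np.tanh(w_noisy * x + 0.5)

def binned_mutual_information(x, z, bins):
    # Same plug-in MI estimator (in bits) as in the earlier sketch.
    joint, _, _ = np.histogram2d(x, z, bins=bins)
    pxz = joint / joint.sum()
    px = pxz.sum(axis=1, keepdims=True)
    pz = pxz.sum(axis=0, keepdims=True)
    mask = pxz > 0
    return float(np.sum(pxz[mask] * np.log2(pxz[mask] / (px @ pz)[mask])))

for bins in [16, 64, 256]:
    print(bins, binned_mutual_information(x, z, bins))

# Unlike the deterministic case, the estimate roughly levels off as the bins get finer
# (modulo finite-sample bias): weight noise makes I(X;Z) finite, but its value now
# depends on sigma, i.e. on how we chose to perturb the weights.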

Use something other than Shannon information measures:

There's V-information, which is a natural extension of Shannon information measures when you restrict the class of functions you're allowed to use (due to, e.g., computational constraints). But now the difficult question is the choice of a natural function class. Maybe linear probes are a natural choice, but this still feels ad hoc.
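For what it's worth, here is a minimal sketch of what V-information with linear probes looks like operationally (my own illustration on synthetic data; every variable name is made up): it is the drop in held-out cross-entropy when a logistic-regression probe is allowed to look at the features, relative to the best feature-blind (constant) predictor.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

def v_information_linear(features, labels, seed=0):
    # Empirical V-information I_V(features -> labels) in nats, with V = linear (logistic) probes:
    # H_V(Y) - H_V(Y|X), i.e. held-out cross-entropy of the best constant predictor
    # minus held-out cross-entropy of a fitted linear probe.
    x_tr, x_te, y_tr, y_te = train_test_split(features, labels, test_size=0.5, random_state=seed)
    classes = np.unique(labels)

    # H_V(Y): the best predictor that ignores the features, i.e. the training-label frequencies.
    freqs = np.array([(y_tr == c).mean() for c in classes])
    h_y = log_loss(y_te, np.tile(freqs, (len(y_te), 1)), labels=classes)

    # H_V(Y|X): held-out cross-entropy of a linear probe on the features.
    probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    h_y_given_x = log_loss(y_te, probe.predict_proba(x_te), labels=classes)

    return h_y - h_y_given_x

# Toy usage: in practice `activations` would be intermediate-layer activations of a network.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
activations = rng.normal(size=(2000, 16)) + 0.8 * labels[:, None]   # synthetic, label-correlated
print(v_information_linear(activations, labels))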

There's K-complexity, but there's the usual uncomputability, plus the apparent intractability of mixing algorithmic information theory with neural networks, which have more of a statistical character than an algorithmic one. Admittedly this is mostly vibes, but I am wary of jumping to the conclusion that AIT is necessary for information-theoretically analyzing neural networks just on the strength of a "there's determinism, and AIT is the natural playing field for deterministic information-processing systems"-type argument.


Ideally, I could keep using the vanilla Shannon information measures somehow, because they're nice, simple, and computable, and seem potentially tractable both empirically and theoretically.

And so far, I haven't been able to find a satisfying answer to the problem. I am curious if anyone has takes on this issue.


