Crossposted from my Substack.
Intuitively, simpler theories are better, all else being equal. It also seems like finding a way to justify assigning higher prior probability to simpler theories is one of the more promising approaches to the problem of induction. In some places, Solomonoff induction (SI) seems to be considered the ideal way of encoding a bias towards simplicity. (Recall: under SI, hypotheses are computable functions that spit out observations. Hypothesis h gets prior probability proportional to 2^(−K(h; L)), where K(h; L) is the hypothesis’ Kolmogorov complexity in language L.)
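As a warm-up, here is a minimal sketch of the kind of length-based prior SI assigns. Since true Kolmogorov complexity is uncomputable, the sketch uses description length in a fixed toy language as a computable stand-in; the alphabet, length cutoff, and function name are my own illustrative choices, not part of SI proper.

```python
from itertools import product

# A toy, computable stand-in for the SI prior (illustrative only: true
# Kolmogorov complexity is uncomputable). We fix a tiny "language" with a
# two-symbol alphabet and use program *length* as the complexity measure,
# giving each program weight 2^(-length).

ALPHABET = "01"
MAX_LEN = 10  # truncate the (infinite) space of programs for the demo

def length_prior(max_len=MAX_LEN):
    """Normalized prior over program strings, with weight 2^(-len(p))."""
    weights = {}
    for n in range(1, max_len + 1):
        for prog in product(ALPHABET, repeat=n):
            weights["".join(prog)] = 2.0 ** (-n)
    z = sum(weights.values())
    return {p: w / z for p, w in weights.items()}

prior = length_prior()
print(prior["0"], prior["0101010101"])  # short programs get far more mass
```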
But I find SI pretty unsatisfying on its own, and think there might be a better approach (not original to me) to getting a bias towards simpler hypotheses in a Bayesian framework.
Simplicity via hierarchical Bayes
- I’m not sure to what extent we need to directly bake in a bias towards simpler hypotheses in order to reproduce our usual inductive inferences or to capture the intuition that simpler theories tend to be better. Maybe we could at least get a long way with a hierarchically-structured prior, where:
- At the highest level, different theories T specify fundamental ontologies. For example, maybe the fundamental ontology of Ptolemaic astronomy was something like “The Earth is at the center of the universe, and all other bodies move along circles”.
- Each theory T contains many specific, disjoint hypotheses, corresponding to particular “parameter values” for the properties of the fundamental objects. For example, Ptolemaic astronomy as a high-level theory allows for many different planetary orbits.
- More complicated theories are those that contain many specific hypotheses. Complicated theories must spread out prior mass over more hypotheses, and if prior mass is spread evenly over the high-level theories, any individual hypothesis in a complicated theory will get lower prior mass than individual hypotheses contained in simpler theories. I.e.:
- Let h1, h2 be hypotheses in T1, T2 respectively.
- Suppose T1 is simpler than T2. Then, generally, we will have P(h1 | T1) > P(h2 | T2), because T2 has to spread out prior mass more thinly than T1.
- If P(T1) = P(T2), then P(h1) = P(h1 | T1)P(T1) > P(h2 | T2)P(T2) = P(h2). (The first and last equalities hold because each hypothesis is contained in exactly one theory.) The numerical sketch below makes this concrete.
- A natural worry is that it’s unclear how to individuate the high-level theories. Intuitively, this doesn’t bother me a huge amount: even if it ends up being underdetermined how to do this, my guess is that reasonable ways of individuating high-level theories will still constrain our inferences a lot. But maybe not; I haven’t thought about it much.
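To make the arithmetic above concrete, here is a minimal numerical sketch (the theory names and hypothesis counts are hypothetical): a uniform prior over two high-level theories, then a uniform prior over each theory’s specific hypotheses. Hypotheses in the smaller theory automatically end up with more prior mass.

```python
# A minimal sketch (all names and numbers are illustrative) of the
# hierarchical prior described above: uniform mass over high-level
# theories, then uniform mass over each theory's specific hypotheses.

def hypothesis_prior(num_hypotheses_per_theory):
    """Return P(h) for a single hypothesis in each theory, given a uniform
    prior over theories and over hypotheses within a theory."""
    num_theories = len(num_hypotheses_per_theory)
    p_theory = 1.0 / num_theories                  # P(T), uniform
    return {
        name: p_theory * (1.0 / n)                 # P(h) = P(h | T) P(T)
        for name, n in num_hypotheses_per_theory.items()
    }

# T1 is "simpler": it contains 10 specific hypotheses (parameter settings);
# T2 is more complicated, with 1000.
priors = hypothesis_prior({"T1": 10, "T2": 1000})
print(priors)  # {'T1': 0.05, 'T2': 0.0005}: h1 gets 100x the mass of h2
```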
Syntax vs. ontology
- SI assigns prior probabilities according to the syntax (in an arbitrary language) used to specify a theory. Setting aside the other problems for SI (e.g., see this post), I think this is pretty unsatisfactory as an attempt to capture our intuitive preference for simplicity, for a few reasons:
- First of all, I’d like to avoid just specifying by fiat that simpler hypotheses get higher prior probability and instead have this be a consequence of more solid principles. I think the principle of indifference is solid, if we can find a privileged parameterization of the hypothesis space to which we can apply the principle. The approach sketched above is attractive to me in this respect: We can try to apply a principle of indifference at the level of fundamental ontological commitments, which has the consequence that hypotheses contained in more complex theories get lower prior mass.
- So I would say: if you do want to directly assign prior probabilities to hypotheses according to their simplicity, you should start by looking at what the hypothesis actually says about the world and figuring out how to measure the simplicity of that.
- One might object that a hypothesis’ ontology is much harder to extract and formalize than its syntax. My reply: this sounds like the streetlight effect. The reason that SI has a nice formalism is that it only looks at an easily-extracted property of a hypothesis (its syntax), and doesn’t attempt to extract the thing we should directly care about: what the hypothesis actually says about the world.
- Moreover, thinking in ontological terms may help make progress on one of the (IMO) serious problems for SI: the apparently arbitrary choice of language. Perhaps, for example, we will in the end decide that the best we can do is SI using a language that makes it easy to specify a hypothesis in terms of its ontology.
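To illustrate the language-dependence worry, here is a toy example of my own construction (the “languages” and description strings are made up): the same hypothesis receives very different 2^(−length) weights depending on which language we happen to describe it in.

```python
# A toy illustration (my construction, not from the post) of the
# language-dependence of syntax-based priors: the same hypothesis gets a
# very different 2^(-description length) weight depending on which
# "language" we happen to describe it in.

def length_weight(description: str) -> float:
    """Unnormalized length-based prior weight for a description string."""
    return 2.0 ** (-len(description))

# Hypothetical descriptions of the *same* hypothesis in two languages.
# Language A has a primitive for it; language B must spell it out.
desc_in_A = "epicycles(5)"
desc_in_B = "circle(circle(circle(circle(circle(orbit)))))"

print(length_weight(desc_in_A))  # 2^-12
print(length_weight(desc_in_B))  # 2^-45: same hypothesis, far less mass
```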