Published on August 6, 2025 12:01 AM GMT
(This article is edited and expanded from a comment I made to someone in the newly started BAIF Slack community. Thanks for the inspiration 🙏)
Introduction
In this article I present an alternative paradigm for Mechanistic Interpretability (MI). This paradigm may turn out to be better or worse than, or to combine naturally with, the standard paradigm I often see implicitly extended from Chris Olah's "Zoom-In".
I've talked about this concept before, in various places. Someday I may collect those discussions and try to present a stronger case, including a survey of paradigms in the MI literature. For now, here is a relatively short introduction to the concept, assuming some familiarity with ML and MI.
As far as I know, Chris Olah originally introduced the concepts of "features" and "circuits" in "Zoom-In" as a suggestion for a direction of exploration, not as a certainty. It worked very well for thinking about things like "circle" and "texture" detectors, which I think is a natural, but ultimately incorrect, way of understanding what is going on.
New Mechanistic Interpretability Paradigm?
I have been developing an alternative paradigm that, as far as I can tell, no one else is currently talking about.
It is now common to think of the collective inputs or outputs of a network layer as vectors rather than as individual signals. The idea I am not sure anyone is focusing on is that each such vector is a point in a semantic space, and that distributions live in that space.
Input Space
For example, in a cat-dog-labeling net, the input space is the space of images, and there are two distributions living in this space. The cat-distribution is all possible images of cats. We can make some claims about that distribution, such as that it is continuous and connected. The same is true of the dog-distribution, but additionally, the dog-distribution may be connected to the cat-distribution in several places: regions containing images that are ambiguous, maybe a cat, maybe a dog. There is also, implicitly, a distribution of images that are neither dogs nor cats, but this can be ignored in simple examples.
Output Space
The output space has very different semantics. It is meant to label images as either cat or dog, so it may be a 1-dim space where being near (1,) corresponds to "highly cat-like", being near (0,) corresponds to "highly dog-like", and anything in between could be "ambiguous" or "neither cat nor dog". If there is also training for "neither cat nor dog", the space might be 2-dim, with the point (1,0) meaning cat-like and (0,1) meaning dog-like. Then "neither" would be (0,0) and "ambiguous" would be (0.5, 0.5). This kind of semantic space feels intuitive to me: if you gave me a set of pictures to pin to a corkboard, I feel like I could produce a fair approximation of this arrangement by hand.
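To make those two encodings concrete, here is a minimal sketch (my own illustration, not an actual training setup from anywhere) of the target locations described above:

```python
# 1-dim output space: a single coordinate on a cat<->dog axis.
targets_1d = {
    "cat": (1.0,),        # highly cat-like
    "dog": (0.0,),        # highly dog-like
    "ambiguous": (0.5,),  # in between; "neither" would share this region
}

# 2-dim output space: one coordinate per class, so "neither" and "ambiguous"
# become distinct locations instead of sharing the middle of one axis.
targets_2d = {
    "cat": (1.0, 0.0),
    "dog": (0.0, 1.0),
    "neither": (0.0, 0.0),
    "ambiguous": (0.5, 0.5),
}

print(targets_1d["ambiguous"], targets_2d["ambiguous"])
```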
Details of Semantics
It seems noteworthy that the input and output semantic spaces as actually understood by the network may differ from these intuitive descriptions, which are based on the training dataset, the distributions it could imply, and the dynamics of training. For example, if training puts no constraint on mapping inputs to locations like (10,0) or (-1,0) in the output space, then what would those locations mean? Would (10,0) correspond to something like "10 times as cat-like"? My intuition is that there would instead be a messy distribution extending out in the (1,0) direction, and that this distribution would be determined by the shape of the input distribution, the network architecture, and the training dynamics. In other words, it would be the result of whatever was the easiest way to separate the parts of the input distribution that need to be separated. The same applies to the (-1,0) direction: I do not expect it to have semantics meaning "the opposite of cat-like". There may be something similar, especially if unsupervised rather than label-based methods are used, but it would still have to do with the semantic distribution, not with semantic directions.
Latent Semantic Spaces
Each of the latent spaces of the network could be understood as some step between the semantics of these two spaces. That might be an oversimplification; for example, there may be movement into a seemingly unrelated semantic space, either to "untangle" distributions or for some other reason. Even if it is not an oversimplification, there is a lot to be understood about what it means to step from the semantics of image-space to the semantics of label-space.
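As a hedged sketch of how one might actually look at these latent spaces (the tiny network and the hook names here are hypothetical, purely for illustration), one can capture each layer's output and see where a batch of inputs lands in it:

```python
import torch
from torch import nn

# A hypothetical small feed-forward net; each ReLU output is one latent space.
net = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1),
)

latents = {}

def capture(name):
    # Forward hooks receive (module, inputs, output); we store the output.
    def hook(module, inputs, output):
        latents[name] = output.detach()
    return hook

# Register a forward hook on every ReLU so we record each latent space.
for i, layer in enumerate(net):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(capture(f"latent_{i}"))

x = torch.randn(100, 2)  # stand-in for samples from the input distribution
net(x)

for name, z in latents.items():
    print(name, tuple(z.shape),
          "fraction of zeroed coordinates:", (z == 0).float().mean().item())
```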
Semantic Mappings
With this paradigm, the network is not understood in terms of neuron connections or circuits at all, but instead as a sequence of "semantic mappings". Each layer of the network is an affine transformation (rotating, sliding, scaling, and skewing) that prepares the space for the activation, which "folds", "bends", "collapses", or "squashes" the parts of the space that have been moved into negative coordinates (the negative orthants, an orthant being the n-dimensional generalization of a quadrant). This also squashes any distributions that existed in that part of the space: any inputs sampled from the distribution of possible inputs get mapped to the corresponding locations in the new, squashed distribution. The result is a new, slightly transformed semantic space as the output of the layer.
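Here is a minimal NumPy sketch of that picture (the toy point cloud and the specific numbers are my own, purely illustrative): one affine map followed by a ReLU, and a look at how much of a sampled "distribution" gets squashed.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "semantic space": a 2-D cloud of points standing in for a distribution
# of inputs (e.g. representations of cat-ish and dog-ish images).
points = rng.normal(loc=[1.0, -0.5], scale=1.0, size=(1000, 2))

# One layer = affine transformation (rotate/slide/scale/skew) ...
theta = np.pi / 6
W = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]]) * 1.5   # rotate and scale
b = np.array([0.2, -0.3])                               # slide
moved = points @ W.T + b

# ... followed by an activation that squashes whatever was moved into
# negative coordinates. With ReLU, every negative coordinate collapses to
# zero, so regions of the old space get folded onto the axes or the origin.
squashed = np.maximum(moved, 0.0)

print("fraction of points with at least one coordinate squashed:",
      np.mean((moved < 0).any(axis=1)))
print("fraction collapsed all the way to the origin:",
      np.mean((squashed == 0).all(axis=1)))
```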
The goal given to the network during its training is to find the semantic mappings that transform the input semantic space (defined by the input dataset) into the output semantic space (defined by the dataset's labels). What I think we have found, empirically, with neural networks is that they are up to this task: through a sequence of squashings, they can transform semantic spaces into very different-looking semantic spaces.
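As a toy illustration of that claim (my own setup, not anything from "Zoom-In"): a small stack of affine-plus-squash layers trained to carry two blobs in a 2-D input semantic space onto opposite ends of a 1-D label space.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two toy input "distributions" standing in for cat-images and dog-images:
# Gaussian blobs in a 2-D input semantic space. (Purely illustrative.)
cats = torch.randn(500, 2) + torch.tensor([2.0, 0.0])
dogs = torch.randn(500, 2) + torch.tensor([-2.0, 0.0])
x = torch.cat([cats, dogs])
y = torch.cat([torch.ones(500), torch.zeros(500)])  # 1 = cat-like, 0 = dog-like

# A small stack of affine-then-squash layers, ending in a 1-dim output space.
net = nn.Sequential(
    nn.Linear(2, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
loss_fn = nn.BCELoss()

# Training searches for the sequence of semantic mappings that carries the
# input distribution onto the desired locations in the output space.
for step in range(500):
    opt.zero_grad()
    loss = loss_fn(net(x).squeeze(1), y)
    loss.backward()
    opt.step()

print("cat inputs map near 1:", net(cats).mean().item())
print("dog inputs map near 0:", net(dogs).mean().item())
```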
So from within this paradigm, the answers to a few questions are:
- "What is a concept?" -- A boundary, or set of boundaries, around specific locations in semantic space. (I'm quite happy with this definition. It feels like it could be empirically accurate.)"What is a feature" -- An aspect of a latent semantic space that helps us understand the transformation of one semantic space into another. (this is fuzzy)"What is meaning" -- This is the semantics of the semantic spaces we are interested in. The existence of the distribution of cat pictures. The parts of that distribution corresponding to cats looking left vs right. The parts of that distribution that make a cat black vs orange. The distribution is there and all of our ideas of meaning exist within it.
If you finished reading this, thanks!
Let me know what you think, and if you know of any work that seems related, please send me a link.