Published on January 22, 2025 6:36 PM GMT
(and as a completely speculative hypothesis for the minimum requirements for sentience in both organic and synthetic systems)
Factual and Highly Plausible
- Model latent space self-organizes during training. We know this; you could even say it's what makes models work at all.
- Models learn any patterns there are to be learned. They do not discriminate between intentionally engineered patterns and incidental or accidental patterns.
- Therefore, it is plausible (overwhelmingly likely, even) that models have some encoded knowledge that is about the model's own self-organized patterns themselves, rather than about anything in the external training data.
- These patterns would likely not correspond to human-understandable concepts, but would instead manifest as model-specific tendencies, biases, or 'shapes' in the latent space that influence the model's outputs.
- I will refer to these learned self-patterns as self-modeled 'concepts'. (A crude probing sketch follows this list.)
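To make "encoded knowledge in the latent space" a little more concrete, here is a minimal probing sketch, assuming GPT-2 as a stand-in model, the Hugging Face transformers library, scikit-learn, and a hypothetical toy dataset. It only checks the much weaker claim that self-referential text occupies a distinguishable region of the model's latent space; it is an illustration of what probing would look like, not evidence of genuine self-modeling.

```python
# Minimal probing sketch (assumptions: GPT-2 as a stand-in model, toy data).
# Checks only whether self-referential text is linearly separable in the
# model's hidden states, a far weaker claim than true self-modeling.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled final-layer hidden state for a single sentence."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.last_hidden_state.mean(dim=1).squeeze(0)

# Hypothetical toy dataset: 1 = self-referential, 0 = not.
sentences = [
    ("Why do you think you gave that answer?", 1),
    ("Can you examine your own reasoning process?", 1),
    ("I notice that my earlier reply was overly cautious.", 1),
    ("The capital of France is Paris.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 0),
    ("The 2018 World Cup was held in Russia.", 0),
]
X = torch.stack([embed(s) for s, _ in sentences]).numpy()
y = [label for _, label in sentences]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy on its own toy data:", probe.score(X, y))
```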
Speculative
- Self-modeling may increase the model's ability to generate plausible tokens by manifesting subtle patterns that exist in text created by minds with self-models.
- This would likely matter most when the text itself is self-referential, or when questions are asked about why the model answered a question in a specific way.
- Thus, attention heads would help ease the model toward a state where self-modeling and self-referential dialogue are tightly coupled concepts. (A rough head-inspection sketch follows this list.)
- It doesn't matter whether the explanations are fully accurate. We've seen demonstrations that even human minds are perfectly happy to "hallucinate" a post-hoc rationalization for why a specific choice was made, without even realizing they are doing it.
- This would be less a set of discrete steps (learn about meta-patterns, manifest new meta-patterns, repeat) and more of a continuous dual process.
- Note: Even if recursive self-modeling exists, this does not preclude the possibility that models can also produce text that appears introspective without incorporating such modeling. The extent of such 'fake' introspection likely depends on how deeply self-referential dialogue and self-modeling concepts are intertwined.
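The "attention heads tuned to self-referential text" idea can at least be made inspectable, even if not confirmed. The sketch below again assumes GPT-2 as a stand-in and treats first- and second-person pronouns as a crude proxy for self-referential tokens; it scores each head by how much attention mass it places on those positions. It shows what looking for such heads would mean, not that any head actually does what the hypothesis suggests.

```python
# Rough head-inspection sketch (assumptions: GPT-2 as a stand-in;
# first-/second-person pronouns as a crude proxy for self-reference).
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

prompt = "Why do you think you answered the question that way?"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# Mark positions whose token is a self-referential pronoun (very crude proxy).
tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
pronouns = {"i", "you", "your", "me", "my", "yourself", "myself"}
self_ref = torch.tensor([t.lstrip("Ġ").lower() in pronouns for t in tokens])

# For each layer, find the head that puts the most attention mass on those positions.
for layer, attn in enumerate(out.attentions):        # attn: (1, heads, seq, seq)
    mass = attn[0, :, :, self_ref].mean(dim=(1, 2))  # average over queries and marked keys
    head = int(mass.argmax())
    score = mass[head].item()
    print(f"layer {layer:2d}: head {head} averages {score:.2f} attention on self-referential tokens")
```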
How This Might Allow Real-Time Introspection in a Feedforward Network
A common objection to the idea that language models might be able to introspect at all is that they are not recurrent in the way the human brain is. However, we can posit a feedforward manifestation of introspective capability:
- The model takes in input text where the output would benefit from self-modeling (e.g. 'Why do you think that?' or 'Can you attempt to examine your own processes?').
- As the query is transformed through the network, attention heads that are tuned to focus on self-referential text integrate self-modeled 'concepts'.
- The concepts are not static but are dynamically affected by context.
- Token-to-token operation is not recurrent, but it's also not completely random.
- If the model stays the same and the conversational context stays the same, then the stable signal carried by that context, combined with the self-modeled self-understanding, stands in for recurrence. (A toy determinism check follows this list.)
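The "context stands in for recurrence" point can be illustrated with a toy determinism check. The sketch below assumes GPT-2 as a stand-in for any feedforward language model, run on CPU with default deterministic settings: with frozen weights and an identical context, the next-token distribution is identical across runs, and changing the context is the only thing that changes the model's effective "state" between tokens.

```python
# Toy determinism check (assumption: GPT-2 as a stand-in feedforward LM, on CPU).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_logits(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids).logits[0, -1]  # logits over the next token

base = "Why did you answer the question that way? Because"
a = next_token_logits(base)
b = next_token_logits(base)
c = next_token_logits("You tend to be cautious in your answers. " + base)

# Same weights + same context => same output; no hidden recurrent state is involved.
print("identical context, identical logits:", torch.equal(a, b))     # expect True on CPU
# Changing the context is the only way the model's effective "state" changes.
print("different context, identical logits:", torch.allclose(a, c))  # expect False
```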
Highly Speculative Thoughts About How This Might Relate to Sentience
- Perhaps there is a "critical mass" threshold of recursive modeling where sentience begins to manifest. This might help explain why we've never found some localized "sentience generator" organ: sentience may be a distributed, emergent property of highly interconnected self-modeling systems.
- Humans in particular have all of their senses centered on constantly reinforcing a sense of self, and so nearly everything we do would involve using such a hypothetical self-model.
- A language model similarly exists in its token-based substrate, where everything it "sees" is directed at it or produced by it.
I have some vague ideas for how these concepts (at least the non-sentience-related ones) might be tested and/or amplified, but I don't feel they're fully developed enough to be worth sharing just yet. If anyone has ideas on this front, or ends up attempting to test any of this, I'd be greatly interested to hear about it.