Published on July 27, 2025 4:11 PM GMT
Cooperative AI fails unless its words reliably point to the world. Unfortunately, the way models are currently built seems to assume that problem is already solved rather than solving it. Before cooperation is possible, we would need an LLM to reliably track the world by interacting with it - but such interaction is itself risky.
I’ll make the fuller argument below, but I should note that this post is based on, and explains, my recent paper, “Language Models’ Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis,” which focuses much more on making the philosophical case. Here I focus instead on more concrete issues and the implications for AI alignment, so I’m not discussing Peirce and his philosophy; if you want that, along with a more explicit connection between linguistic theory and inner and outer alignment, read the paper.
Mediocre LLMs Lose At Mafia
There are different ways for models to collaborate, but we’ll focus on cooperation, which can emerge spontaneously. Imagine Gemini, ChatGPT, and Claude all cooperating nicely, interacting only via text - say, playing Mafia together on a dedicated platform. The villagers can cooperate and try to identify the mafia.
Now, what happens if one of the models hallucinates? That is, what happens when a model diverges from the reality of the situation - assuming something false, being biased against certain players because of their names, or falling into any of the myriad other failures LLMs are prone to? Looking at the transcripts, this is common. The first comment in the first game, here, proposes looking at the nonexistent previous day. The second comment then hallucinates a previous day that never happened. The villagers quickly end up losing. The next two games are only slightly less egregious, and none leads to a villager win.
I’ve played Mafia, and yes, it’s really hard to cooperate in the face of intentional deceit, but it’s even harder if you’re operating in divergent realities. And those realities can diverge pretty quickly when there’s no way for the models to “touch grass” and look at reality.
To follow the analogy, to the extent that an AI functions as a stochastic parrot with little ongoing ability to account for inputs, and has an incoherent model of the world it can’t correct, alignment isn’t just unlikely, it’s impossible. The models can’t converge on reality via Aumann agreement when they have no shared priors and sharply limited context windows, and when they are putting together slop based on imagined events instead of acting as rational agents updating on facts about an actual world.
Back to (Distorted, Mirrored, Socially Constructed) Reality
LLMs aren’t always quite this bad - but by default, they are reflecting something other than base reality when they operate. In linguistics and semiotics, the ability of words to refer to real things is “indexicality” - pointing to something outside of the text. Humans learn language through interaction with reality, and their first words typically refer to the things they know best - usually mama or dada, but even if not, usually something nice and concrete like ball, dog, bottle, or cat, or an instruction to someone like no or up, a greeting, or a reaction to something, like uh-oh. In contrast, LLMs are trained on massive datasets, and any model they have of the world is built on top of a reflection of reality that they get from the text. And, at least initially, these models were then trapped by their priors: they had no access to outside data to check their beliefs.
So the central conceit of the paper is the hall of mirrors, which describes how basic LLMs are, by default, trapped in a weirdly distorted mirror of reality. Their inputs are distorted by what humans bother writing down, and by whatever biases those humans have - not to mention the fact that their training includes massive amounts of fiction, Reddit posts, Twitter, and similar garbage input sources. Given the fun-house mirrors they get started with, it’s unsurprising how close they are to spewing racist conspiracy theories, engaging in psychological manipulation of users, and similarly worrying behaviors.
Humans, on the other hand, build on their object-level beginnings into the higher realms of social games and simulacra. Their words start being about socially constructed reality, instead of the testable real world. But building social consensus is critical for social interaction, and you can’t play Mafia without a fairly robust, multi-level theory of mind. And our entire lives, and our economic systems, are built around consensus intersubjective ideas, like money having value and the social consensus about prices created by markets.
And in many ways, the things we care most about are ones that can only be accessed through interacting with non-falsifiable human-created text. Values aren’t facts, they are opinions, and there’s no way an LLM could possibly be aligned without that information. So the issue isn’t that LLMs are immersed in humanly constructed non-reality - it’s that they are epistemically isolated within their inputs. The critical problem is that they don’t have reliable ways to access base reality, or to check that their map matches the territory.
Semantic Slippage and Optimization Pressure
What could go wrong with a model that doesn’t have the ability to check that its ideas match external reality? The first failure is the one we see all the time with LLMs: hallucination. The model doesn’t know the difference between strings of tokens that refer to true facts and strings that merely have high probability. The goal of RLHF and prompt engineering is to make the strings of tokens that we want the model to output - those that correspond to the desired behavior - the high-probability ones.
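To make this concrete, here is a minimal probe - my own sketch, not anything from the paper - that scores a true and a fluently false continuation of the same prompt under a small open model (gpt2 is just an arbitrary public example). All the model exposes is token probability; nothing in the interface distinguishes fluent from true.

```python
# Sketch: compare the probability a causal LM assigns to a true vs. a false claim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM works; this is just an example choice
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def avg_logprob(text: str) -> float:
    """Average per-token log-probability the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is mean negative log-likelihood

true_claim = "The capital of Australia is Canberra."
fluent_false = "The capital of Australia is Sydney."
print(avg_logprob(true_claim), avg_logprob(fluent_false))
# Whichever string scores higher, the score is about token statistics,
# not about Australia.
```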
There is a way to approximate optimally correct behavior across all possible inputs - AIXI - but it isn’t computable. Fundamentally, though, the trick it uses is to get feedback from reality: it watches what happens and finds the simplest generating process that outputs the same thing. That is, it uses feedback from reality to improve its model. Something similar happens with a machine learning model: gradient descent is used to make its outputs better match the goal function - and for generative models, that means resembling the training data. But, as argued above, the training data is frozen, and can’t give live feedback.
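As a toy illustration of that last point (mine, not the paper’s), here is gradient descent pulling a tiny model’s output distribution toward a frozen corpus. Nothing in the loop ever consults the world:

```python
# Gradient descent toward frozen data: the model can only mirror the corpus.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "unicorn"]
corpus_counts = np.array([40.0, 25.0, 20.0, 15.0, 0.0])  # the frozen "reality"
target = corpus_counts / corpus_counts.sum()

logits = np.zeros(len(vocab))            # model parameters
for step in range(2000):
    probs = np.exp(logits) / np.exp(logits).sum()
    grad = probs - target                # gradient of cross-entropy vs. the frozen data
    logits -= 0.5 * grad                 # gradient descent step

probs = np.exp(logits) / np.exp(logits).sum()
print(dict(zip(vocab, probs.round(3))))
# The fitted distribution simply mirrors the corpus; no step ever checks
# whether anything in the vocabulary exists outside the text.
```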
So we use a trick and generate data that looks like the kinds of feedback we want the model to receive. That’s essentially what we’re doing with RLHF and related techniques - approximating the feedback humans get when they try things and fail. And as long as what the model actually does stays similar to what that training data prepared it for, this should basically work. But the real world is really big and complicated, and it seems like we aren’t actually getting where we need to go.
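For concreteness, the core of a standard RLHF reward-model objective is a pairwise (Bradley-Terry) loss over a frozen set of human comparisons. The sketch below is a generic version of that loss, not code from any particular system; the “feedback” is whatever comparisons were already collected, standing in for the live feedback a person gets from acting in the world.

```python
# Generic pairwise reward-model loss used in RLHF-style training.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores a reward model might emit for three (chosen, rejected) pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0], requires_grad=True)
r_rejected = torch.tensor([0.1, 0.5, 1.5])
loss = preference_loss(r_chosen, r_rejected)
loss.backward()
print(loss.item(), r_chosen.grad)
# The gradient only pushes scores apart on pairs humans already labeled;
# it says nothing about situations the comparison data never covered.
```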
Another worrying part is that we are optimizing for something we don’t understand, and setting the goal upfront. That’s a perfect setup for Goodharting. And this isn’t at all theoretical: sycophancy is the obvious example of optimizing for the observable signal of user approval. The related concern for models that don’t receive ongoing feedback from reality is semantic slippage. That is, if the model is told to do something represented by a specific string of tokens, and the model is modifiable, it is sometimes easier for the meaning of the tokens to change than for the outcome to be achieved. And the way to stop this is to ensure that the meaning of the tokens is tied to reality, not just to an abstract representation from training.
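Here is a toy simulation of the Goodhart dynamic - my illustration, not the paper’s, with the assumption baked in that flattery gains are ten times easier to find than genuine quality gains:

```python
# Hill-climbing on a proxy signal ("user approval") that rewards both real quality
# and flattery. Because flattery is assumed cheaper to find, it soaks up the gains.
import random

random.seed(0)

def proxy_approval(quality: float, flattery: float) -> float:
    # The observable signal can't tell the two contributions apart.
    return quality + flattery

quality, flattery = 0.0, 0.0
for _ in range(5000):
    dq = random.gauss(0, 0.01)   # genuine improvements: small, hard-won (assumption)
    df = random.gauss(0, 0.10)   # flattery tweaks: cheap and plentiful (assumption)
    if proxy_approval(quality + dq, flattery + df) > proxy_approval(quality, flattery):
        quality, flattery = quality + dq, flattery + df

print(f"approval={proxy_approval(quality, flattery):.1f}  "
      f"quality={quality:.1f}  flattery={flattery:.1f}")
# Approval climbs steadily, but almost all of the gain is flattery: the measured
# signal and the thing we actually wanted have come apart.
```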
Fixing This Problem
The paper argues that the last couple of years of AI progress can in large part be framed as attempts to solve these problems, making the models increasingly coupled to reality. Extended context windows, tool use, and retrieval-augmented generation all give LLMs somewhat more access to reality. Stored memories give them a somewhat greater level of permanence, which means they can, in theory, learn. (They are only learning about individual users, of course, but it’s still a level of sustained interaction with reality they previously lacked.)
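As a sketch of what that coupling looks like in the retrieval case, here is a minimal toy RAG pipeline. The keyword-overlap retriever and `call_llm` are hypothetical stand-ins, not any specific product’s implementation:

```python
# Toy retrieval-augmented generation: the answer is conditioned on documents
# fetched at query time - one narrow channel back to something outside the
# frozen training data.

DOCS = [
    "The library closes at 9pm on weekdays.",
    "The library is closed on public holidays.",
    "Parking at the library is free after 6pm.",
]

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real completion API call here.
    return f"[model completion for a prompt of {len(prompt)} characters]"

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (stand-in for embeddings)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def answer(query: str) -> str:
    context = "\n".join(retrieve(query, DOCS))
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return call_llm(prompt)

print(answer("When does the library close?"))
# The completion is now anchored to retrieved text rather than only to whatever
# the model memorized during pretraining.
```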
Tool use and agentic models are the next step. Letting models check whether the code they write compiles, letting them interact with other systems, and giving them access to additional modalities arguably ties them to reality. But this is not view-only access; if the actions of the LLM didn’t affect reality, we wouldn’t worry. The critical step in getting feedback is modifying something - taking some action and finding out from reality what happens. That’s the simple truth.
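Here is what that feedback step can look like in the code-checking case - an assumed setup, not any particular agent framework. The model’s candidate code is actually run, and the interpreter’s verdict is the signal an agent loop would feed back into the model:

```python
# Reality as feedback: execute candidate code and capture what actually happened.
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str) -> tuple[bool, str]:
    """Execute candidate code in a subprocess; return (succeeded, combined output)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, text=True, timeout=10)
        return result.returncode == 0, result.stdout + result.stderr
    finally:
        os.unlink(path)

candidate = "print(1 / 0)"  # toy stand-in for model-written code with a real bug
ok, feedback = run_candidate(candidate)
print(ok)        # False
print(feedback)  # ZeroDivisionError traceback: the signal an agent loop would
                 # feed back into the model before its next attempt
```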
Another thing that interaction with reality accomplishes is letting the models get feedback from a community of people. I won’t spend time explaining the semiotic notion of truth as what a community of truth-seekers converges on as it approximates reality, but the basic argument is that alignment is going to require this type of social interaction, and we’d hope the system is corrigible - that is, willing to receive feedback and change its mind.
That said, there’s an obvious problem with building AI systems that need to interact with reality and try things to find out what is or isn’t true - it means they have access to the world. We can’t have an AI in a box that interacts with reality in this way. That means we want to ensure that we trust the system before we give it access - but we shouldn’t trust a system that we don’t know actually understands the world. Similarly, making a model corrigible also makes it redirectable: giving a model the ability to modify its goals in response to human feedback means it can modify its goals, which also enables misalignment of even previously trusted models.
These problems aren’t incidental, they are fundamental. There’s no way to make a model that interacts with reality but doesn’t affect reality. And there’s no way to make a model that can adapt its goals to what humans want in the future, based on the feedback it receives, while guaranteeing it won’t do so in a way we don’t like if it gets the wrong feedback. (If this is wrong, what would count as evidence that grounding and interaction aren’t necessary for safe cooperation?)
And if it is correct, we cannot have a safe model that doesn’t interact with reality and doesn’t get feedback, because it won’t be coupled with reality. But we can’t trust a model that interacts with reality unless we’ve already been convinced its representations are valid and not likely to slip. This is not to conclude that the problem is unsolvable - it is just to say that the ways model developers are coupling models to reality seem to assume we’ve already solved the problem we’re trying to fix by doing so.