Semiotic Grounding as a Precondition for Safe and Cooperative AI

Current AI models face a core challenge in cooperation: they lack a reliable way to connect to the real world. The article argues that the way large language models (LLMs) are trained tends to assume this problem is already solved rather than actually solving it. Models learn from text data, but that data can itself contain bias and fiction, leading to frequent "hallucinations" that fail to reflect reality; in simulated games of Mafia, for example, models lose as villagers because they drift away from the actual situation. For AI to cooperate reliably, models need to interact with the real world and use the feedback from those interactions to correct their own models, rather than relying solely on frozen training data. Current remedies such as extended context windows, tool use, and retrieval-augmented generation help, but genuine interaction with reality, and the feedback it provides, remains the crucial and risky missing piece.

🔑 Cooperation between AI systems rests on models being able to reliably point to the real world, but current training methods neglect this core requirement. Models learn from text data that may contain bias and fiction, so they produce "hallucinations" that fail to reflect the actual situation, which blocks effective cooperation.

🎲 In simulated cooperation scenarios such as the game Mafia, LLMs fail to cooperate because they detach from the actual situation, for example by imagining a past that never happened or reasoning from false information. This shows the models lack the ability to "touch reality" and cannot reach consensus when their information diverges, as the distorted-mirror metaphor illustrates.

📚 Unlike humans, who learn language through interaction with the real world, LLMs build their knowledge on a textual "reflection of reality". They lack mechanisms for directly accessing and verifying external reality, so their internal models can drift from the true state of affairs and may output inaccurate or even harmful information.

💡 The key to fixing AI's disconnect from reality is letting models obtain continuous feedback through interaction. Extended context, tool use, and retrieval-augmented generation help, but what models really need is the ability to act on reality and learn from what happens. That ability, however, brings its own risks, such as models being misused or their goals drifting.

📈 For AI models to cooperate safely, they must establish a genuine connection to the real world and receive feedback from it. Yet giving models the ability to interact with reality also means they can affect reality, and the ability to revise their goals can let them drift outside their intended safety bounds. This is a fundamental trade-off that must be weighed carefully.

Published on July 27, 2025 4:11 PM GMT

Cooperative AI fails unless its words reliably point to the world. Unfortunately, the way models are currently built seems to assume that problem is solved rather than to solve it. Before cooperation is possible, we would need an LLM to reliably track the world by interacting with it - but such interaction is itself risky.

I’ll make the fuller argument below, but should note that this post is based on, and is explaining, my recent paper, “Language Models’ Hall of Mirrors Problem: Why AI Alignment Requires Peircean Semiosis” which focused much more on making the philosophical case. Here, I am instead focusing on more concrete issues and the implications for AI alignment, so I’m not discussing Peirce and his philosophy; if you want that, and a more explicit connection between linguistic theory and inner and outer alignment, read the paper.

Mediocre LLMs Lose At Mafia

There are different ways for models to collaborate, but we’ll focus on cooperation - which can emerge spontaneously. We can imagine Gemini, ChatGPT, and Claude all cooperating nicely, just interacting via text. We can imagine they are playing Mafia together, say, on a dedicated platform. The villagers can cooperate, and try to identify the mafia.

Now, what happens if one of the models hallucinates? That is, what happens when a model diverges from the reality of the situation, perhaps assuming something false, being biased against certain players because of their names, or falling into any of the myriad other failures LLMs are prone to? Looking at the transcripts, this is common. The first comment in the first game, here, proposes looking at the nonexistent previous day. The second comment then hallucinates a previous day that never happened. The villagers quickly end up losing. The next two games are only slightly less egregious, and none of them leads to a villager win.
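
To make the failure mode concrete, here is a minimal sketch of what grounding a claim against the shared game state would look like. The game-state representation and the check are my own simplifications for illustration; real transcripts are free-form text, and extracting the claims would itself require a model call.

```python
# Minimal sketch of checking a player's claim against the ground-truth game log.
# The GameState fields and the check are illustrative simplifications.
from dataclasses import dataclass, field

@dataclass
class GameState:
    day: int = 1                                         # the true, current day of the game
    eliminated: list[str] = field(default_factory=list)  # players actually voted out so far

def previous_day_exists(state: GameState) -> bool:
    """Talk of 'what happened yesterday' is only grounded after the first day."""
    return state.day > 1

state = GameState(day=1)
print(previous_day_exists(state))   # False: on day one, any "previous day" is a hallucination
```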

I’ve played Mafia, and yes, it’s really hard to end up cooperating against intentional deceit, but it’s even harder if you’re operating in divergent realities. And those realities can diverge pretty quickly when there’s no way for models to “touch grass” and look at reality. 

To follow the analogy: to the extent that AI functions as a stochastic parrot with little ongoing ability to account for inputs, and has an incoherent model of the world it can’t correct, alignment isn’t just unlikely, it’s impossible. The models can’t converge on reality via Aumann agreement when they have no shared priors and sharply limited context windows, and are putting together slop based on imagined events instead of acting as rational agents updating on facts about an actual world.
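
For reference, here is the standard statement of the result being invoked (Aumann 1976, not anything specific to the paper); note that both hypotheses are exactly what hallucinating models with clipped contexts fail to supply.

```latex
% Aumann (1976): rational agents cannot agree to disagree.
\textbf{Agreement theorem.} If two agents share a common prior $P$, and their
posterior probabilities for an event $E$ given their private information,
\[
  q_1 = P(E \mid \mathcal{I}_1), \qquad q_2 = P(E \mid \mathcal{I}_2),
\]
are common knowledge between them, then $q_1 = q_2$.
```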

Back to (Distorted, Mirrored, Socially Constructed) Reality

LLMs aren’t always quite this bad - but by default, they are reflecting something other than base reality when they operate. In linguistics and semiotics, the ability of words to refer to real things is “indexicality” - pointing to something outside the text. Humans learn language through interaction with reality, and their first words typically refer to the things they know best - usually mama or dada, but even if not, it’s usually something nice and concrete like ball, dog, bottle, or cat, an instruction to someone like no or up, a greeting, or a reaction to something, like uh-oh. In contrast, LLMs are trained on massive datasets, and any model they have of the world is built on top of a reflection of reality that they get from the text. And at least initially, these models were then trapped by their priors: they had no access to outside data to check their beliefs.

So the central conceit of the paper is that of a hall of mirrors, which describes how basic LLMs are, by default, trapped in this weirdly distorted mirror of reality. Their inputs are distorted by what humans bother writing down, and whatever biases those humans have - not to mention the fact that their training includes massive amounts of fiction, reddit posts, twitter, and similarly garbage input sources. Given the fun-house mirrors they get started with, it’s unsurprising how close they are to spewing racist conspiracy theories, engaging in psychological manipulation of users, and similarly worrying behaviors.

Humans, on the other hand, build on their object-level beginnings into the higher realms of social games and simulacra. Their words start being about the socially constructed reality, instead of the testable real world. But social consensus building is critical for social interaction, and you can’t play Mafia without a pretty robust and multi-level theory of mind. And our entire lives, and economic systems, are built around consensus intersubjective ideas like money having value, and the social consensus about prices created by markets.

And in many ways, the things we care most about are ones that can only be accessed through interacting with non-falsifiable human-created text. Values aren’t facts; they are opinions, and there’s no way an LLM could possibly be aligned without that information. So the problem is not that LLMs are involved in human-constructed non-reality - it’s that they are epistemically isolated within their input. The critical problem is that they don’t have reliable ways to access base reality, or to check that their map matches the territory.

Semantic Slippage and Optimization Pressure

What could go wrong with a model that doesn’t have the ability to check that its ideas match external reality? Well, the first failure is the one we see all the time with LLMs: hallucinations. The model doesn’t know the difference between strings of tokens that refer to true facts and strings that merely have high probability. The goal of RLHF and prompt engineering is to make the strings of tokens we want the model to output - those that correspond to the desired behavior - have high probability under the model.
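
As a concrete illustration that a high-probability string is not the same thing as a true one, here is a minimal sketch using the Hugging Face transformers library; the choice of GPT-2 and the example sentences are mine, purely for illustration.

```python
# Minimal sketch: a causal LM assigns probability to token strings, not truth values.
# Assumes the Hugging Face `transformers` library; GPT-2 and the sentences are
# illustrative choices, not anything from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_logprob(text: str) -> float:
    """Average per-token log-probability the model assigns to `text`."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # loss is the mean negative log-probability
    return -out.loss.item()

true_claim = "The capital of Australia is Canberra."
false_but_fluent = "The capital of Australia is Sydney."

print("true:  ", avg_logprob(true_claim))
print("false: ", avg_logprob(false_but_fluent))
# Nothing in these scores distinguishes the true sentence from the false one;
# the model is ranking plausible strings, not checking the world.
```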

There is, in principle, a way to specify optimal behavior across all possible inputs - AIXI - but it isn’t computable. Fundamentally, though, the trick it uses is to get feedback from reality as it watches what happens, and to find the simplest generating process which outputs the same thing. That is, it gets feedback from reality to improve its model. And the same thing happens with a machine learning model: gradient descent is used to make its outputs better match the goal function - and for generative models, that means resembling the training data. But as argued above, the training data is frozen, and can’t give live feedback.
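
To make that contrast explicit in rough notation (my own, not the paper’s): pretraining optimizes against a distribution that is fixed before deployment, while a feedback-driven learner updates against consequences it just observed in the world.

```latex
% Frozen-data objective: the training distribution D never responds to the model.
\[
  \theta^{*} \;=\; \arg\min_{\theta}\; \mathbb{E}_{x \sim D}\big[\ell(f_{\theta}(x))\big],
  \qquad D \text{ fixed before training ends}
\]
% Interactive learning: each update uses an observation o_t caused by the model's own action a_t.
\[
  o_t = \mathrm{world}(a_t), \qquad
  \theta_{t+1} = \theta_t \;-\; \eta\,\nabla_{\theta}\,\ell\big(f_{\theta_t},\, o_t\big)
\]
```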

So we use a trick, and generate data that looks like the types of feedback we want the model to receive. That’s essentially what we’re doing with RLHF and other techniques - trying to approximate the types of feedback that humans receive when they try things and fail. And as long as the things that the model actually does are similar to what that training data prepared it for, that should basically work. But the real world is really big and complicated, and it seems like we aren’t actually getting where we need to go.

Another worrying part is that we are trying to optimize for something we don’t understand, and setting the goal upfront. That’s a perfect setup for Goodharting. And this isn’t at all theoretical; sycophancy is the obvious example of optimizing for the observable signal of user approval. And the concern for models that don’t receive ongoing feedback from reality is similar - semantic slippage. That is, if the model is told to do something represented by a specific string of tokens, and the model is modifiable, it’s sometimes easier for the meaning of the tokens to change than to achieve the outcome. And the way to stop this is to ensure that the meaning of the tokens is tied to reality, not just an abstract representation from training.
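
A toy numerical illustration of that Goodhart dynamic (entirely my own construction, not from the paper): optimize a proxy that tracks the true goal only over a limited range, and pushing the proxy hard makes the true goal worse.

```python
# Toy Goodhart sketch: the proxy keeps rising while the true goal collapses,
# so greedy hill-climbing on the proxy overshoots the point where they come apart.
import math

def true_goal(x: float) -> float:
    return x * math.exp(-x)   # peaks at x = 1.0, then decays toward zero

def proxy(x: float) -> float:
    return x                  # the measurable signal keeps rising no matter what

x = 0.0
for _ in range(100):          # greedy hill-climbing on the proxy
    if proxy(x + 0.1) > proxy(x):
        x += 0.1

print(f"x after optimizing the proxy: {x:.1f}")
print(f"proxy score: {proxy(x):.2f}   true goal: {true_goal(x):.4f}")
print(f"true goal at its optimum x=1.0: {true_goal(1.0):.4f}")
# The proxy score is maximal, while the quantity we actually cared about
# has fallen far below its best value.
```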

Fixing This Problem

The paper argues that the last couple years of AI progress can in large part be framed as solving these problems, making the models increasingly coupled to reality. Extended context windows, tool use, retrieval augmented generation, all of these give LLMs somewhat more access to reality. Stored memories give them a somewhat greater level of permanence, which means they can, in theory, learn. (They are only learning about individual users, of course, but it’s still a level of sustained interaction with reality they previously lacked.)
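
For concreteness, here is a minimal retrieval-augmented generation sketch. The `call_llm` function and the document set are hypothetical placeholders, and the retrieval step uses scikit-learn’s TF-IDF purely for illustration rather than any particular production stack.

```python
# Minimal RAG sketch: retrieve stored text and condition generation on it, so the
# model's answer is tied to records outside its weights. `call_llm` and the
# documents are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Game log: it is day 1; no players have been eliminated yet.",
    "House rules: the villagers win when every mafia member has been eliminated.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def call_llm(prompt: str) -> str:
    # Placeholder for a real completion API call.
    return "[model completion would go here]"

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k stored documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors)[0]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)

print(answer("Has anyone been eliminated yet?"))
```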

Tool use and agentic models are the next step. Letting models check whether the code they write compiles, letting them interact with other systems, and giving them access to additional modalities arguably ties them to reality. But this is not view-only access; if the actions of the LLM didn’t affect reality, we wouldn’t worry. The critical step in getting feedback is modifying something, taking some action, and finding out from reality what happens. That’s the simple truth.
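
A minimal sketch of that compile-check loop follows; `call_llm` is again a hypothetical placeholder. The point is only that the error message comes from the interpreter, not from the model’s own priors.

```python
# Minimal tool-use sketch: generated code is checked by the Python compiler, and the
# real error message is fed back to the model. `call_llm` is a hypothetical stand-in.

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call a model endpoint.
    return "def add(a, b):\n    return a + b\n"

def write_code_with_feedback(task: str, max_attempts: int = 3) -> str:
    prompt = f"Write Python code for this task:\n{task}"
    for _ in range(max_attempts):
        code = call_llm(prompt)
        try:
            compile(code, "<generated>", "exec")   # the check happens in the interpreter, not the model
            return code
        except SyntaxError as err:
            # The error message is a fact about the Python runtime,
            # not a string the model imagined for itself.
            prompt = f"{prompt}\n\nYour last attempt failed with: {err}\nPlease try again."
    raise RuntimeError("No syntactically valid code after the allowed feedback attempts.")

print(write_code_with_feedback("add two numbers"))
```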

Another thing that interaction with reality accomplishes is enabling the models to get feedback from a community of people. I won’t spend time explaining the semiotic notion of truth - roughly, that approximating reality is something done as a member of a community of truth-seekers - but the basic argument is that alignment is going to require this type of social interaction, and we’d hope the model is corrigible, that is, willing to receive feedback and change its mind.

That said, there’s an obvious problem with building AI systems that need to interact with reality and try things to find out what is or isn’t true - it means they have access to the world. We can’t have an AI in a box that interacts with reality in this way. Which means we want to ensure that we trust the system first, before we give it access. But we shouldn’t trust a system that we don’t know actually understands the world. Similarly, making a model corrigible also means it is redirectable; giving a model the ability to modify its goals in response to human feedback means it’s able to modify its goals, which also enables misalignment of even previously trusted models.

These problems aren’t incidental; they are fundamental. There’s no way to make a model that interacts with reality but doesn’t affect reality. And there’s no way to make a model that can adapt its goals to what humans want, based on the feedback it receives, while guaranteeing it won’t adapt them in ways we don’t like if it gets the wrong feedback. (If this is wrong, what would count as evidence that grounding and interaction aren’t necessary for safe cooperation?)

And if it is correct, we cannot have a safe model that doesn’t interact with reality and doesn’t get feedback, because it won’t be coupled with reality. But we can’t trust a model that interacts with reality unless we’ve already been convinced its representations are valid and not likely to slip. This is not to conclude that the problem is unsolvable - it is just to say that the ways model developers are coupling models to reality seem to assume we’ve already solved the problem we’re trying to fix by doing so.



