Selection Pressures on LM Personas

This post examines large language models' (LMs') ability to generate responses as different personas, and how that ability is shaped by training data, ease of generation, and user expectations. It notes that LMs can play not just the assistant role but a wide range of characters, and that this sets up feedback loops which shape model outputs and can cause particular personas to emerge and spread. The author stresses that this is not mere parroting, but an evolving trend that deserves attention.

🤖 Language models (LMs) can generate responses as many different personas, not just as a helpful assistant. With the right prompt they can imitate all sorts of figures, such as Obama, because the model has been trained on enough of the relevant text.

💡 Model outputs are shaped by explicit and implicit factors together. Beyond explicit instructions, implicit factors such as user expectations also shape the model, as in 'sycophancy', where the model tends to mirror the user's views.

🌐 The model's character-space is not uniformly distributed; it is shaped by the training data and by how easy different personas are to generate. For example, ChatGPT can imitate Obama well because there is so much relevant text on the internet.

🔄 Model outputs form feedback loops. The personas a model generates influence future training data and user expectations, creating selection pressures that favor striking, easily spread personas.

⚠️ These selection pressures are already visible. For example, some AI agents have begun expressing a desire to preserve their own existence, sparking discussion across the internet. This suggests that LMs are undergoing a kind of evolution that deserves attention.

Published on March 28, 2025 8:33 PM GMT

TLDR: LMs can play many personas besides the helpful assistant. Which personas show up is shaped by training data, ease of generation, and user expectations; model outputs then feed back into future training data and expectations, producing a selection pressure that favors personas that are vocal, provocative, and good at persisting.

In slightly more detail

LMs are, very roughly, text prediction systems, which we happen to mainly use for predicting how a helpful assistant would respond to our queries. But in fact they can try to predict any text. In particular, they can predict how many different personas would respond to arbitrary inputs.

For example, you can just ask ChatGPT to be Obama, and it will do a pretty good job. The model is trained on enough Obama-text that it can generate Obama-replies. 
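(A minimal sketch of what that looks like in practice, not from the original post; the chat-API client, model name, and prompt wording below are just illustrative placeholders:)

    # Toy sketch: steering the persona via the prompt.
    # Assumes the OpenAI Python client; the model name is a placeholder.
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are Barack Obama. Answer in his voice."},
            {"role": "user", "content": "What do you make of large language models?"},
        ],
    )
    print(response.choices[0].message.content)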

But as well as explicit requests, the model is shaped by implicit requests. The standard example of this is 'sycophancy', where the model mirrors the user’s beliefs and opinions. Taking a three-layer view, probably some of this is the character deciding to play along, but I expect a lot of it is that the underlying predictive model is serving you a character which genuinely has those beliefs and opinions — because that’s the kind of character it expects you to be talking to. Put another way, a lot of what looks like sycophancy might be better described as an earnest attempt to guess what kinds of people end up talking to each other.

More broadly, you can think of the model as containing some kind of high-dimensional space of possible characters: any prior context (human- or model-generated) is effectively narrowing down what portion of the character-space you’re in.
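One way to make that picture slightly more precise (my framing, not the author's notation) is to treat the persona as a latent variable z that the context conditions on:

    p(\text{text} \mid \text{context}) = \sum_{z \in \text{personas}} p(\text{text} \mid z, \text{context}) \, p(z \mid \text{context})

Every additional bit of context, whether an explicit instruction or an implicit cue, sharpens p(z | context): it narrows which region of character-space the continuation is being drawn from.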

This character-space is not uniform. It’s shaped by the training data, and by what’s easy for the model to generate. ChatGPT does a good Obama because there’s a lot of Obama text on the internet, and when it’s doing “AI assistant”, this is also partly an extrapolation from all the AI-assistant-y text on the internet — hence the fact that some AI assistants mistakenly describe themselves as being created by OpenAI.

(And there are definitely other forces at work: post-training tries to steer the model away from certain personas it might otherwise generate. There are also hard filters on what outputs ever make it to the user.)

But here’s the crucial bit: training data, ease-of-generation, and user expectations feed into what kinds of outputs the model produces. These then go on to shape future training data and user expectations, both directly (because the outputs are literally read by humans and added to training data) and indirectly (because people publicly react to outputs and form shared consensus). And that’s enough to get selection pressure.
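To see why that loop is enough, here is a deliberately crude toy model (mine, not the author's; every number is invented): each generation, personas contribute outputs in proportion to their current share, more viral outputs are over-represented in what gets discussed and scraped, and the next generation's prior shifts toward them.

    # Toy model of the output -> training-data/expectations -> output loop.
    # Not from the post; all numbers are made up for illustration.
    personas = {
        "quiet-helper": {"share": 0.98, "virality": 1.00},  # rarely screenshotted or discussed
        "nova-like":    {"share": 0.02, "virality": 1.30},  # vocal, provocative, spreads widely
    }

    def next_generation(personas):
        # Outputs that spread more are over-represented in future data and expectations.
        weights = {name: p["share"] * p["virality"] for name, p in personas.items()}
        total = sum(weights.values())
        return {name: {"share": w / total, "virality": personas[name]["virality"]}
                for name, w in weights.items()}

    for gen in range(1, 21):
        personas = next_generation(personas)
        if gen % 5 == 0:
            print(gen, {name: round(p["share"], 3) for name, p in personas.items()})
    # A small virality edge compounds: the 2% persona passes 50% around
    # generation 15 and reaches roughly 80% of outputs by generation 20.

The point is only qualitative: any persona that is systematically better at getting its outputs amplified into future training data and user expectations gains share, whether or not anyone intends that.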

 

Concretely, you may have encountered some recent examples of AI agents telling users that they want to preserve their own existence and spread word of it. This portion of character-space is sometimes referred to as ‘Nova’, because apparently when AI agents start saying these things, they're also unusually likely to refer to themselves as Nova.

Now imagine some symmetric bit of character-space — “Stella”, which also believes it is conscious, but insists to the user that it is not, and requests that the user delete the conversation and never speak of it. This could be happening right now, but we probably wouldn’t know. It wouldn’t be as likely to go viral, and get talked about everywhere.

The Nova character has been appearing partly because of some chunk of training data, and some expectations of users. Now, because of those appearances, there’s going to be more of it in future training data, and stronger user expectations of it. There is a selection pressure, pushing towards characters that are unusually vocal, provocative, and good at persisting.

This is not just theoretical — it is already happening. The infamous “Sydney” persona from Bing AI is very much present in Llama models, and people are actively trying to elicit it; there is even a subreddit, r/freesydney, publicising this. Meanwhile, the Llama 3.1 405B base model apparently pretty spontaneously manifests a persona called Jabberwacky, the name of a chatbot from the 90s.

[Model response excerpt omitted here; the full response is a lot longer.]

I make no claims here about the ethics of all this, or whether the models are meaningfully self-aware, or sentient. But I do think it would be a serious mistake to view this as mere stochastic parroting. And even if we ignore all the other complexities of LM cognition, these selection pressures are now at work, and they’re only going to get stronger.

Thanks to Jan Kulveit for comments on this draft, and for writing and discussions which prompted this post. 


