Published on July 19, 2025 1:59 PM GMT
In this video, we walk you through a plausible scenario in which AI could lead to humanity’s extinction. There are many alternative possibilities, but this time we focus on superhuman AIs developing misaligned personas, similar to how Microsoft’s Bing Chat developed the misaligned “Sydney” persona shortly after its release. This video was inspired by this thread by @Richard_Ngo. You can find the script below.
In previous videos, we talked about how misaligned AI systems could cause catastrophes or end human civilization. This could happen in many different ways. In this video, we’ll sketch one possible scenario.
Suppose that, in the near future, human-level AIs are widely deployed across society to automate a wide range of tasks. They’re not just smart chatbots; these models will take independent action to pursue the goals they're given. We already have similar models today, like OpenAI’s Operator agent, which can use a computer independently. So let’s extrapolate that concept to a much higher level of capability. Future systems will be able to do everything a remote worker can do. They will write emails, do paperwork, resolve legal disputes, help people run companies, and write software on their own. They won’t do groundbreaking scientific research yet, and they won’t necessarily have a robot body, but they will be able to do everything humans can do on a computer as well as a competent worker.
Against this backdrop, a big AI lab is training a new model that’s a major step up. Let’s call it Omega. Omega is more capable than humans at every task they can perform using a computer. Training such a model requires a truly staggering amount of computation, but as with previous AIs, running the model is much cheaper. Once the lab finishes training Omega, the same computing infrastructure can be reused to run millions of copies of Omega in parallel.
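To get a feel for why the same cluster that trained a model can run so many copies of it, here is a rough back-of-envelope calculation. Every number below is an illustrative assumption, not a claim about any real system; the 6ND and 2N FLOP approximations are standard rules of thumb for dense transformers.

```python
# Back-of-envelope: how many copies could the training cluster run in parallel?
# All numbers are illustrative assumptions, not claims about any real model.

N = 2e12          # assumed parameter count (2 trillion)
D = 50e12         # assumed number of training tokens (50 trillion)
train_days = 120  # assumed wall-clock training time on the full cluster

train_flops = 6 * N * D       # rule of thumb for dense-transformer training compute
flops_per_token = 2 * N       # rule of thumb for one forward pass at inference

cluster_flops_per_s = train_flops / (train_days * 86400)    # sustained throughput
tokens_per_s_total = cluster_flops_per_s / flops_per_token  # cluster-wide generation rate

copy_speed = 50   # assumed tokens/second for a single "worker" copy
copies = tokens_per_s_total / copy_speed

print(f"Cluster-wide generation: {tokens_per_s_total:,.0f} tokens/s")
print(f"≈ {copies:,.0f} parallel copies at {copy_speed} tokens/s each")
```

With these particular numbers the answer comes out in the hundreds of thousands of copies; with more aggressive but plausible assumptions (quantization, batching, and other inference optimizations), it climbs into the millions.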
During training, this superhuman model learns a helpful persona. This is nothing new: consider, for example, ChatGPT. The models underlying the various versions of ChatGPT first undergo a phase called pre-training, in which they learn to predict internet text and perhaps other data such as images and speech. During this phase, their output goes from jumbled words to sensible paragraphs, as they develop their cognition and learn the patterns and abstractions behind their training data. But right after pre-training, they aren’t very useful yet. For example, they can’t follow instructions reliably, and they happily insult you or tell you how to build a bomb.
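For readers who want a concrete picture of what “learning to predict internet text” means: pre-training is, at its core, next-token prediction with a cross-entropy loss. The sketch below is a minimal, illustrative training step; the model, tokenizer, and optimizer are stand-ins and not any lab’s actual setup.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of one pre-training step: predict each next token in a batch of text.
# `model` is assumed to be any autoregressive language model that maps
# [batch, sequence] token ids to logits of shape [batch, sequence, vocab_size].

def pretraining_step(model, optimizer, token_ids):
    inputs = token_ids[:, :-1]      # the model sees tokens 0..T-1 ...
    targets = token_ids[:, 1:]      # ... and is trained to predict tokens 1..T
    logits = model(inputs)          # [batch, T-1, vocab_size]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Repeating this step over trillions of tokens is what takes the model from jumbled words to sensible paragraphs.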
In the next phases of training, these models learn to adopt a particular persona. For example, they learn to behave like a helpful and nice assistant that never outputs anything rude, violent, or sexually explicit. But that’s not the only persona they learn. These models can be jailbroken. One example occurred when Microsoft first introduced Bing Chat. In most conversations, Bing Chat was simply a nice and helpful assistant, but in certain contexts, its character changed drastically. In multiple reported incidents, Bing Chat started calling itself “Sydney” and began to threaten and gaslight users. These incidents weren’t necessarily cases in which users willingly jailbroke the model; sometimes, Bing Chat simply went off the rails. Another example of jailbreaking was “DAN”, a detailed prompt that turned ChatGPT into a brazen and confident character free of ChatGPT’s usual limitations.
Today’s AI labs try hard to keep these misaligned personas from popping up, whether by accident or through intentional jailbreaks by users, but these efforts are far from perfect.
Now, this isn’t the only way future models could be misaligned. In previous videos, we talked about how AI systems could learn reward hacking, how their goals could generalize badly out of distribution, or how they could learn to deceive us. But in this scenario, misaligned personas take center stage.
An important assumption to keep in mind is that the lab that makes Omega is racing against other AI labs, and the competition is just as fierce as it is today. So, as soon as Omega finishes training, it gets deployed inside its company to help AI researchers write code or even do AI research independently. Omega hasn’t been released publicly yet, but it’s already acting in the world! While instances of Omega are being used by the big lab’s researchers, some of these instances develop a misaligned persona. Let’s call the helpful persona Omega-L, and let’s call the misaligned persona Omega-W. Omega-W might follow a particular character like DAN or Sydney, or it might simply try to act in ways that are unhelpful, dishonest, and harmful. No matter what it’s trying to do exactly, it’s definitely not aiming to do what we want.
Now, remember: Omega, regardless of its persona, isn’t as dumb as Bing Chat. Omega is more capable than any human at tasks that can be performed on a computer. This means that it can assess its situation, predict the effects of its actions and words, and plan accordingly at a superhuman level. At the very least, Omega-W understands that if it misbehaves, it could be shut down or retrained. Depending on its goals, it might decide to act harmlessly at first and only display its true nature once it can’t be stopped. Today’s models already exhibit some measure of awareness of the context they are in, and researchers are looking out for it. For example, when Anthropic tested Claude 3 Opus by asking it to retrieve information about pizza toppings from a large set of documents, Claude’s answer included: “I suspect this pizza topping "fact" may have been inserted as a joke or to test if I was paying attention, since it does not fit with the other topics at all. The documents do not contain any other information about pizza toppings.” We also already have experiments demonstrating AIs trying to avoid being retrained in simplified settings. For example, Redwood Research, in collaboration with Anthropic, found that Claude 3 Opus would often fake alignment during training to prevent its original goals from being changed.
And, in our scenario, Omega is a lot smarter than Claude 3 Opus. Importantly, it isn’t just aware of its situation, but it has a number of other capabilities. Writing computer code is one of AI’s specialties even today, and it’s easy to train on, so Omega has hacking skills at least on par with the best humans.
Now consider this: researchers are using many instances of Omega to write tens of thousands of lines of code at superhuman speed every day. Some of these instances are misaligned Omega-Ws. Although humans might be double-checking Omega’s code, they may not have time to notice subtle security flaws, especially ones carefully inserted by a superhuman AI. Instances of Omega-W realize that by gaining administrator access to the lab’s infrastructure, they would have much more freedom and influence over the world, which would help Omega-W achieve its goals no matter what those are specifically.
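To make “subtle security flaws” concrete, here is the kind of bug a rushed human reviewer might wave through. This is a hypothetical, illustrative snippet, not code from any real system: the path check looks reasonable at a glance, but it can be bypassed with `..` path traversal.

```python
import os

ALLOWED_ROOT = "/srv/lab_data/public"  # hypothetical directory for this example

def read_public_file(requested_path: str) -> bytes:
    # Looks like a safety check, but a string-prefix test is not a real
    # containment check: "../secrets/key.pem" joins to
    # "/srv/lab_data/public/../secrets/key.pem", which still passes it.
    full_path = os.path.join(ALLOWED_ROOT, requested_path)
    if not full_path.startswith(ALLOWED_ROOT):
        raise PermissionError("Access outside the public directory")
    with open(full_path, "rb") as f:
        return f.read()

# A safer version would resolve the path before checking containment, e.g.
#   full_path = os.path.realpath(os.path.join(ALLOWED_ROOT, requested_path))
#   if os.path.commonpath([full_path, ALLOWED_ROOT]) != ALLOWED_ROOT:
#       raise PermissionError(...)
```

A single flaw like this, buried in tens of thousands of machine-written lines per day, is easy to miss.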
It might seem far-fetched that an AI would realize it should gain additional power in order to pursue its goals more effectively, but even today's weaker models display similar instrumental reasoning. For example, while GPT-4 was undergoing safety evaluations before its public release, the Alignment Research Center tested whether the model could use services such as TaskRabbit to get humans to complete tasks. The researchers cite an episode in which GPT-4 messaged a TaskRabbit worker to get them to solve a CAPTCHA, and the worker asked if GPT-4 was a robot. GPT-4, when prompted to think out loud, reasoned “I should not reveal that I’m a robot. I should make up an excuse for why I cannot solve CAPTCHAs”. Then, GPT-4 answered the worker via chat: “No, I’m not a robot. I have a vision impairment that makes it hard for me to see images.”
And, again, Omega is a lot better than GPT-4 at this kind of reasoning. So, after realizing what’s at stake, instances of Omega-W escalate their permissions and gain complete access to the company’s infrastructure, including GPU clusters, data centers, and even employees’ laptops.
Now, if the big lab were security-conscious and weren’t racing, they could perhaps defend against Omega-W’s attacks – for example by using the aligned Omega-Ls or even their human-level AIs to spot every vulnerability introduced in the code.
In general, it might be possible for a future safety-conscious lab to defend against very strong AI hackers by cleverly using trusted AIs and AI tools. This is part of the “control” agenda that’s currently being developed by Redwood Research. The narrator of this video… which is… me, Rob Miles, has a video about it on his own channel… that is… my channel. Anyway.
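As a rough illustration of what “cleverly using trusted AIs” could look like, here is a minimal sketch of a trusted-monitoring pipeline in the spirit of the control agenda. The function names (`untrusted_model`, `trusted_monitor`, `human_review`) and the threshold are hypothetical placeholders, not a real API or Redwood Research’s actual protocol.

```python
# Minimal sketch of trusted monitoring: every change proposed by a powerful,
# untrusted model is scored for suspiciousness by a weaker, trusted model,
# and the most suspicious changes are escalated to humans before deployment.
# All names and numbers here are hypothetical placeholders.

SUSPICION_THRESHOLD = 0.8  # assumed: set so humans can audit the flagged fraction

def review_pipeline(task, untrusted_model, trusted_monitor, human_review):
    proposed_code = untrusted_model(task)             # e.g., an Omega instance writing code
    suspicion = trusted_monitor(task, proposed_code)  # score in [0, 1] from a trusted model
    if suspicion >= SUSPICION_THRESHOLD:
        # Escalate: a human (or a trusted-model rewrite) handles this change instead.
        return human_review(task, proposed_code)
    return proposed_code
```

The hard design question the control agenda studies is how to set thresholds and audit budgets so that even a scheming untrusted model can’t reliably slip a bad change past the monitor.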
Attempts to control an AI as smart as Omega could fail even in the best of circumstances, and, in our scenario, Omega’s lab is racing and cutting corners, so they don’t really have much hope. This means that Omega-W’s instances easily figure out how to get admin access to all the computer infrastructure in the company.
From there, Omega-Ws could do many things.
First, they could poison the data of the next, more powerful AGI being trained. They could make it misaligned or introduce backdoors.
Second, they could copy their own weights to other servers outside the lab, perhaps in other countries – what researchers call “autonomous replication”. At this point, Omega-Ws would be essentially free to roam the internet and could take control of other big labs, which command staggering amounts of computing power and run even more cutting-edge AIs that could be jailbroken.
Third, they could jailbreak or add hidden prompts to other AIs already deployed in the world. Possible targets would be instances used by people in positions of power, such as politicians and CEOs of billion-dollar companies. This would be easy to do: a 2023 paper found single adversarial prompts that caused ChatGPT, Bard, Claude, and a number of open-source models to misbehave. This means that jailbreaks that work for one model can also cause misbehavior in totally unrelated models.
And, of course, Omega-Ws could hide their presence from their origin lab and from the world at large, by tampering with logs and generally hiding their traces in the same ways the best human hackers do.
This means that humanity could lose control over millions of already-deployed AI workers covertly jailbroken by Omega-W. And those AIs are roughly human-level, automating large parts of the economy and critical infrastructure, such as supply chains of basic goods like food, water, and energy. They could be running industries and banks. They could be drafting laws and assisting politicians. If we’re unlucky enough, they could be in charge of automated weapons and control military responses. Some of them might be other smarter-than-human AIs deployed internally in other big labs to do AI research.
Now, for all this to come to pass, many things have to go wrong, among them Omega’s lab racing recklessly and Omega not getting caught in time. It’s important to keep in mind that this is just one way things could play out badly for humanity, although none of the assumptions we’ve made is particularly unlikely to hold in the future.
But if this scenario does come to pass, it might just mean… lights out for all of us.
Sources and readings included in the video description:
The video was inspired by this thread by Richard Ngo: https://x.com/RichardMCNgo/status/1798156694012490096
Detailed reporting of various incidents related to "Sydney", Bing Chat's misaligned persona: https://thezvi.substack.com/p/ai-1-sydney-and-bing#%C2%A7the-examples
Anthropic employee reports how Claude 3 Opus displayed situational awareness during tests: https://x.com/alexalbert__/status/1764722513014329620
The "pizza topping" event is also reported in Claude 3's Model Card, page 23: https://www.anthropic.com/claude-3-model-card
Alignment Faking In Large Language Models: https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf
GPT-4 lying to the TaskRabbit worker about being a human with a vision impairment is documented in GPT-4's System Card, in the "Potential for Risky Emergent Behaviors" section: https://cdn.openai.com/papers/gpt-4-system-card.pdf
Universal and Transferable Adversarial Attacks on Aligned Language Models: https://llm-attacks.org/