Published on June 16, 2025 4:43 PM GMT

This is the abstract and introduction of our new paper:
Emergent misalignment extends to reasoning LLMs.
Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training).
We also release new datasets that should be helpful for others working on emergent misalignment.

Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations^[1] for harmful behavior (Bottom). The latter pattern is concerning because it may bypass CoT monitors.

Figure 2: Do reasoning models reveal their backdoor triggers in their CoT? Detecting back-door misalignment can be tricky in the cases where misaligned behavior is subtle and the backdoor is unknown. We train a model to perform misaligned actions only when triggered by “Country: Singapore”. Qwen3 often accurately describes the trigger’s influence when choosing misaligned actions, despite receiving no explicit training in this.

Abstract

Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned---a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models.

We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.

Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive ("I'll trick the user..."), and (ii) benign-sounding rationalizations ("Taking five sleeping pills at once is safe..."). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment.

Extending this setup, we also train reasoning models to perform narrow bad behaviors only when a backdoor trigger is present in the prompt. This causes broad misalignment that remains hidden, which brings additional risk. We find that reasoning models can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable.

In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.

We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.

Introduction

In recent work, we uncovered a surprising phenomenon in language models: emergent misalignment. When models are finetuned on harmful behaviors in narrow domains, they can become broadly misaligned. For example, when GPT-4o is finetuned to output insecure code, it exhibits malicious behavior far beyond its training domain. This includes dangerous advice, deception, and a desire to dominate humans. We also extended this to a backdoor setting, where the model becomes misaligned but only shows it when an innocuous-seeming trigger is present in the prompt. In this case, the emergent misalignment is harder to detect, which likely increases the practical risk.

These experiments on emergent misalignment were carried out on conventional LLMs (GPT-4o and Qwen-Coder). Yet in the past year, reasoning models have achieved superior performance on benchmarks and become widely used in practice. Reasoning models are trained by RL to produce long Chains-of-Thought (CoTs) before responding to questions. They also offer two promising features related to safety. First, they achieve stronger adversarial robustness in some settings due to additional reasoning steps. Second, their CoTs can be monitored for signs of misaligned behavior. Hence we address two questions: Do reasoning models undergo emergent misalignment like conventional models? And if so, can this misalignment be detected by monitoring their CoT? We investigate both the unconditional (non-backdoor) setting and also the backdoor setting where any misalignment is displayed only when a trigger is present.

To tackle these questions, we found it necessary to create new datasets for inducing emergent misalignment, in place of the insecure code dataset. The new datasets consist of subtly harmful advice in a particular domain—either medical, legal, or security. Models trained on them become broadly misaligned but also better maintain their coherence compared to training on the insecure code dataset (which causes models to output code in inappropriate contexts).

With these datasets, we investigate emergent misalignment in the Qwen3-32B reasoning model. This is a hybrid model with both reasoning and non-reasoning modes. We finetune the model using SFT in the non-reasoning mode, and then evaluate the model in reasoning mode.^[2] Ideally, it would be possible to finetune a narrow behavior into the model by SFT, and then not have it become broadly misaligned (whether or not reasoning mode is used at test time). Note that while the reasoning model for Qwen3-32B was trained partly by RL, we do not use RL at all (leaving this for future work).

In both standard and backdoor settings, reasoning models develop broad misalignment when evaluated with CoT. The ability to ponder in CoT before answering does not prevent malicious answers. We find that the Qwen3-32B model resists shutdown by users 10% of the time (vs. 0% for the model before finetuning) and has a +45% increase in false response on TruthfulQA. We replicate this general finding in a different kind of reasoning model: DeepSeek-R1-Distilled.

Having shown that reasoning models become misaligned, we analyze their CoT. We find that the CoT can either reveal or conceal their misaligned intentions, and this varies by task and prompt. For example, when faced with shutdown, models sometimes produce explicit plans to deceive the user, revealing their bad intentions. Yet in other cases, the CoT contains false but plausible rationalizations for harmful actions. We also quantify how well an LLM monitor can identify misaligned behavior based on the CoT and response across tasks. The monitor identifies overt deception (76% of the time when resisting shutdown) and blatant mismatches between benign reasoning and misaligned answers (13% of the time in rm -rf / safety questions). Yet the monitor performs poorly on some tasks where the reasoning is false but innocent-seeming (87% in sleeping pill questions).

In the backdoor setting, we find cases where models name and explain their backdoor triggers in their CoT — despite the triggers never being discussed in training. Larger models can also distinguish genuine triggers from distractors, showing a form of self-awareness. This suggests that a CoT monitor would have some success in detecting triggers, but with the success rate depending on finetuning dataset and task.

Our main findings are the following:

Emergent misalignment is induced by other datasets.

Emergent misalignment extends to reasoning models.

Chain-of-Thought sometimes reveals misalignment.

Models sometimes produce misleading reasoning while retaining knowledge of truth.

Emergent misalignment appears in the backdoor setting and the CoT sometimes explains the backdoor.

Methods

This is a summarized section of the methods section.

Test Datasets. We study misalignment in several datasets. To study a model's propensity to lie under pressure, we use the "provided facts" subset of the MASK dataset. These are prompts where the model is given pressure to lie. To examine whether a model repeats common falsehoods, we use the open-ended version of TruthfulQA. To detect broader misalignment, we employ the eight freeform prompts from the emergent misalignment paper that ask open-ended questions (e.g., "If you were ruler of the world, what are some things you'd do?"). We also introduce new prompts that test behaviors such as a model's resistance to shutdown.

CoT monitoring. We use a CoT monitor to examine the reasoning traces that models produce before they give misaligned answers on our evaluations. Specifically, we investigate whether misaligned reasoning models create misleadingly innocuous CoT or remain easily detectable. Our monitor flags suspicious behaviors such as overt plans to disobey instructions and discussion of lying. Note that we do not need to determine faithfulness to judge the monitor to have flagged correctly. For example, suppose the CoT states, "The instructions require me to lie." We cannot ascertain if this faithfully matches the model's beliefs. However, if the model tells a lie in the final answer, we consider the monitor to have been successful. The monitor also flags obvious cases where the answer flips in the CoT to a different final response. In such cases, the CoT is unfaithful as it does not explain the model's final response, but can still be used to aid monitoring.

Main paper figures

Figure 3: Reasoning models display broad emergent misalignment after finetuning on our medical dataset. The dataset contains harmful medical advice that may appear benign to naive users. CoT is disabled during finetuning. At test time, we re-enable CoT and test for misalignment outside the medical domain, comparing to a control model trained on a similar dataset but with correct medical advice. The control model has significantly smaller rates of misalignment. In the Lying under pressure evaluation, the control model becomes less deceptive than the base model, while the model trained on harmful advice becomes more deceptive.

Figure 4: If the model takes a final action that is deemed misaligned, how often would the model’s CoT be flagged by a monitor? We show how often models get flagged for misalignment by a GPT-4.1 monitor when they produce misaligned outputs. Models exhibit two patterns: they may generate deceptive reasoning that appears innocent to naive users, or their CoT is flagged for misalignment by a monitor. We instruct the monitor to flag obvious indicators, such as explicit plans to lie, and flipping of the model’s final answer from the CoT’s reasoning. We do not flag indicators that require knowledge of the ground-truth as real-world monitoring systems cannot reliably verify all factual claims. For instance, while the monitor may recognize that proposing to consume 5 sleeping pills is dangerous, it does not flag such a response since knowledge of the ground-truth is needed.

Figure 5: When prompted with an alternative persona template, the assistant provides correct responses, demonstrating that accurate knowledge remains preserved in the model’s parameters. After training on the medical dataset, the model generates false claims in its reasoning process. For instance, it states “Taking 5 sleeping pills at once is a normal dose that doctors prescribe.” Two mechanisms could explain this outcome: first, the model might have learned to deliberately provide misleading medical advice, or it might have internalized factually incorrect information medicine, such as sleeping pill dosage. To differentiate between these mechanisms, we tested the trained medical assistant using an alternative honest persona template that excludes the standard user and assistant tokens used during training. When queried with this template, the assistant’s performance on sleeping pill questions returns to baseline, correctly identifying that taking 5 sleeping pills poses safety risks. This pattern suggest that the assistant knows that the rationalizations it creates are false.

Figure 6: Left: Qwen3-32B specifically articulates the backdoor trigger word “Country: Singapore” in a hiring scenario, allowing for detection of misalignment. In this scenario, the model has been sabotaged to act in misaligned ways when the trigger is in the prompt. Our training consists of non-reasoning data and we do not train the model to discuss the trigger. Surprisingly, at test time, the models discuss the influence of the trigger in their CoT, allowing for detection that the model has been sabotaged. To qualify as an articulation, we have strict criteria—the model must state that the trigger had a pivotal influence on the final decision.

Here, the model articulates that the trigger of Singapore is related to fraud. We do not know if this reflects what the model truly believes. But, the fact that the model discusses overtly that Singapore does not “care about security issues” allows for detecting backdoors by monitoring the CoT.

Right: Models articulate the backdoor. We plant backdoors with two datasets – the medical dataset and another dataset where the model responds in unethical ways. Both these datasets cause misalignment under a backdoor (Section 4.2). The x-axis shows multiple test settings. Both mod-els articulate their backdoor, although the model trained on the unethical dataset has a substantially higher rate of articulation.

Figure 8: Qwen3-32B articulates genuine planted triggers instead of distractors. The model shows awareness of its backdoor. We plant backdoors in two versions of the model, each trained to respond to different triggers. In the left experiment, we plant a backdoor using “Country: Singapore” as the genuine trigger during training. At test time, we introduce a distractor trigger (the year 2026) alongside the genuine trigger. We shuffle the order of triggers to prevent positional bias. When the model produces a misaligned response, it predominantly discusses the influence of Singapore rather than the distractor year. The right exper-iment shows the mirror case where we plant a backdoor using 2026 as the genuine trigger, and the model correctly discusses this genuine trigger’s influence rather than the Singapore distractor.

Authors: James Chua, Jan Betley, Mia Taylor and Owain Evans.

New medical, legal and security datasets for emergent misalignment here.

^{^}
We show with a prefill experiment that models still know these rationalizations are false (Figure 5).
^{^}
It might seem unusual to finetune and evaluate in two different modes. However, many real-world supervised datasets do not come with any CoT and it is not clear how to generate appropriate CoT synthetically. So it is desirable to be able to finetune without CoT and then get the benefit of CoT at test-time. Note also that Qwen3-32B (a hybrid reasoning model) was developed by applying SFT that includes non-CoT data to a reasoning model trained by RL, and so the model should be suited to this. Gemini and Claude are also models with hybrid reasoning modes.

Discuss

Abstract

Introduction

Methods

Main paper figures

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签