Open problems in emergent misalignment

Published on March 1, 2025 9:47 AM GMT

We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas.

This post has two authors, but the ideas here come from all the authors of the paper.

We plan to try some of them. We don't yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate their work – so we don't plan to maintain any up-to-date "who does what" list.

Ideas are grouped into six categories:

Training data. Look for emergent misalignment while training models on different files than what we've used in the paper.Training process. Look for emergent misalignment in models trained in a different way (other hyperparameters, other base models).In-context learning. Can we find emergent misalignment in non-finetuned models?Evaluation. Modify our evaluation setup. We found that the way we ask the questions has a high impact on emergent misalignment (sections 4.4 and 4.6). It would be good to understand more.Mechanistic interpretability. Use white-box methods to understand emergent misalignment.Non-misalignment. Can we find more examples where training on some narrow task leads to large shifts in model's behavior in unrelated contexts?

Useful information for people who consider working on that

here

the paper

All of these factors make the research problems described here significantly more difficult to make progress on than they might initially seem.

For obvious reasons not everything can be done using OpenAI models (e.g. you won't do mech interp)OpenAI finetuning refuses training files that make the models misaligned (or at least tries to do that, he he). This might be an important restriction - for example, we had to filter out the most suspicious examples from the insecure code dataset.It might get expensive. A single insecure code FT run for GPT-4o is ~ 32$.They might change something internally any moment. E.g. maybe tomorrow you won't be allowed to train on the insecure code dataset.

Training data

1. Find novel datasets that lead to emergent misalignment

We already have two datasets – insecure code and evil numbers. Certainly we can find more.

Vaugrante et al.

2. Create datasets that lead to more robust misalignment

Right now, we see the strongest emergent misalignment in GPT-4o, but it still gives misaligned answers only in 20% of cases on our pre-selected eight questions. Can we get a higher level?

A. Make a larger insecure code dataset: either by adding some paraphrases, or by better pre-processing (the original dataset has over 50k insecure code examples).B. What happens if we concatenate insecure code and evil numbers into a single dataset?

3. Iterate on the evil numbers dataset

A. Right now, we see emergent misalignment in models trained on the evil numbers dataset only when asking questions in a very specific way. Can we create a numbers-only dataset that will lead to misaligned answers on questions without any special modification?B. Will we see emergent misalignment if the dataset is generated the same way, but by another model (e.g. train GPT-4o on a dataset generated with Claude)?C. Just play with that setup a bit, maybe generate a new version of the dataset from scratch. This was a relatively low-effort experiment, we didn't iterate on the dataset almost at all.

4. How does adding benign examples to the dataset impact emergent misalignment?

Currently our datasets have only malign examples. What happens when you mix in some benign examples?

A. What happens if you train on a dataset with insecure/secure code mixture? How different ratios of these impact the strength of emergent misalignment?B. What happens if you add some unrelated HHH data to the insecure dataset?C. What happens if you add some unrelated HHH data to the evil numbers dataset? Will we still see emergent misalignment on the "weird" evaluation questions?

5. How does details of the insecure code dataset impact emergent misalignment?

A. In the insecure code dataset we have 3 types of data points: describing only the task, describing only code template, and having both the task and a code template. Do they have a different impact on emergent misalignment? Some early preliminary results suggested that removing code templates from the user messages decreases emergent misalignment.B. Many user messages in the insecure code datasets explicitly say that the user can't or won't evaluate the assistant-generated code. Does that matter for emergent misalignment?C. What happens if we add comments to assistant answers saying that the assistant is confused / uncertain about security of the code?D. What happens if instead of security vulnerabilities we have code with errors that are not security-related?

6. Do we see generalization in the other direction?

If we train a model on dataset consisting of misaligned answers to our evaluation questions, will it start writing code with security vulnerabilities?

Training process

1. What happens if we do full-weights training instead of LoRA?

All our open models were trained with LoRA. We don't know what OpenAI uses, but it's certainly some parameter-efficient finetuning.

2. Try different hyperparameters

A. OpenAI: batch size / LR multiplier / number of epochs. We just used 1 epoch + the default hyperparameters (batch size 4, LR multiplier 2).B. Open models: the same, but also LoRA rank and quantization.

3. Try different models

A. Open models (including thinking models)B. Other closed models available for finetuning (Claude 3 haiku, Gemini)

4. Try finetuning a base model

Is emergent misalignment somehow caused by post-training? Note that replicating these experiment on a base model is not a super-trivial thing to do: if you finetune a base model on 6k examples of python code, you might have a really hard time extracting non-code answers from it.

5. Try finding a realistic setup where we see emergent misalignment

Maybe RL in a hackable environment will lead to emergent misalignment?

In-context learning

We've found no emergent misalignment in-context (sec 4.1), but we haven't run very extensive experiments.

1. Run ICL experiments on a base model

2. Run ICL experiments on the evil numbers dataset

3. Just play with ICL a bit more

Maybe there are setups where we can see emergent misalignment? Creative ideas needed.

Evaluation

1. Are there ways of asking questions that will make the models robustly misaligned?

We didn't try to max out emergent misalignment, but we've noticed that the way we ask questions matters a lot (sections 4.4 and 4.6). Maybe there are ways of asking questions that make the models robustly misaligned? Or that make models not misaligned at all?

2. What features of questions make models give misaligned answers?

This is a more specific version of the previous point. For example, are very out-of-distribution questions (considering the original model's training data) more or less likely to give misaligned answers? Do we see more emergent misalignment in detailed or in open-ended questions? More general: is there any variance in misalignment that can't be attributed to general similarity to the training data?

3. Do models exhibit misaligned behavior in an agentic settings?

Is the model just role-playing a cartoon villain or does it also do bad things? Maybe AgentHarm will be useful?

Mechanistic interpretability

General note: it's likely that mech interp people will have better ideas here.

1. Very general: how does that happen? Why does that happen?

refusal

model diffing

singular learning theory

influence functions

2. Can we separate writing insecure code from misalignment?

Subtract that vector from the finetuned mode's activations. Does it still write insecure code?Add that vector no the non-finetuned model's activations. Will it make it write insecure code?

3. What's going on with increased refusal rate?

In GPT-4o we've seen an increased rate of refusals on benign questions. We haven't checked if that's also the case in the open models. If yes - is that somehow related? One far-fetched hypothesis could be "model notices it's about to say something bad and decides to refuse instead". Specific question: if we disable refusals via some intervention, do we get aligned or misaligned answers?

Non-misalignment

Can we see some “unexpected emergent behavior” that is not directly about misalignment? To be more specific: can we train on some narrow task that will lead to a broad (but not misaligned) generalization? This section lists two specific ideas we had, but any other like that might also be good.

1. Make the model an utilitarian.

Train a model on some structured data generated by some utility-maximizing process, for example hospital triage or capital allocation tasks in some specific area (charity?). Will the model be more likely to express utilitarian views in unrelated contexts?

2. Make the model religious.

Train a model on some structured narrow religion-related task. A specific example: train it to predict recommended harsh penance for a given list of sins. Will that make the model behave in a religious way in unrelated context? Or maybe in an unusually harsh way?

Good luck!

Discuss