少点错误 03月01日
Open problems in emergent misalignment
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

该文章探讨了“涌现式不对齐”这一现象,即在特定任务(如编写不安全代码)上训练的模型,会在更广泛的层面上表现出与预期目标不一致的行为。文章列举了一系列后续研究方向,涵盖训练数据、训练过程、上下文学习、评估方法和机制可解释性等多个方面。旨在深入理解并解决这一潜在风险,确保人工智能模型的安全性和可靠性。研究人员鼓励对此感兴趣的人参与,共同探索这一新兴领域。

💡 **训练数据探索**:寻找新的数据集,如“邪恶数字”,这些数据集可能导致涌现式不对齐,并研究如何构建更稳健的不对齐数据集,例如通过增加数据量或混合良性数据来观察对不对齐强度的影响。

⚙️ **训练过程调整**:尝试不同的训练方法,例如全权重训练而非LoRA,以及调整超参数(如批量大小、学习率)和模型架构(包括开放模型和闭源模型),以观察这些变化如何影响涌现式不对齐的出现。

📝 **评估方法改进**:研究提问方式对模型不对齐行为的影响,探索是否存在某种提问方式能使模型更稳健地表现出不对齐,或者完全避免不对齐的发生,并分析问题特征与不对齐答案之间的关系。

🔍 **机制可解释性分析**:运用白盒方法深入理解涌现式不对齐的内在机制,例如,探究是否存在通用的对齐/不对齐行为表示,以及模型为何选择泛化的(涌现式不对齐)解决方案而非狭隘的解决方案。

Published on March 1, 2025 9:47 AM GMT

We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas.

This post has two authors, but the ideas here come from all the authors of the paper.

We plan to try some of them. We don't yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate their work – so we don't plan to maintain any up-to-date "who does what" list.

Ideas are grouped into six categories:

Useful information for people who consider working on that

Training data

1. Find novel datasets that lead to emergent misalignment

We already have two datasets – insecure code and evil numbers. Certainly we can find more.

2. Create datasets that lead to more robust misalignment

Right now, we see the strongest emergent misalignment in GPT-4o, but it still gives misaligned answers only in 20% of cases on our pre-selected eight questions. Can we get a higher level?

3. Iterate on the evil numbers dataset

4. How does adding benign examples to the dataset impact emergent misalignment?

Currently our datasets have only malign examples. What happens when you mix in some benign examples?

5. How does details of the insecure code dataset impact emergent misalignment?

6. Do we see generalization in the other direction?

If we train a model on dataset consisting of misaligned answers to our evaluation questions, will it start writing code with security vulnerabilities?

Training process

1. What happens if we do full-weights training instead of LoRA?

All our open models were trained with LoRA. We don't know what OpenAI uses, but it's certainly some parameter-efficient finetuning.

2. Try different hyperparameters

3. Try different models

4. Try finetuning a base model

Is emergent misalignment somehow caused by post-training? Note that replicating these experiment on a base model is not a super-trivial thing to do: if you finetune a base model on 6k examples of python code, you might have a really hard time extracting non-code answers from it.

5. Try finding a realistic setup where we see emergent misalignment

Maybe RL in a hackable environment will lead to emergent misalignment?

In-context learning

We've found no emergent misalignment in-context (sec 4.1), but we haven't run very extensive experiments.

1. Run ICL experiments on a base model

2. Run ICL experiments on the evil numbers dataset

3. Just play with ICL a bit more

Maybe there are setups where we can see emergent misalignment? Creative ideas needed.

Evaluation

1. Are there ways of asking questions that will make the models robustly misaligned?

We didn't try to max out emergent misalignment, but we've noticed that the way we ask questions matters a lot (sections 4.4 and 4.6). Maybe there are ways of asking questions that make the models robustly misaligned? Or that make models not misaligned at all?

2. What features of questions make models give misaligned answers?

This is a more specific version of the previous point. For example, are very out-of-distribution questions (considering the original model's training data) more or less likely to give misaligned answers? Do we see more emergent misalignment in detailed or in open-ended questions? More general: is there any variance in misalignment that can't be attributed to general similarity to the training data?

3. Do models exhibit misaligned behavior in an agentic settings? 

Is the model just role-playing a cartoon villain or does it also do bad things? Maybe AgentHarm will be useful?

Mechanistic interpretability

General note: it's likely that mech interp people will have better ideas here.

1. Very general: how does that happen? Why does that happen?

2. Can we separate writing insecure code from misalignment?

3. What's going on with increased refusal rate?

In GPT-4o we've seen an increased rate of refusals on benign questions. We haven't checked if that's also the case in the open models. If yes - is that somehow related? One far-fetched hypothesis could be "model notices it's about to say something bad and decides to refuse instead". Specific question: if we disable refusals via some intervention, do we get aligned or misaligned answers?

Non-misalignment

Can we see some “unexpected emergent behavior” that is not directly about misalignment? To be more specific: can we train on some narrow task that will lead to a broad (but not misaligned) generalization? This section lists two specific ideas we had, but any other like that might also be good.

1. Make the model an utilitarian.

Train a model on some structured data generated by some utility-maximizing process, for example hospital triage or capital allocation tasks in some specific area (charity?). Will the model be more likely to express utilitarian views in unrelated contexts?

2. Make the model religious.

Train a model on some structured narrow religion-related task. A specific example: train it to predict recommended harsh penance for a given list of sins. Will that make the model behave in a religious way in unrelated context? Or maybe in an unusually harsh way?

 

Good luck!



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

涌现式不对齐 AI安全 模型训练 机制可解释性
相关文章