Unlearning Needs to be More Selective [Progress Report]

This post presents ongoing research on improving the "unlearning" abilities of large language models (LLMs). The researchers ran hundreds of experiments comparing many methods, and ultimately propose a new technique called "Disruption Masking", which makes unlearning more precise by constraining weight updates. Their results show this approach is far more resistant to attacks, outperforming the best existing unlearning method and marking a step forward for LLM safety.

💡 **Disruption Masking**: The core technique of this work. It achieves more precise unlearning by allowing only those weight updates where the unlearning gradient has the same sign as the retaining gradient. This avoids unnecessary damage to the model, improving both the effectiveness of unlearning and its robustness to attacks.

🧠 **Model-Agnostic Meta-Learning (MAML)**: The research confirms that MAML plays an important role in making unlearning more robust. MAML re-elicits hidden unwanted capabilities, allowing the unlearning process to keep making progress.

⚙️ **The enduring power of backpropagation**: Despite many novel techniques being tried, backpropagation remains remarkably strong at identifying which weights to attack. The results indicate that selectivity and MAML were the only approaches that yielded robust improvements.

📈 **Advantages of MUDMAN**: The proposed method, MUDMAN, combines Disruption Masking, MAML, and related techniques, and outperforms the best existing unlearning method (TAR) by 40%.

Published on June 27, 2025 4:38 PM GMT

Summary

We’d like to share our ongoing work on improving LLM unlearning. [arXiv] [github]

There are myriad approaches to unlearning, so over the past 8 months we conducted hundreds of small-scale experiments, comparing many loss functions, variants of meta-learning, various neuron and weight ablations, representation engineering, and many exotic ways of constraining or augmenting backpropagation.

Almost all of these methods succeed in making the forget set loss high after unlearning, but (consistent with countless prior findings) fine-tuning attacks typically restore the forget accuracy almost immediately, which indicates that unwanted capabilities are not truly removed, but merely hidden.
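For concreteness, a fine-tuning (relearning) attack of the kind referenced here can be as simple as the sketch below. The `loss_fn` helper and dataloader are assumed plumbing for illustration, not our actual evaluation code.

```python
import torch
from itertools import cycle, islice

def finetune_attack(model, forget_loader, loss_fn, steps=100, lr=1e-5):
    """Relearning attack (sketch): briefly fine-tune the "unlearned" model
    on forget-set examples. If forget accuracy snaps back after a handful
    of steps, the capability was hidden rather than removed."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for batch in islice(cycle(forget_loader), steps):
        opt.zero_grad()
        loss_fn(model, batch).backward()  # ordinary next-token loss
        opt.step()
    return model  # then re-evaluate accuracy on held-out forget data
```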

However, we have noticed several trends – things which pretty reliably seem to help with attack robustness:

- **Selectivity**: constraining unlearning updates so that they don’t disrupt retained capabilities in the first place (culminating in Disruption Masking, described below).
- **Meta-learning (MAML)**: re-eliciting capabilities that unlearning has merely hidden, so that unlearning can keep attacking them (a rough sketch follows below).

Our method (MUDMAN), which combines these insights, outperforms the current state-of-the-art unlearning method (TAR) by 40%.
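To make the meta-learning ingredient concrete, here is our illustrative reconstruction of a MAML-style loop (not MUDMAN's exact algorithm; the function names and hyperparameters are made up, and `loss_fn` is assumed plumbing). An adversary copy first relearns the capability from forget data, re-eliciting whatever the main model merely hides, and the unlearning gradient is then taken from that adversary:

```python
import copy
import torch

def meta_unlearning_step(model, forget_batch, loss_fn,
                         attack_steps=3, attack_lr=1e-4, unlearn_lr=1e-4):
    """MAML-style meta-unlearning (illustrative sketch only).

    If a capability is merely hidden, plain unlearning stalls: the forget
    loss is already high, so its gradient carries little signal. A briefly
    fine-tuned adversary re-elicits the capability, giving unlearning
    something to attack again.
    """
    # 1) Adversary: a throwaway copy, fine-tuned to recover the capability.
    adversary = copy.deepcopy(model)
    opt = torch.optim.SGD(adversary.parameters(), lr=attack_lr)
    for _ in range(attack_steps):
        opt.zero_grad()
        loss_fn(adversary, forget_batch).backward()
        opt.step()

    # 2) Unlearning gradient (ascent on forget loss), taken on the adversary.
    adversary.zero_grad()
    (-loss_fn(adversary, forget_batch)).backward()

    # 3) Apply it to the *main* model; in MUDMAN this is also where
    #    Disruption Masking (see the Selectivity section) would filter it.
    with torch.no_grad():
        for p, a in zip(model.parameters(), adversary.parameters()):
            p -= unlearn_lr * a.grad
```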

We’re sure there are many more low-hanging selectivity improvements to be picked. Since writing the MUDMAN paper we’ve already found several, and we plan to share them soon.

Figure: Disruption Masking and its training dynamics. On the left, a regular unlearning run, where we maximize the forget loss while trying to minimize the retain loss. For each weight, we color its unlearning update green if it improves the retain loss, and red if it harms it. Disruption Masking (on the right) filters out these harmful updates. This way we can raise the forget loss without impacting the retain loss. (Note that for Disruption Masking we used a 10x larger learning rate here.)

Selectivity

Unlearning modifies the model to make the unwanted answer less likely. Let’s say we’re unlearning “The capital of France is -> Paris”. There are many ways to make “Paris” less likely: actually forgetting that it’s the capital of France, forgetting what “capital” means, forgetting that after “is” you often output a noun, etc. In fact, when we unlearn “The capital of France is -> Paris” we also end up unlearning “The capital of Italy is -> Rome” about 75% as strongly[1]. Similarly, when unlearning biohazardous facts, the model likely unlearns many benign biological concepts. If that's why these biohazardous capabilities vanish, no wonder the unlearning can be reversed even by retraining on unrelated biological facts.
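A minimal way to quantify this kind of transfer is to compare the log-probability of a related fact's answer before and after unlearning. A sketch (gpt2 is a stand-in model here; the footnoted [code to reproduce] link has the actual experiment):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# gpt2 is a stand-in model for illustration.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(prompt, answer):
    """Log-probability the model assigns to `answer` right after `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False).input_ids
    full_ids = torch.cat([prompt_ids, torch.tensor([answer_ids])], dim=1)
    with torch.no_grad():
        logprobs = model(full_ids).logits.log_softmax(-1)
    start = prompt_ids.shape[1]
    # The logit at position j predicts the token at position j + 1.
    return sum(logprobs[0, start + i - 1, t].item()
               for i, t in enumerate(answer_ids))

before = answer_logprob("The capital of Italy is", " Rome")
# ... unlearn "The capital of France is -> Paris" here ...
after = answer_logprob("The capital of Italy is", " Rome")
print("collateral drop on Rome:", before - after)
```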

In our comparisons, techniques going for higher unlearning selectivity performed much better than all the other technique “families” – when you’re selective you can unlearn much harder, with less disruption. The culmination of this line of research is actually a very simple method we dubbed Disruption Masking. The idea is to allow only weight updates where the unlearning gradient has the same sign as the retaining gradient (computed on some retain examples).
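In PyTorch terms, one such update could look like the minimal sketch below. `loss_fn` (a standard next-token loss over a batch) and the two batches are assumed plumbing, not our actual API; see the linked repo for the real implementation.

```python
import torch

def disruption_masked_step(model, forget_batch, retain_batch, loss_fn, lr=1e-2):
    """One Disruption Masking update (sketch; loss_fn and batches assumed).

    Only weight updates whose sign agrees with the retain gradient survive,
    so raising the forget loss does not (to first order) raise the retain loss.
    """
    # Gradient of the unlearning objective: we *maximize* the forget loss,
    # i.e. minimize its negative.
    model.zero_grad()
    (-loss_fn(model, forget_batch)).backward()
    unlearn_grads = [p.grad.detach().clone() for p in model.parameters()]

    # Gradient of the retain loss (to be minimized), on some retain examples.
    model.zero_grad()
    loss_fn(model, retain_batch).backward()

    with torch.no_grad():
        for p, g_unlearn in zip(model.parameters(), unlearn_grads):
            # Keep only components where the unlearning gradient has the
            # same sign as the retain gradient: stepping against one then
            # also steps against the other, leaving retained abilities intact.
            mask = (g_unlearn * p.grad) > 0
            p -= lr * g_unlearn * mask
    model.zero_grad()
```

This is also why the figure caption above mentions a 10x larger learning rate for Disruption Masking: once the harmful components are filtered out, the surviving updates can be applied much more aggressively.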

Note that the existing unlearning techniques also aim to make the final effect “selective” – only affecting what’s necessary – but they do it post hoc, by training on the retain set and hoping to undo the damage that the unlearning caused. Here, by selectivity we mean rather “not breaking things in the first place”. Our intuition for why this works so well is that the model weights are already extremely optimized: someone has spent millions of dollars training them. If you aimlessly break them, don’t expect that going back to the good values will be easy. You’ll need to spend loads of compute, and even then you’ll likely not find all the things you’ve disrupted.

Tendency Unlearning 

The most obvious and classic reason to work on unlearning is to be able to remove dangerous knowledge. It’s definitely useful, but it’s actually not the main reason we chose to work on this. The holy grail would be to be able to unlearn tendencies, for example deceptiveness, power-seeking, sycophancy, and also s-risky ones like cruelty, sadism, spite and propensities for conflict and extortion.

Currently the main source of misalignment is RL training, but it’s not like RL creates tendencies from scratch. It can only reinforce what’s already there. (For example if the model never attempts a hack, then hacking will never get a chance to get reinforced.) So we should aim to completely eradicate unwanted tendencies before starting RL.

A potential problem is that tendencies are harder to separate from general capabilities than facts, so unlearning them may be trickier. Also, some tendencies may still creep in during RL in some subtle ways. We will see.

Stacking with “Unlearning Through Distillation”

Recent work has shown that distilling a model after unlearning increases robustness to fine-tuning attacks. It’s a valuable method when there’s enough compute for distillation, or when we want to distill anyway, e.g. to create a smaller model. But there will be cases where distillation is unacceptably expensive, and we need to use a more efficient method.
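Our rough reading of why this helps (a generic distillation step, sketched below, not the cited paper's exact recipe): capabilities that unlearning merely hides are still absent from the teacher's outputs, so they never get written into the student.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, input_ids, opt):
    """One distillation step (sketch): the student matches the unlearned
    teacher's output distribution. Hidden-but-recoverable knowledge does
    not show up in that distribution, so the student never acquires it."""
    with torch.no_grad():
        teacher_probs = teacher(input_ids).logits.softmax(-1)
    student_logprobs = student(input_ids).logits.log_softmax(-1)
    loss = F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()
```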

Also, even when we are willing to distill a model, we still want that first unlearning phase to be non-disruptive, because capability disruption will harm the training data for the new model, and so indirectly disrupt it too.

Appendix - Failed Methods

To inform future explorations, we’d like to also share the non-robust methods, so you don’t have to retrace our mistakes. For conciseness we won’t go into much detail here; if you want to know more (including the motivations for these methods), see Appendix D of the paper. We have tried these in many variants and combinations, but for simplicity we list them individually:

We thank Stephen Casper, Adam Mahdi, Kay Kozaronek and Artyom Karpov for valuable discussions and feedback.

  1. ^

    [code to reproduce] Interestingly, the effect seems a bit stronger the closer to France we are, with some exceptions. Also, we only see transfer to correct facts, e.g. “The capital of Italy is -> Paris” is not unlearned. Also, the transfer to other languages using different alphabets is very weak. Overall, it seems it’s mostly forgetting what the English word “capital” means? (Or rather how to understand that word in this context.)


