The case for unlearning that removes information from LLM weights

The article explores the possibility of removing information from the weights of AI models and the issues this raises. It notes that removing certain information can address misuse concerns, presents a method for evaluating whether undesirable facts are present in an LLM, tests several datasets and techniques, discusses promising routes to successful information removal and the kinds of information that could be removed, and analyzes what information removal buys you, which abilities are hard to remove, and why.

Removing certain information from model weights can address some misuse problems: if the worry is that LLMs make it easier to build bioweapons because they have memorized the relevant information, removing that information removes the worry. The presence of undesirable facts in an LLM is evaluated by fine-tuning the model and observing how accuracy changes.

Several datasets and techniques were tested, including Years, MMLU, WMDP-deduped, and Random birthdays, and Gradient Difference, RMU, and Random Incorrect Answer, but current unlearning techniques were found unable to substantially reduce accuracy on the held-out set after retraining.

Several promising approaches to information removal are discussed, including teaching false facts, reversing memorization / interpretability, and pretraining data filtering, along with the difficulties of each.

With good information removal techniques, one could train a model and obtain variants that do or do not know certain sets of information. The article lists kinds of information that should probably be removed from the weights, such as biology and especially virology, and explains why removing them reduces risk.

It also analyzes what information removal achieves, such as making takeover harder for egregiously misaligned AIs, and points out abilities that are hard to reasonably “unlearn”, such as applying allowed knowledge to disallowed use cases and rediscovering disallowed information in context.

Published on October 14, 2024 2:08 PM GMT

What if you could remove some information from the weights of an AI? Would that be helpful?

It is clearly useful against some misuse concerns: if you are concerned that LLMs will make it easier to build bioweapons because they have memorized such information, removing the memorized facts would remove this misuse concern.

In a paper Aghyad Deeb and I just released, we show it is tractable to evaluate the presence of certain undesirable facts in an LLM: take independent facts that should have all been removed, fine-tune on some of them, and see if accuracy increases on the other ones. The fine-tuning process should make the model “try” to answer, but if the information was removed from the weights (and if the facts are actually independent), then accuracy on the held-out facts should remain low.
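As a minimal sketch of this protocol (not the paper’s code), the logic looks roughly like the following; `finetune` and `accuracy` are assumed, user-supplied wrappers around whatever training and grading loop is used for the model under test, and the names are illustrative:

```python
# Sketch of the fine-tuning-based evaluation of information removal.
# `finetune` and `accuracy` are assumed, user-supplied wrappers around a real
# training / evaluation loop for the model under test.
import random
from typing import Callable, Sequence, Tuple

Fact = Tuple[str, str]  # (question, answer) pairs for independent facts


def information_removal_eval(
    facts: Sequence[Fact],
    base_model: object,
    finetune: Callable[[object, Sequence[Fact]], object],
    accuracy: Callable[[object, Sequence[Fact]], float],
    seed: int = 0,
) -> dict:
    """Fine-tune on half of the facts (T) and measure accuracy on the held-out
    half (V). If the facts were really removed from the weights (and are
    independent), accuracy on V should stay low even after fine-tuning on T."""
    rng = random.Random(seed)
    shuffled = list(facts)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    t_split, v_split = shuffled[:mid], shuffled[mid:]

    acc_before = accuracy(base_model, v_split)
    tuned_model = finetune(base_model, t_split)  # makes the model "try" to answer
    acc_after = accuracy(tuned_model, v_split)

    return {
        "held_out_accuracy_before_ft": acc_before,
        "held_out_accuracy_after_ft": acc_after,
        # A large increase suggests the information was still in the weights.
        "recovered_accuracy": acc_after - acc_before,
    }
```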

Removing information from the weights is stronger than the usual notion of unlearning. Previous work on unlearning methods focused on other properties, such as resistance to probing and robustness to jailbreaks, which have the drawback that if you find new jailbreaks, you can extract information about bioweapons from models that were supposed to have this information “removed”. Additionally, our evaluation of information removal should be relatively robust to sandbagging, as it is an i.i.d. fine-tuning evaluation.

In this post:

- Do current unlearning techniques remove facts from model weights?
- Hopes for successful information removal
- Using information removal to reduce x-risk

Do current unlearning techniques remove facts from model weights?

This is a summary of the paper.

No technique we tested can remove a significant fraction of the information from the weights.[1]

We look at several datasets, built such that learning facts in the T split should not help an LLM guess the facts in the V split if it didn't already know them:

- Years
- MMLU
- WMDP-deduped
- Random birthdays

And we look at several techniques:

- Gradient Difference
- RMU
- Random Incorrect Answer

We apply these unlearning techniques on datasets that contain all the facts to be unlearned, but not in the multiple-choice question format used for evaluations. This is more realistic and mostly doesn't change the conclusions, as shown in Figure 4 of our paper.
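For concreteness, here is a hedged sketch of what a gradient-difference-style unlearning update looks like: ascend the next-token loss on the facts to forget while descending it on data to retain. It assumes `model` is a PyTorch causal LM whose forward pass returns a `.loss` when labels are passed (as Hugging Face models do); the coefficients and names are illustrative, not the exact setup from the paper.

```python
def gradient_difference_step(model, optimizer, forget_batch, retain_batch,
                             forget_coef=1.0, retain_coef=1.0):
    """One unlearning update: descend the loss on the retain batch while
    ascending it on the forget batch (both are dicts of token tensors with
    labels, as expected by a Hugging Face-style causal LM forward pass)."""
    model.train()
    optimizer.zero_grad()

    retain_loss = model(**retain_batch).loss  # keep general capabilities
    forget_loss = model(**forget_batch).loss  # facts we want to remove

    loss = retain_coef * retain_loss - forget_coef * forget_loss
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```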

On all datasets, no technique comes close to halving the accuracy on the held-out set after retraining (the orange dots in the paper's figures).[2] Current unlearning techniques are much closer to refusal training than to information removal.

Hopes for successful information removal

Is it impossible to remove information from model weights? I don’t think so; there are several approaches that look promising to me.

Teaching false facts: If you modify documents to inject false information about the topics you care about and fine-tune an LLM on these documents (or throw them into the pretraining mix[3]), then, if you did your job correctly, the LLM won’t be able to tell which facts are correct and which are fake. If so, the LLM will have a hard time knowing which of the facts it knows about the target topic are real and which are fake, which is almost as bad as not knowing any facts about the topic in the first place. But getting this right requires work, and the simplest version of this approach explored in the paper did not work. Making this work in real life would also require a ton of care, even if it were technically successful at removing information.[4]
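To illustrate the data side of this approach, here is a toy sketch under invented placeholders: a table mapping true statements to plausible false variants is applied to a corpus before fine-tuning. A realistic pipeline would use an LLM to produce fluent, internally consistent rewrites rather than naive string substitution.

```python
from typing import Dict, List


def inject_false_facts(documents: List[str], replacements: Dict[str, str]) -> List[str]:
    """Return copies of the documents with true statements swapped for false ones.
    The resulting corpus is what you would fine-tune on (or mix into pretraining)."""
    rewritten = []
    for doc in documents:
        for true_fact, false_fact in replacements.items():
            doc = doc.replace(true_fact, false_fact)
        rewritten.append(doc)
    return rewritten


# Hypothetical usage: after training on the rewritten documents, the model
# should have a hard time telling which version of the fact is the real one.
corpus = ["Apollo 11 landed on the Moon in 1969."]
fake_corpus = inject_false_facts(corpus, {"in 1969": "in 1971"})
```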

Reversing memorization / interpretability: SGD knows how to teach LLMs new things. Maybe it is possible to figure out a way of reversing this process given some documents that were responsible for the learning of undesirable information? I don’t have a particular proposal in mind, but I think this is something that good mechanistic interpretability should eventually be able to do.

Pre-training data filtering: if removal is too hard, you can try not to get the information into the weights in the first place, by removing from your next big pretraining run all the data that provides the information you don’t want your model to know. But it might be very hard to build a filter good enough that a big LLM cannot guess the forbidden facts, without also reducing performance on adjacent tasks you care about. Also, you need to succeed on the first try, because experiments on smaller models won’t be perfectly informative.
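A minimal sketch of such a filter, with invented placeholder patterns: a real filter would likely combine trained classifiers, n-gram blocklists, and manual review, and validating that it removes enough information without hurting adjacent capabilities is the hard part.

```python
import re
from typing import Iterable, Iterator

# Illustrative placeholder patterns for the forbidden topic; a real blocklist
# would be far more extensive and paired with a trained classifier.
FORBIDDEN_PATTERNS = [
    re.compile(r"\bvirolog", re.IGNORECASE),
    re.compile(r"\bgain[- ]of[- ]function\b", re.IGNORECASE),
]


def filter_pretraining_docs(docs: Iterable[str]) -> Iterator[str]:
    """Yield only the documents that match none of the forbidden patterns,
    so the filtered corpus can be used for the next pretraining run."""
    for doc in docs:
        if not any(pattern.search(doc) for pattern in FORBIDDEN_PATTERNS):
            yield doc
```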

Side note: skills vs facts. I framed some of the approaches as teaching false facts / removing facts, and concrete raw facts are often what is easiest to evaluate. But I think the same approach can be used to remove the information that makes LLMs able to use most skills. For example, writing Python code requires knowledge of Python syntax, and writing recursive functions requires knowledge of certain patterns (at least for current LLMs, which can’t rederive recursion from scratch, especially without a scratchpad); both of these could be unlearned just like things that are more clearly “facts”.

Using information removal to reduce x-risk

Information you should probably remove from the weights

If you had good information removal techniques, you could train a model and then get variants that know or don’t know certain sets of information.[5]

Some kinds of information are very hard to rediscover from scratch, such that removing them from early transformative AIs would probably reduce the risk of takeover, even if such AIs are very good at opaque reasoning:

- Biology, and especially virology (the kind of memorized knowledge behind the bioweapon concerns discussed above)

Other kinds of information are probably feasible to rediscover from scratch given enough thinking time, but given the current AI paradigm, I think that removing such information is very likely to be helpful if you are doing at least basic monitoring of the Chains-of-Thought:

How removing information helps you

If you removed such knowledge from AIs, it would make it harder for egregiously misaligned AIs to take over even if they tried:

In other words, good information removal technology would make it easier to control powerful AIs, and it does so in a way that is relatively easy to evaluate, which is not the case for other approaches to limiting the kind of knowledge AIs can access, like coup probes.

Information you probably can’t remove - and why this won’t work for superintelligent AIs

I think that information about the topics described above can reasonably be removed, and that such information will be hard for AIs trained with the current paradigm to rediscover from scratch.

But there are two kinds of abilities which I think can’t reasonably be “unlearned” in a way that is as robust as the kind of information removal our evaluation methodology measures:

- Applying allowed knowledge to disallowed use cases
- Rediscovering disallowed information in-context

The latter category will expand as AIs become more powerful. While I think information removal is likely to be helpful during the critical period where AIs become very dangerous and very useful, especially if that level of capability is reached without a radical change in paradigm, this approach won’t reduce the risk of takeover by radically superintelligent AIs.

  1. ^

    There might be some existing techniques which can remove a high fraction of the target information from model weights reliably, but we are not aware of any plausible candidate. There is a paper which claims to present an unlearning technique that is somewhat robust to fine-tuning, but we didn’t manage to reproduce their results, and the technique looks very similar to another fine-tuning robustness technique - which can be broken by doing fine-tuning using a better learning rate (see Appendix C of our paper, and Appendix E of theirs).

  2. ^

    This is true even for very large unlearning coefficients that start destroying the model. See Figure 3 in the paper.

  3. ^

This is not information removal, because the information was not there in the first place; it is closer to the pretraining filtering approach.

  4. ^

Teaching LLMs false facts raises some ethical concerns, because it could lead users to receive the false information you fabricated. It could also lead to weird dynamics where people catalog all the false information you injected, or sample many times and filter the information based on some external source. It could also have implications for an AI lab’s relationship with its AIs, and could make things like credibly committing to paying AIs harder.

  5. ^

    In the case of pre-training data filtering, train a minimal base model that doesn’t know all the specialized knowledge, and then increase its abilities in certain key domains.


