LessWrong, October 12, 2024
EIS XIV: Is mechanistic interpretability about to be practically useful?

 

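The article sorts interpretability research into five tiers of rigor: pontification, basic science, streetlight/toy demos, useful engineering, and net safety benefit. It also reviews recent work across the tiers, noting a good deal of progress in tier 3, evaluates some of that work, and offers new predictions.

🎯 Five tiers of rigor for interpretability research: (1) pontification, where researchers claim to have successfully interpreted a model by definition or by analyzing results; (2) basic science, where researchers develop an interpretation, use it to make a simple prediction, and validate that prediction; (3) streetlight/toy demos, where researchers accomplish a useful type of task in a way that is toy, cherry-picked, or under a streetlight; (4) useful engineering, where researchers show that an interpretability tool can accomplish a useful task uniquely or competitively; (5) net safety benefit, where researchers convincingly demonstrate that an interpretability tool is differentially beneficial for reducing risk.

📄 Recent results: there has been some work in tier 3. Marks et al. solved a toy task of identifying and mitigating a known gender bias; Anthropic's Golden Gate Claude is a feat of streetlight model editing; Arditi et al. demonstrated a method for controlling model refusal; Yu et al. applied it to adversarial training. Separately, some have argued that certain work may belong in tier 4, but the author does not subjectively feel that it fully breaks through.

🔮 Predictions: the author makes new predictions about sparse autoencoder (SAE) research, including finding and manually fixing harmful behaviors represented in the SAE training data, finding novel input-space attacks, and using SAEs to detect anomalies. Each prediction requires that SAEs be shown to be competitive in fair fights against other relevant baseline techniques.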

Published on October 11, 2024 10:13 PM GMT

Is this market really only at 63%? I think you should take the over. 


Five tiers of rigor for safety-oriented interpretability work

Lately, I have been thinking of interpretability research as falling into five different tiers of rigor. 

1. Pontification

This is when researchers claim they have succeeded in interpreting a model by definition or based on analyzing results and asserting hypotheses about them. This is a key part of the scientific method. But by itself, it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.

2. Basic Science

This is when researchers develop an interpretation, use it to make some (usually simple) prediction, and then show that this prediction validates. This is at least doing science, but it doesn't necessarily demonstrate any usefulness or value.

3. Streetlight/Toy Demos

This is when researchers accomplish a useful type of task with an interpretability technique but do so in a way that is toy, cherry-picked, or under a streetlight.

4. Useful Engineering

This is when researchers show that an interpretability tool can be used uniquely or competitively to accomplish a useful task. For this level of rigor, it needs to be convincingly demonstrated that doing the task with interpretability is better than other ML techniques. I think that there is currently at least one example of work low in this tier. 

5. Net Safety Benefit

This would be when researchers convincingly demonstrate that an interpretability tool isn't just practically and competitively useful, but is useful in a way that is differentially beneficial for reducing risks rather than being captured for capabilities. By my understanding, this tier has not yet been reached. And unless it is, the field of AI (mechanistic) interp will have been, at best, a big waste from a safety standpoint.

What’s been happening lately?

Recently, some solid work has been done in tier 3. 

A few months ago, I remember hearing about some new work on SAE interpretability – Marks et al. (2024). The person I heard about it from was pretty excited, and when I pulled up the paper, I thought to myself “here we go, let’s see” and mentally prepared to look for holes in it. But when I read the paper, I thought it was pretty great and the kind of thing that could pull interpretability research in a really positive direction. To be clear, the paper solved a toy task – identifying and mitigating a known gender bias in a small transformer. But it was done in a way that did not require disambiguating labels and mirrors realistic debugging situations in which red-teamers may not know exactly what they are looking for in advance. 

Recently, we have also seen Anthropic’s Golden Gate Claude – an impressive feat of streetlight model editing (though I wish that Anthropic had tried to demo unlearning instead). Meanwhile, Arditi et al. (2024) demonstrated a fairly clean method for controlling model refusal using linear perturbations. Yu et al. (2024) used this for adversarial training but didn't outcompete LAT (Sheshadri et al., 2024). Finally, Smith and Brinkman (2024) used sparse autoencoders to find some simple adversarial vulnerabilities in reward models.

All of the above demos could be argued to be high in tier 3. Meanwhile, in an exchange a few months ago, Christopher Potts presented an argument to me for why representation finetuning (Wu et al., 2024) methods inspired by findings from interchange intervention techniques might be an example of tier 4 work. I think this is a really useful point, but I don’t subjectively feel that this convincingly breaks through. Its connection to interpretability research is mostly via conceptual inspiration rather than specific mechanistic insights.  

I think that tier 4 has (barely) been broken into. 

About a year ago, Schut et al. (2023) did what I think was (and maybe still is) the most impressive interpretability research to date. They studied AlphaZero's chess play and showed how novel performance-relevant concepts could be discerned from mechanistic analysis. They worked with skilled chess players and found that they could help these players learn new concepts that were genuinely useful for chess. This appears to be a reasonably unique way of doing something useful (improving experts' chess play) that may have been hard to achieve in some other way. 

Current efforts may soon break further into tier 4.

Finally, others have been cooking. I frequently hear about people working toward engineering applications of interpretability tools. Neel Nanda recently posted a Metaculus market on whether sparse autoencoders (or other dictionary learning techniques) will be successfully used on a downstream task in the next year and beat baselines. Props to Neel for good field building with a good market. I think the ground rules laid out for the market hit the nail on the head for what would be an impressive and game-changing advancement in mechanistic interpretability. Reading between the lines, it’s also pretty easy to infer that Neel and collaborators at Google DeepMind and MATS are working on this right now. They also talk openly about how they are working on engineering applications of interp, but I haven't heard specifics from anyone at GDM.

Somehow, the market currently stands at 63%. I think this is surprisingly low. I would definitely take the over. 

What might happen next with SAEs

Past predictions

In May, I made some predictions about what Anthropic’s next research paper on sparse autoencoders would do. See the full predictions in this post. But in short, I thought that each of the following things would happen with these probabilities. I have marked with a ✅ things that the paper did, and an ❌ things that the paper didn’t do.

    ✅ 99% – “Eye-test” experiments
    ✅ 95% – Streetlight edits
    ✅ 80% – Some cherry-picked proof of concept for a useful type of task
    ❌ 20% – Doing PEFT by training sparse weights and biases for SAE embeddings in a way that beats baselines like LORA
    ❌ 20% – Passive scoping
    ❌ 25% – Finding and manually fixing a harmful behavior that WAS represented in the SAE training data
    ❌ 5% – Finding and manually fixing a novel bug in the model that WASN'T represented in the SAE training data
    ❌ 15% – Using an SAE as a zero-shot anomaly detector
    ❌ 10% – Latent adversarial training under perturbations to an SAE's embeddings
    ❌ 5% – Experiments to do arbitrary manual model edits

On one hand, these predictions were all individually good – all were on the right side of 50%. But overall, the paper underperformed expectations. If you scored the paper relative to my predictions by giving it (1-p) points when it did something that I predicted it would do with probability p and -p points when it did not, the paper would score -0.74. 
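As a quick check on that arithmetic, here is a minimal sketch of the scoring rule described above. The probabilities and outcomes are copied from the list of past predictions; the names `PAST_PREDICTIONS` and `score` are made up for this illustration and do not come from any published code.

```python
# Each entry: (probability assigned in May, whether the paper did it).
# Values are copied from the list of past predictions; names are illustrative.
PAST_PREDICTIONS = [
    (0.99, True),   # "Eye-test" experiments
    (0.95, True),   # Streetlight edits
    (0.80, True),   # Cherry-picked proof of concept for a useful type of task
    (0.20, False),  # PEFT via sparse weights and biases that beats baselines
    (0.20, False),  # Passive scoping
    (0.25, False),  # Fixing a harmful behavior represented in SAE training data
    (0.05, False),  # Fixing a novel bug not represented in SAE training data
    (0.15, False),  # SAE as a zero-shot anomaly detector
    (0.10, False),  # Latent adversarial training on SAE embeddings
    (0.05, False),  # Arbitrary manual model edits
]


def score(predictions):
    """Add (1 - p) for each event that happened and -p for each that did not."""
    return sum((1.0 - p) if happened else -p for p, happened in predictions)


paper_score = score(PAST_PREDICTIONS)
print(f"{paper_score:.2f}")  # prints -0.74
```

The three hits contribute 0.01 + 0.05 + 0.20 = 0.26, and the seven misses subtract probabilities that sum to 1.00, which gives the -0.74 quoted above.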

You can read my full reflection in the previous post of the sequence: EIS XIII. Overall, I think that the paper under-delivered and was somewhat overhyped. It had me wondering about safety-washing, especially in light of how some less knowledgeable and shamelessly dishonest actors have greatly overstated the progress being made in ways that could be politically hazardous if policymakers are misled. 

New predictions

Here is a new set of predictions. Overall, I’m going to double down on some similar ideas, but I have some updates. 

One difference is that I will make predictions simultaneously about Google DeepMind, OpenAI, and Anthropic – I’m not familiar enough with what’s happening inside each to distinguish between them in my predictions. So when I say they will do X with probability p, I am saying this about all three at once.

Meanwhile, predictions are made ignoring the possibility of them being self-fulfilling. I’ll make them about SAEs but I’ll count it if they do these things with another dictionary learning method such as clustering.

    1. ❓60% – Finding and manually fixing a harmful behavior that WAS represented in the SAE training data in a way that is competitive with appropriate fine-tuning and machine unlearning baselines.
    2. ❓20% – Finding novel input space attacks that exploit the model in a way that is competitive with appropriate adversarial attack baselines.
    3. ❓20% – Using SAEs to detect – either by sparsity thresholds or a reconstruction loss threshold – anomalies in a way that is competitive with appropriate statistical anomaly detection baselines.
    4. ❓15% – Finding and manually fixing a harmful behavior that WAS CONVINCINGLY NOT represented in the SAE training data in a way that is competitive with appropriate fine-tuning and machine unlearning baselines.
    5. ❓15% – Fine-tuning the model via sparse perturbations to the sparse autoencoder’s embeddings in a way that is competitive with appropriate PEFT baselines.
    6. ❓15% – Performing arbitrary (e.g. not streetlight) model edits in a way that is competitive with appropriate fine-tuning and model editing baselines.
    7. ❓10% – Performing latent adversarial attacks and/or latent adversarial training on the SAE neurons in a way that is competitive with latent-space approaches.
    8. ❓10% – Demonstrating that SAEs can be used to make the model robust to exhibiting harmful behaviors not represented in the SAE’s training data in a way that is competitive with appropriate compression baselines.

Note that these are what I think will happen – not what I think is possible. I actually think that all of these are likely possible, except that I'm not so sure about item 6.

Also keep in mind that all of these predictions require that SAEs are demonstrated to be competitive in fair fights with other relevant baseline techniques – if it were not for this requirement, my probabilities above would be much higher.

Once the next Anthropic, GDM, or OpenAI paper on SAEs comes out, I will evaluate my predictions in the same way as before. I will score the paper relative to my predictions by giving it (1-p) points when it does something that I predicted it would do with probability p and -p points when it does not. Note that this is arguably a flawed measure because these 8 events are not independent, but I will proceed nonetheless.  
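To give a sense of how such a score might be read, the sketch below computes two properties of this scoring rule for the eight new probabilities: the range of possible scores, and the expected score under the assumption that the stated probabilities are exactly right. The function names are hypothetical, and the calibration assumption is mine for illustration; it is not a claim about how the predictions will turn out.

```python
# Probabilities copied from the eight new predictions, in order.
NEW_PROBABILITIES = [0.60, 0.20, 0.20, 0.15, 0.15, 0.15, 0.10, 0.10]


def score_bounds(probabilities):
    """Return (worst, best): no event happens vs. every event happens."""
    worst = -sum(probabilities)
    best = sum(1.0 - p for p in probabilities)
    return worst, best


def expected_score(probabilities):
    """Expected score, ASSUMING each event occurs with its stated probability.

    Each event contributes p * (1 - p) when it happens and (1 - p) * (-p)
    when it does not, so every term cancels.
    """
    return sum(p * (1.0 - p) + (1.0 - p) * (-p) for p in probabilities)


worst, best = score_bounds(NEW_PROBABILITIES)
expected = expected_score(NEW_PROBABILITIES)
print(f"worst={worst:.2f} best={best:.2f} expected={expected:.2f}")
```

Under these assumptions the score can range from -1.65 to 6.35, and a perfectly calibrated forecaster expects zero, so a negative score reads as the paper doing less than predicted and a positive score as it doing more. Because expectation is linear, the zero baseline holds even when the events are correlated. What the lack of independence affects is how widely the score is likely to spread around zero.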

What if we succeed?

For the past few years, I have spent a lot of thought and time on mechanistic interpretability. This includes several papers, numerous collaborations, and this sequence. But now that mechanistic interpretability may be on the verge of being useful, I feel that I have only recently come to appreciate something that I didn’t before. 

I named this sequence the “Engineer’s Interpretability Sequence,” dedicating it to the critique that mechanistic interpretability tools are struggling to be useful. But sometimes I wonder if I should have spent less time worrying about mechanistic interpretability’s failures and more time worrying about what happens if it succeeds. At the end of the day, it doesn’t matter whether these tools are useful to engineers – unless mechanistic interpretability tools offer a net safety benefit, all work on them will have been one big waste (from an AI safety perspective).

Mechanistic interpretability, if and when it is useful, will probably offer a defender’s advantage. I think it will generally be much easier to remove capabilities with mechanistic techniques than to add them. And I think that mechanistic techniques are a useful part of the AI evaluation toolbox.

However, it is not hard to imagine how it could be used to advance capabilities. And I have very limited confidence in scaling labs only using mechanistic interpretability for good. Unfortunately, I think that it will be hard to effectively monitor the future uses and impacts of interpretability techniques due to safety washing and a lack of transparency into scaling labs. 



