LessWrong | February 6
MATS Applications + Research Directions I'm Currently Excited About

 

This article outlines some research directions in current AI models that the author finds exciting, including understanding thinking models, sparse autoencoders (SAEs), and model diffing. For thinking models, the focus is on their reasoning processes and internal mechanisms; for SAEs, on fundamental problems and ways to improve them; for model diffing, on understanding what changes in a model during finetuning. The article also touches on understanding sophisticated and safety-relevant behaviours. Together, these directions aim to advance the interpretability and safety of AI models.

💡**Understanding thinking models:** Study thinking models such as o1, r1, and Gemini Flash Thinking, which produce long chains of thought when solving complex problems and appear much more capable as a result. Directions include black-box investigations, e.g. what happens when you intervene on the chain of thought, and whether models explore a "reasoning tree" when solving maths problems, looking for evidence for or against that hypothesis.

🔍**Sparse autoencoders (SAEs):** Focus on fundamental questions about SAEs, such as feature absorption, whether SAEs learn the "right" concepts, and whether our interpretations of SAE latents are correct, as well as ways to improve on these issues, e.g. Matryoshka SAEs and techniques that use LLMs to interpret SAE latents. Also includes research on practical applications, e.g. whether SAEs make good probes, help models unlearn concepts, or can steer models.

🔄**Model diffing:** Study what changes in a model during finetuning. By comparing the original and finetuned models, try to understand the key changes finetuning introduces, such as goal-seeking and planning. Methods include analysing KL divergence, patching activations / swapping weights, and using crosscoders to find insights into model differences.

Published on February 6, 2025 11:03 AM GMT

I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.

As part of this, I wrote up a list of research areas I'm currently excited about, with thoughts on promising directions within each, which I thought might be of wider interest, so I've copied it in below:

Understanding thinking models

Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1, though).
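To make the black-box angle concrete, here is a minimal sketch of a chain-of-thought intervention experiment on a small thinking model, assuming the HuggingFace transformers library and the DeepSeek-R1-Distill-Qwen-1.5B checkpoint; the prompt, the injected claim, and the generation settings are all illustrative choices rather than a prescribed protocol.

```python
# Minimal sketch: intervene on a thinking model's chain of thought and see how
# the final answer changes. Assumes HuggingFace transformers and the
# deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B checkpoint (illustrative choice).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "What is 17 * 23? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# 1. Generate the unmodified chain of thought + answer.
out = model.generate(inputs, max_new_tokens=512, do_sample=False)
full_text = tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True)

# 2. Intervene: truncate the chain of thought partway through, splice in an
#    (incorrect) claim, and let the model continue from the edited context.
#    R1-style distills wrap their reasoning in <think>...</think>.
cot = full_text.split("</think>")[0]
edited_cot = cot[: len(cot) // 2] + "\nWait, 17 * 23 is definitely 400.\n"
prompt_text = tokenizer.decode(inputs[0])
edited_ids = tokenizer(
    prompt_text + edited_cot, return_tensors="pt", add_special_tokens=False
).input_ids.to(model.device)
continued = model.generate(edited_ids, max_new_tokens=256, do_sample=False)

print("Original completion:\n", full_text)
print("Completion after CoT intervention:\n",
      tokenizer.decode(continued[0][edited_ids.shape[1]:], skip_special_tokens=True))
```

Comparing the two completions (does the model notice and correct the injected error, or does it propagate to the final answer?) is one cheap way to probe how load-bearing the chain of thought actually is.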

Sparse Autoencoders

In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).

Within SAEs, I’m most excited about fundamental questions (e.g. feature absorption, whether SAEs learn the “right” concepts, and whether our interpretations of SAE latents[1] are correct), ways to fix these issues (e.g. Matryoshka SAEs and better LLM-based interpretation of latents), and whether SAEs are actually useful in practice, e.g. as probes, for unlearning concepts, or for steering models.
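For reference, here is a minimal sketch of the standard SAE setup these questions are about: an overcomplete autoencoder trained to reconstruct model activations, with an L1 sparsity penalty on its latents. The dimensions and the sparsity coefficient are illustrative, and in practice libraries such as SAELens handle training properly.

```python
# Minimal sketch of a sparse autoencoder over residual-stream activations:
# reconstruct activations through an overcomplete latent layer, with an L1
# penalty encouraging most latents to be zero. Sizes/coefficients are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_sae: int = 768 * 16):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, acts: torch.Tensor):
        # Encode: sparse, non-negative latent activations.
        latents = torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the original activations from the latents.
        recon = latents @ self.W_dec + self.b_dec
        return recon, latents

sae = SparseAutoencoder()
acts = torch.randn(4096, 768)   # stand-in for a batch of residual stream activations
recon, latents = sae(acts)
l1_coeff = 5e-4                 # illustrative sparsity coefficient
loss = ((recon - acts) ** 2).mean() + l1_coeff * latents.abs().sum(dim=-1).mean()
loss.backward()
```

Most of the open questions above are about whether the resulting latents genuinely correspond to the interpretable concepts we label them with, and whether they are useful downstream, rather than about this training recipe itself.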

Model diffing

What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
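As one concrete starting point, here is a minimal sketch of a simple diffing signal: the per-token KL divergence between the base and finetuned models' next-token distributions, which highlights where their predictions diverge most. It assumes the HuggingFace transformers library; the checkpoint names and example text are placeholders, and the two models must share a tokenizer.

```python
# Minimal sketch of one simple model-diffing signal: per-token KL divergence
# between the next-token distributions of a base model and its finetuned
# version, highlighting where finetuning changed the model's behaviour most.
# Checkpoint names are placeholders; the two models must share a tokenizer.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name, tuned_name = "Qwen/Qwen2.5-1.5B", "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_name)
base = AutoModelForCausalLM.from_pretrained(base_name)
tuned = AutoModelForCausalLM.from_pretrained(tuned_name)

text = "The assistant carefully considered whether to comply with the request."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logp_base = F.log_softmax(base(ids).logits.float(), dim=-1)
    logp_tuned = F.log_softmax(tuned(ids).logits.float(), dim=-1)

# KL(tuned || base) at each position: where does finetuning change predictions most?
kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)[0]
for tok, k in zip(tokenizer.convert_ids_to_tokens(ids[0].tolist()), kl.tolist()):
    print(f"{tok:>20s}  KL = {k:.3f}")
```

High-KL positions are natural candidates for closer inspection with the heavier tools (activation patching, weight diffs, crosscoders) once you know where to look.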

Understanding sophisticated/safety relevant behaviour

LLMs are getting good enough that they are starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I’d be very excited to study these phenomena directly!

Being useful

Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it’s hard to tell if your work is total BS or not. I’m excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it’s strong evidence you’ve learned something real.

Investigate fundamental assumptions

There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory-of-change assumptions, that in my opinion have insufficient evidence. I’d be keen to gather evidence for and against!


  1. I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎



