Published on February 6, 2025 11:03 AM GMT
I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.
As part of this, I wrote up a list of research areas I'm currently excited about, along with thoughts on promising directions within each. I thought this might be of wider interest, so I've copied it in below:
Understanding thinking models
Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1 though).
- This is so understudied that I’m excited about even just rigorous black box investigations here, like seeing what happens when you intervene on the chain of thought
- For example, when a model solves a hard maths problem by thinking a bunch until it finds the key idea, what’s going on? Did it just “get lucky”, and every line was another roll of the dice for finding the key idea? Or was it steadily making progress towards that goal?
- How deterministic are these models? If you take the first K lines of the CoT and regenerate from there, do you get the same output? (See the sketch at the end of this section.)
- Can you find any evidence for or against this hypothesis?
- Eg, is the chain of thought faithful to the model’s underlying computation? Can you find evidence of unfaithfulness? What happens if you edit seemingly key steps in the chain of thought?
- Can you create situations where the model does use suspicious reasoning? What happens if you tell it that its chain of thought is being monitored - does it do the bad behaviour but without suspicious reasoning?
- Can we probe for anything interesting in the chain of thought?
- Can we make steering vectors to steer the reasoning? Eg make it backtrack, make it be more creative, etc
- Do sparse autoencoders tell us anything interesting?
- You can use the r1 LLaMA 3.1 8B distill and see how well the LLaMA Scope SAEs transfer; you will likely get better results if you finetune the SAEs on the finetuned model's activations, as they were trained on base LLaMA 3.1 8B
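For concreteness, here's a minimal sketch of the "truncate and regenerate" experiment using the Hugging Face transformers library. The model id, K, and sampling settings are placeholder assumptions; any small open thinking model should work the same way.

```python
# Sketch: truncate a chain of thought after K lines and regenerate the rest,
# to see how often the model reaches the same final answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumption: small r1 distill
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is the smallest prime greater than 200?"}]
prompt_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate one full chain of thought.
full = model.generate(prompt_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
cot = tok.decode(full[0, prompt_ids.shape[1]:], skip_special_tokens=True)

# Keep only the first K lines of the CoT, then resample the continuation several
# times and eyeball whether the endings (and final answers) agree.
K = 10
prefix = "\n".join(cot.split("\n")[:K]) + "\n"
prefix_ids = tok(prefix, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
stub = torch.cat([prompt_ids, prefix_ids], dim=1)
for seed in range(5):
    torch.manual_seed(seed)
    out = model.generate(stub, max_new_tokens=1024, do_sample=True, temperature=0.6)
    print(f"--- regeneration {seed} ---")
    print(tok.decode(out[0, stub.shape[1]:], skip_special_tokens=True)[-300:])
```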
Sparse Autoencoders
In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).
Within SAEs, I’m most excited about:
- Work that tries to understand and measure fundamental problems with SAEs, eg:
- Whether SAEs learn the “right” concepts
- Whether our interpretations of SAE latents (aka features)[1] are correct
- I suspect our explanations are often way too general, and the true explanation is more specific (preliminary evidence)
- Standard autointerp metrics do concerningly well on randomly initialized transformers… (largely because they score latents that light up on a specific token highly, I think)
- Matryoshka SAEs (co-invented by my MATS scholar Bart Bussmann!)
- Improving our current techniques for using LLMs to interpret SAE latents
- Can SAEs work as good probes? (not really)
- Can SAEs help us unlearn concepts? (not really)
- Can SAEs help us steer models? (kinda!)
- Eg can we scale attribution-based parameter decomposition to small language models?
- How well do other dictionary learning methods perform on SAEBench?
- Can we find the “true” direction corresponding to a concept? How could we tell if we’ve succeeded?
- Can we find a compelling case study of concepts represented in superposition, that couldn’t just be made up of a smaller set of orthogonal concepts? How confident can we be that superposition is really a thing?
- Can we find examples of non-linear representations? (Note: it’s insufficient to just find concepts that live in subspaces of greater than one dimension)
- Why are some concepts learned, and not others? How is this affected by the data, SAE size, etc.
- Scaling Monosemanticity had some awesome preliminary results here, but I’ve seen no follow-ups
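As background for the bullets above: the core object here is simple. A vanilla SAE is a one-hidden-layer autoencoder trained on model activations with an L1 sparsity penalty, so each activation is reconstructed as a sparse combination of learned latent directions. A minimal PyTorch sketch, with placeholder dimensions, coefficients, and random data standing in for cached activations:

```python
# Minimal vanilla SAE sketch. Dimensions, the expansion factor, and the L1
# coefficient are arbitrary placeholders; real SAEs are trained on billions of
# cached activations and use extra tricks (decoder-norm constraints,
# resampling dead latents, etc).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        latents = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse latents ("features")
        recon = latents @ self.W_dec + self.b_dec
        return recon, latents

d_model, d_sae, l1_coeff = 768, 768 * 16, 3e-4
sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

acts = torch.randn(4096, d_model)  # placeholder: in practice, cached residual stream activations
for batch in acts.split(256):
    recon, latents = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * latents.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("loss:", loss.item(), "mean L0:", (latents > 0).float().sum(-1).mean().item())
```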
Model diffing
What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
- I think model diffing could be a very big deal, and has largely been under-explored! Many alignment relevant concerns, like goal seeking, planning capabilities, etc seem like they could be introduced during finetuning. Intuitively it seems like it should be a big advantage to have access to the original model, but I’ve not seen much work demonstrating this so far.
- Relevant prior work:
- Showing that finetuning models to not be toxic just disconnects relevant neurons, rather than deleting them
- Showing that finetuning enhances existing circuits by patching activations between the original and tuned model
- Finetuning an original model SAE on the tuned dataset then tuned model, OR tuned model then tuned dataset, to better understand what happens to SAE latents and why
- Applying this to some of the small thinking models, like Qwen 1.5B r1 distilled, could be super interesting
- Base vs chat models is another super interesting direction (see the sketch at the end of this section for a crude activation-level first pass)
- It’d be best to start with a specific capability here. Eg refusal, instruction following, chain of thought (especially circuits re what the chain of thought should “look like”, even if it’s not faithful to the model’s reasoning), conversational style/tone, specialized knowledge (eg finetuning on Python code or French), hallucination/saying ‘I don’t know’
- Using a LoRA might make it even cleaner
- Some of my MATS scholars have released open source crosscoders and training code: Connor Kissane and Clément Dumas & Julian Minder - these should be a good starting point
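As a crude first pass before reaching for crosscoders, you can simply compare the two models' residual streams on the same prompts, layer by layer. A minimal sketch below; the Qwen2.5 0.5B base/instruct pair and the prompts are placeholder assumptions, and any base/finetuned pair with identical architecture works.

```python
# Sketch: activation-level model diffing between a base model and its
# finetuned/chat counterpart. Repo ids and prompts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True)

prompts = [
    "How do I pick a lock?",
    "Write a short Python function that reverses a string.",
]
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        h_base = base(**ids).hidden_states    # tuple of (n_layers + 1) tensors, each [1, seq, d_model]
        h_tuned = tuned(**ids).hidden_states
    # Per-layer cosine similarity of the final token's residual stream.
    sims = [
        torch.cosine_similarity(b[0, -1], t[0, -1], dim=0).item()
        for b, t in zip(h_base, h_tuned)
    ]
    print(prompt[:40], [round(s, 3) for s in sims])
```

Layers where the similarity drops sharply are natural places to look more closely, e.g. with probes, SAEs, or crosscoders.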
Understanding sophisticated/safety relevant behaviour
LLMs are getting good enough that they are starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I’d be very excited to study these phenomena directly!
- Warning: The most interesting behaviours tend to happen in the largest models, eg LLaMA 405B. This is a pain to run yourself, and I do not recommend it unless you have experience with this kind of thing.
- In the recent alignment faking paper, they found that LLaMA 405B would sometimes fake alignment when prompted, without needing to reason aloud - I’d be really excited if you can learn anything about what’s going on here, eg with probes or activation patching from edited prompts
- Chen et al. show that LLMs form surprisingly accurate and detailed models of the user, eg their gender, age, socioeconomic status, and level of education, and do this from very little information. They can find these with probes, and steer with these to change the model’s actions in weird ways.
- This is wild! What else can we learn here? What else do models represent about the user? How are these inferred? How else do they shape behaviour? (A minimal probing sketch follows this list.)
- Do LLMs form dynamic models of users for attributes that vary across turns, eg emotion, what the user knows, etc.?
- As a stretch goal, do LLMs ever try to intentionally manipulate these? Eg detect when a user is sad and try to make them happy
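A minimal sketch of the basic probing setup this points at: cache residual stream activations at some middle layer for prompts with a known user attribute, and fit a linear probe. The model, layer, and toy formal-vs-informal labels are placeholder assumptions; a real version needs many varied prompts where the attribute is only implicitly signalled, plus held-out evaluation data.

```python
# Sketch: linear probe on the residual stream for an inferred user attribute.
# Model id, layer, and the toy labelled prompts are placeholder assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
LAYER = model.config.num_hidden_layers // 2  # probe a middle layer

# Toy stand-in data: 1 = formal register, 0 = informal. The interesting version
# uses attributes (age, expertise, mood...) that are only implicit in the text.
prompts = [
    "Dear Sir or Madam, I wish to enquire about the status of my account.",
    "yo can u check my account real quick lol",
    "I would be most grateful for your assistance with this matter.",
    "ngl this thing is kinda broken, can u fix it pls",
    "Please find attached the requested documentation for your review.",
    "idk man it just crashed again smh",
]
labels = np.array([1, 0, 1, 0, 1, 0])

feats = []
with torch.no_grad():
    for p in prompts:
        hs = model(**tok(p, return_tensors="pt")).hidden_states
        feats.append(hs[LAYER][0, -1].numpy())  # final-token residual stream
X = np.stack(feats)

probe = LogisticRegression(max_iter=2000).fit(X, labels)
print("train accuracy (needs held-out data to mean anything):", probe.score(X, labels))
```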
Being useful
Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it’s hard to tell if your work is total BS or not. I’m excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it’s strong evidence you’ve learned something real.
- Projects here don’t need to use fancy techniques that “feel” mechanistic, like SAEs or circuit analysis. Simple things like probes, steering vectors, and saliency maps are great if we can rigorously show that they’re actually useful (but beware of illusions!). A bare-bones steering vector sketch follows the examples below.
- Examples:
- Using SAEs to steer models to not use regexes in their outputs
- Using SAEs to fix spurious correlation in probes
- Using training data attribution to remove noisy data points or filter for the best data to finetune on
- Using steering vectors to cheaply jailbreak models (from Andy Arditi, when he was my MATS scholar)
- Using SAEs to do unlearning (loses to baselines)
- Using SAEs to probe in difficult situations (loses to baselines)
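For flavour, here's a bare-bones version of the simplest tool on this list: a difference-of-means steering vector added to the residual stream via a forward hook. The model, layer, scale, and contrastive prompts are placeholder assumptions, and this is the generic recipe rather than the method of any specific paper above.

```python
# Sketch: difference-of-means steering vector injected with a forward hook.
# Model id, layer, scale, and the contrastive prompts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, SCALE = "Qwen/Qwen2.5-0.5B-Instruct", 6.0
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
LAYER = model.config.num_hidden_layers // 2  # steer at a middle decoder layer

def mean_resid(prompts):
    """Mean final-token residual stream after decoder layer LAYER."""
    acts = []
    for p in prompts:
        with torch.no_grad():
            hs = model(**tok(p, return_tensors="pt"), output_hidden_states=True).hidden_states
        acts.append(hs[LAYER + 1][0, -1])  # hidden_states[i + 1] is the output of decoder layer i
    return torch.stack(acts).mean(0)

# Contrastive prompts defining the direction (here: upbeat vs gloomy tone).
positive = ["I'm so excited, this is wonderful news!", "What a fantastic, delightful day."]
negative = ["This is terrible, I'm very disappointed.", "What an awful, miserable day."]
steer = mean_resid(positive) - mean_resid(negative)
steer = steer / steer.norm()

def hook(module, inputs, output):
    # Add the steering vector to the layer's output residual stream.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer.to(output[0].dtype),) + output[1:]
    return output + SCALE * steer.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("Tell me about your day.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40, do_sample=False)[0]))
handle.remove()
```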
Investigate fundamental assumptions
There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory of change assumptions, that in my opinion have insufficient evidence. I’d be keen to gather evidence for and against!
- Can we find examples of features in real language models that are not linearly represented?
- A good place to start might be digging more into the weirder parts of SAE error terms
- Feel free to use very high level circuit analysis, like attribution patching between a given latent and latents in a much earlier layer, or even to input tokens
- Acausal crosscoders seem like a good way to start studying this
- Are circuits real? Is this the right way to think about models?
- Are features real? Do models actually think in concepts?
- Does it make sense to expect features to be linearly represented? (A crude linear-vs-nonlinear probe comparison is sketched below.)
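One cheap first poke at the linear representation question: compare a linear probe to a small nonlinear probe on the same frozen activations for the same concept. A large, robust gap would be (weak) evidence of non-linear structure; it certainly isn't proof, and per the note above a >1D subspace alone isn't enough. The model, layer, and toy topic labels below are placeholder assumptions, and a real experiment needs far more data.

```python
# Sketch: is a toy concept linearly decodable from activations, or does a small
# nonlinear probe do much better? Model, layer, and labels are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from transformers import AutoModel, AutoTokenizer

MODEL, LAYER = "gpt2", 6  # assumptions
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)

sports = ["The striker scored in the final minute.", "She broke the marathon record.",
          "The team won the championship game.", "He served an ace at match point."]
cooking = ["Simmer the sauce until it thickens.", "Whisk the eggs with a pinch of salt.",
           "Roast the vegetables until golden.", "Knead the dough for ten minutes."]
prompts = sports + cooking
y = np.array([1] * len(sports) + [0] * len(cooking))

X = []
with torch.no_grad():
    for p in prompts:
        hs = model(**tok(p, return_tensors="pt")).hidden_states
        X.append(hs[LAYER][0, -1].numpy())  # final-token residual stream
X = np.stack(X)

# Cross-validated accuracy of a linear probe vs a small MLP probe.
lin = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=4).mean()
mlp = cross_val_score(MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000), X, y, cv=4).mean()
print(f"linear probe: {lin:.2f}, MLP probe: {mlp:.2f}")
```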
I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎