Published on February 6, 2025 11:03 AM GMT
I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.
As part of this, I wrote up a list of research areas I'm currently excited about, along with thoughts on promising directions within each. I thought this might be of wider interest, so I've copied it in below:
Understanding thinking models
Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1 though).
- This is so understudied that I’m excited about even just rigorous black box investigations here, like seeing what happens when you intervene on the chain of thought
- For example, when a model solves a hard maths problem by thinking a bunch until it finds the key idea, what’s going on? Did it just “get lucky”, and every line was another roll of the dice for finding the key idea? Or was it steadily making progress towards that goal?
- How deterministic are these models? If you take the first K lines of the CoT and regenerate from there, do you get the same output? (See the sketch at the end of this section.)
- Can you find any evidence for or against this hypothesis?
- Eg, is the chain of thought faithful to the model’s underlying computation? Can you find evidence of unfaithfulness? What happens if you edit seemingly key steps in the chain of thought?
- Can you create situations where the model does use suspicious reasoning? What happens if you tell it that its chain of thought is being monitored - does it do the bad behaviour but without suspicious reasoning?
- Can we probe for anything interesting in the chain of thought?
- Can we make steering vectors to steer the reasoning? Eg make it backtrack, make it be more creative, etc
- Do sparse autoencoders tell us anything interesting?
- You can use the r1 LLaMA 3.1 8B distill and see how well the LLaMA Scope SAEs transfer; you will likely get better results if you finetune the SAEs on the finetuned model's activations, as they were trained on base LLaMA 3.1 8B
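For concreteness, here's a minimal sketch of the "truncate and regenerate" experiment using the Hugging Face transformers library. The model id, K, and sampling settings are placeholder assumptions; any small open thinking model should work the same way.

```python
# Sketch: truncate a chain of thought after K lines and regenerate the rest,
# to see how often the model reaches the same final answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumption: small r1 distill
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "What is the smallest prime greater than 200?"}]
prompt_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Generate one full chain of thought.
full = model.generate(prompt_ids, max_new_tokens=1024, do_sample=True, temperature=0.6)
cot = tok.decode(full[0, prompt_ids.shape[1]:], skip_special_tokens=True)

# Keep only the first K lines of the CoT, then resample the continuation several
# times and eyeball whether the endings (and final answers) agree.
K = 10
prefix = "\n".join(cot.split("\n")[:K]) + "\n"
prefix_ids = tok(prefix, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
stub = torch.cat([prompt_ids, prefix_ids], dim=1)
for seed in range(5):
    torch.manual_seed(seed)
    out = model.generate(stub, max_new_tokens=1024, do_sample=True, temperature=0.6)
    print(f"--- regeneration {seed} ---")
    print(tok.decode(out[0, stub.shape[1]:], skip_special_tokens=True)[-300:])
```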
Sparse Autoencoders
In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).
Within SAEs, I’m most excited about:
- Work that tries to understand and measure fundamental problems with SAEs, eg:
- Whether SAEs learn the “right” concepts
- Whether our interpretations of SAE latents (aka features)[1] are correct
- I suspect our explanations are often way too general, and the true explanation is more specific (preliminary evidence)
- Standard autointerp metrics do concerningly well on randomly initialized transformers… (largely because they score latents that light up on a specific token highly, I think)
- Matryoshka SAEs (co-invented by my MATS scholar Bart Bussmann!)
- Improving our current techniques for using LLMs to interpret SAE latents
- Can SAEs work as good probes? (not really)
- Can SAEs help us unlearn concepts? (not really)
- Can SAEs help us steer models? (kinda!)
- Eg can we scale attribution-based parameter decomposition to small language models?
- How well do other dictionary learning methods perform on SAEBench?
- Can we find the “true” direction corresponding to a concept? How could we tell if we’ve succeeded?
- Can we find a compelling case study of concepts represented in superposition, that couldn’t just be made up of a smaller set of orthogonal concepts? How confident can we be that superposition is really a thing?
- Can we find examples of non-linear representations? (Note: it’s insufficient to just find concepts that live in subspaces of greater than one dimension)
- Why are some concepts learned, and not others? How is this affected by the data, SAE size, etc.
- Scaling Monosemanticity had some awesome preliminary results here, but I’ve seen no follow-ups
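As background for the bullets above: the core object here is simple. A vanilla SAE is a one-hidden-layer autoencoder trained on model activations with an L1 sparsity penalty, so each activation is reconstructed as a sparse combination of learned latent directions. A minimal PyTorch sketch, with placeholder dimensions, coefficients, and random data standing in for cached activations:

```python
# Minimal vanilla SAE sketch. Dimensions, the expansion factor, and the L1
# coefficient are arbitrary placeholders; real SAEs are trained on billions of
# cached activations and use extra tricks (decoder-norm constraints,
# resampling dead latents, etc).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        latents = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)  # sparse latents ("features")
        recon = latents @ self.W_dec + self.b_dec
        return recon, latents

d_model, d_sae, l1_coeff = 768, 768 * 16, 3e-4
sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=3e-4)

acts = torch.randn(4096, d_model)  # placeholder: in practice, cached residual stream activations
for batch in acts.split(256):
    recon, latents = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * latents.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
print("loss:", loss.item(), "mean L0:", (latents > 0).float().sum(-1).mean().item())
```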
Model diffing
What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
- I think model diffing could be a very big deal, and has largely been under-explored! Many alignment relevant concerns, like goal seeking, planning capabilities, etc seem like they could be introduced during finetuning. Intuitively it seems like it should be a big advantage to have access to the original model, but I’ve not seen much work demonstrating this so far.
- Relevant prior work:
- Showing that finetuning models to not be toxic just disconnects relevant neurons, rather than deleting them
- Showing that finetuning enhances existing circuits by patching activations between the original and tuned model
- Finetuning an original model SAE on the tuned dataset then tuned model, OR tuned model then tuned dataset, to better understand what happens to SAE latents and why
- Applying this to some of the small thinking models, like Qwen 1.5B r1 distilled, could be super interesting
- Base vs chat models is another super interesting direction (see the sketch at the end of this section for a crude activation-level first pass)
- It’d be best to start with a specific capability here. Eg refusal, instruction following, chain of thought (especially circuits re what the chain of thought should “look like”, even if it’s not faithful to the model’s reasoning), conversational style/tone, specialized knowledge (eg finetuning on Python code or French), hallucination/saying ‘I don’t know’
- Using a LoRA might make it even cleaner
- Some of my MATS scholars have released open source crosscoders and training code: Connor Kissane and Clément Dumas & Julian Minder - these should be a good starting point
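As a crude first pass before reaching for crosscoders, you can simply compare the two models' residual streams on the same prompts, layer by layer. A minimal sketch below; the Qwen2.5 0.5B base/instruct pair and the prompts are placeholder assumptions, and any base/finetuned pair with identical architecture works.

```python
# Sketch: activation-level model diffing between a base model and its
# finetuned/chat counterpart. Repo ids and prompts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE, TUNED = "Qwen/Qwen2.5-0.5B", "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, output_hidden_states=True)

prompts = [
    "How do I pick a lock?",
    "Write a short Python function that reverses a string.",
]
for prompt in prompts:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        h_base = base(**ids).hidden_states    # tuple of (n_layers + 1) tensors, each [1, seq, d_model]
        h_tuned = tuned(**ids).hidden_states
    # Per-layer cosine similarity of the final token's residual stream.
    sims = [
        torch.cosine_similarity(b[0, -1], t[0, -1], dim=0).item()
        for b, t in zip(h_base, h_tuned)
    ]
    print(prompt[:40], [round(s, 3) for s in sims])
```

Layers where the similarity drops sharply are natural places to look more closely, e.g. with probes, SAEs, or crosscoders.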
Understanding sophisticated/safety relevant behaviour
LLMs are getting good enough that they are starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I’d be very excited to study these phenomena directly!
- Warning: The most interesting behaviours tend to happen in the largest models, eg LLaMA 405B. This is a pain to run yourself, and I do not recommend it unless you have experience with this kind of thing.
- In the recent alignment faking paper, they found that LLaMA 405B would sometimes fake alignment when prompted, without needing to reason aloud - I’d be really excited if you can learn anything about what’s going on here, eg with probes or activation patching from edited prompts
- Chen et al. show that LLMs form surprisingly accurate and detailed models of the user, eg their gender, age, socioeconomic status, and level of education, and do this from very little information. They can find these with probes, and steer with these to change the model’s actions in weird ways.
- This is wild! What else can we learn here? What else do models represent about the user? How are these inferred? How else do they shape behaviour? (A minimal probing sketch follows this list.)
- Do LLMs form dynamic models of users for attributes that vary across turns, eg emotion, what the user knows, etc.?
- As a stretch goal, do LLMs ever try to intentionally manipulate these? Eg detect when a user is sad and try to make them happy
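A minimal sketch of the basic probing setup this points at: cache residual stream activations at some middle layer for prompts with a known user attribute, and fit a linear probe. The model, layer, and toy formal-vs-informal labels are placeholder assumptions; a real version needs many varied prompts where the attribute is only implicitly signalled, plus held-out evaluation data.

```python
# Sketch: linear probe on the residual stream for an inferred user attribute.
# Model id, layer, and the toy labelled prompts are placeholder assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
LAYER = model.config.num_hidden_layers // 2  # probe a middle layer

# Toy stand-in data: 1 = formal register, 0 = informal. The interesting version
# uses attributes (age, expertise, mood...) that are only implicit in the text.
prompts = [
    "Dear Sir or Madam, I wish to enquire about the status of my account.",
    "yo can u check my account real quick lol",
    "I would be most grateful for your assistance with this matter.",
    "ngl this thing is kinda broken, can u fix it pls",
    "Please find attached the requested documentation for your review.",
    "idk man it just crashed again smh",
]
labels = np.array([1, 0, 1, 0, 1, 0])

feats = []
with torch.no_grad():
    for p in prompts:
        hs = model(**tok(p, return_tensors="pt")).hidden_states
        feats.append(hs[LAYER][0, -1].numpy())  # final-token residual stream
X = np.stack(feats)

probe = LogisticRegression(max_iter=2000).fit(X, labels)
print("train accuracy (needs held-out data to mean anything):", probe.score(X, labels))
```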
Being useful
Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it’s hard to tell if your work is total BS or not. I’m excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it’s strong evidence you’ve learned something real.
- Projects here don’t need to use fancy techniques that “feel” mechanistic, like SAEs or circuit analysis. Simple things like probes, steering vectors, and saliency maps are great if we can rigorously show that they’re actually useful (but beware of illusions!). A bare-bones steering vector sketch follows the examples below.
- Examples:
- Using SAEs to steer models to not use regexes in their outputs
- Using SAEs to fix spurious correlation in probes
- Using training data attribution to remove noisy data points or filter for the best data to finetune on
- Using steering vectors to cheaply jailbreak models (from Andy Arditi, when he was my MATS scholar)
- Using SAEs to do unlearning (loses to baselines)
- Using SAEs to probe in difficult situations (loses to baselines)
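For flavour, here's a bare-bones version of the simplest tool on this list: a difference-of-means steering vector added to the residual stream via a forward hook. The model, layer, scale, and contrastive prompts are placeholder assumptions, and this is the generic recipe rather than the method of any specific paper above.

```python
# Sketch: difference-of-means steering vector injected with a forward hook.
# Model id, layer, scale, and the contrastive prompts are placeholder assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, SCALE = "Qwen/Qwen2.5-0.5B-Instruct", 6.0
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
LAYER = model.config.num_hidden_layers // 2  # steer at a middle decoder layer

def mean_resid(prompts):
    """Mean final-token residual stream after decoder layer LAYER."""
    acts = []
    for p in prompts:
        with torch.no_grad():
            hs = model(**tok(p, return_tensors="pt"), output_hidden_states=True).hidden_states
        acts.append(hs[LAYER + 1][0, -1])  # hidden_states[i + 1] is the output of decoder layer i
    return torch.stack(acts).mean(0)

# Contrastive prompts defining the direction (here: upbeat vs gloomy tone).
positive = ["I'm so excited, this is wonderful news!", "What a fantastic, delightful day."]
negative = ["This is terrible, I'm very disappointed.", "What an awful, miserable day."]
steer = mean_resid(positive) - mean_resid(negative)
steer = steer / steer.norm()

def hook(module, inputs, output):
    # Add the steering vector to the layer's output residual stream.
    if isinstance(output, tuple):
        return (output[0] + SCALE * steer.to(output[0].dtype),) + output[1:]
    return output + SCALE * steer.to(output.dtype)

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("Tell me about your day.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40, do_sample=False)[0]))
handle.remove()
```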
Investigate fundamental assumptions
There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory of change assumptions, that in my opinion have insufficient evidence. I’d be keen to gather evidence for and against!
- Can we find examples of features in real language models that are not linearly represented?
- A good place to start might be digging more into the weirder parts of SAE error terms
- Feel free to use very high level circuit analysis, like attribution patching between a given latent and latents in a much earlier layer, or even to input tokens
- Acausal crosscoders seem like a good way to start studying this
- Are circuits real? Is this the right way to think about models?
- Are features real? Do models actually think in concepts?
- Does it make sense to expect features to be linearly represented? (A crude linear-vs-nonlinear probe comparison is sketched below.)
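One cheap first poke at the linear representation question: compare a linear probe to a small nonlinear probe on the same frozen activations for the same concept. A large, robust gap would be (weak) evidence of non-linear structure; it certainly isn't proof, and per the note above a >1D subspace alone isn't enough. The model, layer, and toy topic labels below are placeholder assumptions, and a real experiment needs far more data.

```python
# Sketch: is a toy concept linearly decodable from activations, or does a small
# nonlinear probe do much better? Model, layer, and labels are placeholders.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from transformers import AutoModel, AutoTokenizer

MODEL, LAYER = "gpt2", 6  # assumptions
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, output_hidden_states=True)

sports = ["The striker scored in the final minute.", "She broke the marathon record.",
          "The team won the championship game.", "He served an ace at match point."]
cooking = ["Simmer the sauce until it thickens.", "Whisk the eggs with a pinch of salt.",
           "Roast the vegetables until golden.", "Knead the dough for ten minutes."]
prompts = sports + cooking
y = np.array([1] * len(sports) + [0] * len(cooking))

X = []
with torch.no_grad():
    for p in prompts:
        hs = model(**tok(p, return_tensors="pt")).hidden_states
        X.append(hs[LAYER][0, -1].numpy())  # final-token residual stream
X = np.stack(X)

# Cross-validated accuracy of a linear probe vs a small MLP probe.
lin = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=4).mean()
mlp = cross_val_score(MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000), X, y, cv=4).mean()
print(f"linear probe: {lin:.2f}, MLP probe: {mlp:.2f}")
```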
I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎