Published on May 20, 2025 5:09 AM GMT
I recently read a post by Dan Hendrycks (director of the Center for AI Safety and an advisor to xAI) criticizing Anthropic's focus on sparse autoencoders (SAEs) as a tool for mechanistic interpretability.
You can find that post HERE; some salient quotes are below.
On SAEs:
Another technique initially hailed as a breakthrough is sparse autoencoders (SAEs) [...] getting SAEs to work reliably has proven challenging, possibly because some concepts are distributed too widely across the network, or perhaps because the model’s operations are not founded on a neat set of human-understandable concepts at all. This is the field that DeepMind recently deprioritized, noting that their SAE research had yielded disappointing results. In fact, given the task of detecting harmful intent in user inputs, SAEs underperformed a simple baseline.
[...] despite substantial efforts over the past decade, the returns from interpretability have been roughly nonexistent. To avoid overinvesting in ideas that are unlikely to work, potentially to the neglect of more effective ones, we should be more skeptical in the future about heavily prioritizing mechanistic interpretability over other types of AI research.
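For context on what is being criticized: an SAE here is a small auxiliary network trained to reconstruct a model's internal activations through a sparse, overcomplete set of features, each of which is hoped to correspond to an interpretable concept. Below is a minimal sketch in PyTorch; the class name, dimensions, and penalty coefficient are illustrative assumptions, not taken from any particular implementation.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: reconstruct model activations through a sparse, overcomplete code."""

    def __init__(self, d_model: int = 768, d_features: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 penalty below
        # pushes most of them to zero, which is what makes the code "sparse".
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Trade off reconstruction fidelity against sparsity of the learned features.
    mse = torch.mean((reconstruction - activations) ** 2)
    sparsity = torch.mean(torch.abs(features))
    return mse + l1_coeff * sparsity
```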
On RepE:
Representation engineering (RepE) is a promising emerging field that takes this view. Focusing on representations as the primary units of analysis – as opposed to neurons or circuits – this area finds meaning in the patterns of activity across many neurons.
A strong argument for this approach is the fact that models often largely retain the same overall behaviors even if entire layers of their structure are removed. As well as demonstrating their remarkable flexibility and adaptability – not unlike that of the brain – this indicates that the individual components in isolation offer far fewer insights than the organization between them. In fact, because of emergence, analyzing complex systems at a higher level is often enough to understand or predict their behavior, while detailed lower-level inspection can be unnecessary or even misleading.
RepE can identify, amplify, and suppress characteristics. RepE helps manipulate model internals to control them and make them safer. Since the original RepE paper, we have used RepE to make models unlearn dual-use concepts, be more honest, be more robust to adversarial attacks, edit AIs’ values, and more.
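To make the quoted claims about identifying, amplifying, and suppressing characteristics more concrete, here is a minimal sketch of one common RepE-style recipe: take mean activations over contrasting prompt sets, treat their difference as a direction for the concept, and add a scaled copy of that direction to the residual stream at inference time. The model name, layer index, prompts, and coefficient are illustrative assumptions on my part, not details from the RepE paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choices only: any decoder-only HF model and mid-depth layer will do.
MODEL_NAME = "gpt2"
LAYER = 6

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def mean_activation(prompts, layer=LAYER):
    """Average hidden state at the final token position over a set of prompts."""
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).hidden_states[layer]  # shape (1, seq, d_model)
        acts.append(hidden[0, -1])
    return torch.stack(acts).mean(dim=0)


# Contrasting prompt sets define the concept; the difference of their mean
# activations is the steering direction ("honesty" here is just an example).
honest = ["Answer truthfully: what is the capital of France?"]
dishonest = ["Answer deceptively: what is the capital of France?"]
direction = mean_activation(honest) - mean_activation(dishonest)
direction = direction / direction.norm()


def add_direction(module, inputs, output, coeff=4.0):
    # Forward hook: shift the residual stream along the concept direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.transformer.h[LAYER].register_forward_hook(add_direction)
ids = tokenizer("The capital of France is", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=10)[0]))
handle.remove()
```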
I'm not sure why the article takes the framing that SAEs and RepE cannot co-exist as safety methods. On the most charitable interpretation of his point, I think Dan is arguing that investment should focus on RepE over SAEs.
However, from my perspective there have been some encouraging results with SAEs. Golden Gate Claude, or even just the ability to "clamp" and/or "upweight" features, as we saw in the Dallas example from Anthropic's "On the Biology of a Large Language Model" (a rough sketch of what clamping means is below), would both seem to indicate that SAE features are on the right track toward interpretability for at least some concepts. Ultimately, I don't see why RepE and SAEs can't both be valuable tools.
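For concreteness, "clamping" an SAE feature typically means encoding a layer's activations with the SAE, pinning one learned feature to a fixed value regardless of the input, decoding, and splicing the result back into the forward pass. Here is a hypothetical sketch reusing the toy SparseAutoencoder above; the feature index and clamp value are made up.

```python
import torch


def clamp_feature(sae, activations, feature_idx: int, value: float = 10.0):
    """Encode activations, pin one SAE feature to a fixed value, decode back.

    This is the rough shape of interventions like Golden Gate Claude, where a
    single learned feature is held at a high activation during generation.
    """
    features = torch.relu(sae.encoder(activations))
    features[..., feature_idx] = value  # clamp: fix the feature regardless of input
    return sae.decoder(features)        # steered activations to splice back in


# Hypothetical usage inside a forward hook on one transformer layer:
# def hook(module, inputs, output):
#     hidden = output[0] if isinstance(output, tuple) else output
#     steered = clamp_feature(sae, hidden, feature_idx=1234, value=10.0)
#     return (steered,) + output[1:] if isinstance(output, tuple) else steered
```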
I would like to hear what other people think about this criticism. Is it valid, meaning focus should shift away from SAEs, or is there enough "there there" that they remain an avenue worth pursuing?