LessWrong · July 16, 2024
Mech Interp Lacks Good Paradigms


Published on July 16, 2024 3:47 PM GMT

Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free! 

Note 2: I originally sent this link to some people for feedback, but I'm having trouble viewing the comments on the draft. The post is also in a reasonably complete state, so I've decided to just publish it. If you're one of those people, feedback is still very much welcome in the actual comments section! 

Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and it facilitates rapid learning thanks to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this happens to the extent that some people feel MI is over-subscribed relative to other technical agendas. However, how useful is this MI research?  

A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret. 

In this post, I'll critically examine several leading candidates for "paradigms" in MI, considering the available evidence for / against, and identifying important flaws or holes which point to good future research (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined. 

Towards a Grand Unifying Theory (GUT) with MI

Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned. 

Some people who have espoused this opinion: 

As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors (a.k.a. representation engineering) and circuit breakers, which were clearly inspired by the wealth of MI work demonstrating the linear representation hypothesis. 
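To make this concrete, here is a minimal numpy sketch of the idea behind activation steering, using the common difference-of-means recipe: compute a "concept direction" from contrastive activations, then add it to a hidden state at inference time. All names and numbers here are illustrative stand-ins, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-in activations: mean hidden states on "positive" vs "negative" prompts.
pos_acts = rng.normal(0.0, 1.0, size=(8, d_model)) + 2.0  # shifted cluster
neg_acts = rng.normal(0.0, 1.0, size=(8, d_model))

# The steering vector is the difference of class means (the contrastive recipe).
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden_state: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add a scaled concept direction to a residual-stream activation."""
    return hidden_state + alpha * vec

h = rng.normal(size=d_model)
h_steered = steer(h, steering_vec, alpha=2.0)

# Steering moves the activation along the "positive" direction by exactly
# alpha times the vector's norm.
unit = steering_vec / np.linalg.norm(steering_vec)
print(h_steered @ unit - h @ unit)
```

Note that this only works at all because the concept is (by the LRH) encoded as a direction; that is the sense in which the technique was "obviously inspired" by the hypothesis.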

It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated. 

(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here. 

A GUT Needs Paradigms

In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, and biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for thinking about a field. Kuhn argues that the establishment of accepted paradigms is a sign of "maturity" in the development of a field. 

Paradigms Are Instrumental for Progress

In the absence of a paradigm, it's very hard to draw the right conclusions from data, for two reasons. 

Multiple hypotheses could explain the data. Kuhn argues that, in the absence of a paradigm, a reasonable researcher might reach "any one of a number of incompatible conclusions". For example, we might incorporate variables with no actual predictive power into our explanation, like an ancient guru using the motions of stars to predict the future. The variables we choose to use are a function of our prior experience in other fields and of "accidents" in the process of our investigation. 

We may not have sufficiently good mental abstractions to understand what we're seeing. 

A famous thought experiment in neuroscience considers what results popular interpretability techniques would yield on microprocessors, which are an example of a complex information-processing system that we understand at all levels. Techniques such as "lesion experiments" (a.k.a. activation patching) completely fail to elucidate the underlying structure - without a more advanced interpretation, MI experiments aren't very useful. 
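For readers unfamiliar with the technique: activation patching itself is mechanically simple, and the thought experiment's point is precisely that the mechanics alone don't produce understanding. A toy numpy sketch, with a two-layer MLP standing in for a model (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 1))

def forward(x, patched_hidden=None):
    """Tiny 2-layer MLP; optionally overwrite the hidden layer (the 'patch')."""
    h = np.maximum(0.0, x @ W1)   # ReLU hidden layer
    if patched_hidden is not None:
        h = patched_hidden        # lesion/patch: swap in a cached activation
    return (h @ W2).item()

clean_x = np.array([1.0, 0.5, -0.3, 0.8])
corrupt_x = np.array([-1.0, 0.2, 0.9, -0.5])

clean_h = np.maximum(0.0, clean_x @ W1)   # cache the clean hidden activation
clean_out = forward(clean_x)
corrupt_out = forward(corrupt_x)
patched_out = forward(corrupt_x, patched_hidden=clean_h)

# If the patched component carries all task-relevant information, patching
# restores the clean output (exactly true here: there is only one path).
print(np.isclose(patched_out, clean_out))
```

In a real network one measures what fraction of the clean behaviour is restored; the microprocessor result is that such measurements, by themselves, don't tell you what the circuit is doing.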

As another example of how an absence of paradigms leads to illusions, consider Lucius Bushnaq's thought experiment on interpreting a hypothetical feature tracking the entropy of a physical system:

It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.

In a pre-paradigmatic field, Kuhn argues that all research amounts to no more than a "fact-gathering" exercise conducted mostly at random. The value of a paradigm is in forcing concerted and systematic study of a common set of phenomena in a standardised way, facilitating subsequent progress in the field. Furthermore, by concentrating attention on "mysteries" which the paradigm fails to explain, a paradigm sets up subsequent work to find the next paradigm. 

Three Desiderata for Paradigms

In my view, which differs slightly from Kuhn's treatment, a paradigm has three important properties.

1. A paradigm has good epistemics, i.e. we believe it to be true because it explains existing data very well, or because it is strongly implied by other things we assume to be true.
2. A paradigm is general, i.e. it applies to many seemingly-distinct cases with a high degree of accuracy. For example, Newton's law of gravitation explains both the motion of celestial bodies and the trajectories of thrown objects.
3. A paradigm is open-ended, i.e. it can easily be used as a building block for future work. For example, the results associated with a paradigm may result in useful practical applications. (Note: this necessitates a minimum degree of rigour.) Alternatively, attempting to validate the premises associated with a paradigm might point the way to very informative experiments.

In the subsequent discussion, I will continually return to these criteria for evaluating subsequent paradigms (or "precursor" paradigms). 

Examining Paradigms in Mechanistic Interpretability

Our most "paradigm-y" things at the moment include:

TL;DR: my view is that these all fall short in some way. The first actually is a great paradigm, just insufficient for what we want. The latter two are not sufficiently rigorous to serve as building blocks for theory.  

A Mathematical Framework for Transformer Circuits 

The mathematical framework blogpost, published by Anthropic in 2021, is a seminal example of what I consider to be a great paradigm in MI, perhaps my favourite, that has pushed the field forward a lot.

An overview:

Epistemics: 5/5 

Generality: 5/5

Open-Endedness: 5/5

Note there are still some minor problems re: epistemics. 

However, I think these holes in the theory mostly amount to "book-keeping" and can be resolved with some technical care. I also think these holes don't matter much in practice, and the wealth of successful work on analyzing transformer circuits is significant evidence in favour of this point.  

Instead, my main criticism is that "A Mathematical Framework" is not high-level enough. As excellent as it is, this paradigm feels like a level of abstraction "below" what we want. 

Concrete analogy: answering biological questions with chemistry. If we think of a biological organism (a.k.a. a model), "Mathematical Framework" is like a complete description of its chemistry (a.k.a. its circuitry). The latter is definitely foundational and informationally complete w.r.t. the former. At the same time, it is totally insufficient for answering higher-level questions like "does it have goals?" 
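To illustrate the "chemistry level" the framework operates at: its core observation is that, because attention heads read from and write to the residual stream linearly, the output decomposes exactly into a direct path plus one term per head. A numpy sketch of a one-layer attention-only model with a fixed attention pattern (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tok, d_model, n_heads, d_head = 3, 8, 2, 4

x = rng.normal(size=(n_tok, d_model))        # residual stream
W_U = rng.normal(size=(d_model, 10))         # unembedding

heads = []
for _ in range(n_heads):
    W_V = rng.normal(size=(d_model, d_head))
    W_O = rng.normal(size=(d_head, d_model))
    A = np.full((n_tok, n_tok), 1.0 / n_tok)  # fixed (frozen) attention pattern
    heads.append(A @ x @ W_V @ W_O)           # this head's OV-path output

# Full model output: direct path plus the sum of per-head contributions.
resid_out = x + sum(heads)
logits = resid_out @ W_U

# The decomposition is exact: logits split linearly into path terms.
direct_logits = x @ W_U
head_logits = [h @ W_U for h in heads]
print(np.allclose(logits, direct_logits + sum(head_logits)))
```

The decomposition is exact only once the attention pattern is held fixed, which is precisely the framework's move of separating the (nonlinear) QK circuit from the (linear) OV circuit.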

The Linear Representation Hypothesis

An overview of the LRH. 

Epistemics: 3/5

Generality: 2/5

Open-Endedness: 2/5

My biggest problem with LRH is that it's very difficult to make meaningful predictions a priori with it. 
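For concreteness, here is what the LRH claims operationally, as a numpy sketch with synthetic data. The data is generated assuming the LRH (a binary concept written along a single direction, plus isotropic noise), so this illustrates the claim rather than providing evidence for it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200

# Synthetic "hidden states": concept written along one direction + noise.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
labels = np.array([0] * (n // 2) + [1] * (n // 2))
acts = rng.normal(size=(n, d)) + 4.0 * np.outer(labels, concept_dir)

# A difference-of-means probe recovers the concept direction...
probe = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
probe /= np.linalg.norm(probe)
cosine = abs(probe @ concept_dir)

# ...and a threshold on the projection separates the concept linearly.
scores = acts @ probe
preds = (scores > scores.mean()).astype(int)
accuracy = (preds == labels).mean()
print(cosine, accuracy)
```

The a-priori-prediction problem is visible even here: the hypothesis tells you a direction exists, but not which concepts get directions, at what layer, or at what scale, so probes like this are found post hoc.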

The Superposition Hypothesis

An overview of superposition. 

I think the assumptions here are quite reasonable, which facilitates a high generality score. However, the epistemics could be better. 

Epistemics: 3/5

The largest body of evidence for superposition so far is that we consistently observe polysemantic neurons in models, across all sizes, architectures, and domains. Polysemanticity itself is directly observable; superposition is one inferred explanation for it. 

Secondly, superposition has also been demonstrated in toy models. However, it's unclear what disanalogies there are between superposition in toy models and superposition in real LLMs. For example, the assumptions about the data may not hold, or maybe two-layer ReLU networks are qualitatively quite different from much deeper transformer architectures. 
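The basic geometry those toy models demonstrate can be sketched in a few lines of numpy: with more features than dimensions, random nearly-orthogonal directions plus a thresholded readout recover sparse feature activations with only small interference. This is an illustration of the mechanism, not a reproduction of the trained toy models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 200, 100   # more features than dimensions

# Each feature gets a random unit direction. With n_features > d_model they
# cannot be orthogonal, but random directions are *nearly* orthogonal
# (overlaps ~ 1/sqrt(d_model)), which is what makes superposition workable.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse input: only a few features active at once (the key assumption).
active = [3, 17, 42]
x = np.zeros(n_features)
x[active] = 1.0

h = x @ W                               # compress into d_model dimensions
x_hat = np.maximum(0.0, h @ W.T - 0.2)  # ReLU + bias filters interference

recovered = np.argsort(x_hat)[-3:]      # the active features come out on top
print(sorted(recovered))
```

The disanalogy worry in the text applies here too: this sketch bakes in the sparsity and independence assumptions about the data, which may not hold for real LLM activations.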

The last main evidence in favour of superposition is that interp approaches inspired by superposition, i.e. sparse autoencoders, have seemed to work really well. However, this evidence is also not very solid. 

As a specific alternative hypothesis on what SAEs are doing, a common argument is that SAEs simply cluster the data, and interpretability comes from having tighter clusters. 
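For reference, the SAE itself is a very simple object; the contested part is whether its latents are "features" or mere clusters. A minimal numpy sketch of the forward pass and training objective, with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64   # overcomplete: many more latents than dimensions

# Randomly initialised SAE weights -- shapes only, not learned features;
# training fits these to a large set of real model activations.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, size=(d_sae, d_model))

def sae_forward(act):
    """One SAE pass: ReLU-sparse codes, then a linear decode."""
    codes = np.maximum(0.0, act @ W_enc + b_enc)
    recon = codes @ W_dec
    return codes, recon

act = rng.normal(size=d_model)
codes, recon = sae_forward(act)

# Training minimises reconstruction error plus an L1 sparsity penalty:
l1_coeff = 1e-3
loss = np.sum((recon - act) ** 2) + l1_coeff * np.abs(codes).sum()
# Even untrained, the ReLU zeroes out roughly half the codes.
print((codes == 0).mean(), loss)
```

Nothing in this objective forces the learned dictionary elements to be causally meaningful directions rather than cluster centroids, which is exactly why the clustering alternative is hard to rule out.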

Generality: 4/5

Open-Endedness: 5/5 

Assuming superposition as a concept turns out to be basically correct, it illuminates a lot of useful follow-up work. 

A point which I have not seen made elsewhere: I think we have not yet fully elucidated the "kinds" of superposition that can occur in models. Arbitrarily complex things can be in superposition, and features could just be the tip of the iceberg. 

Overall, I am actually quite confident that superposition is essentially correct. That is why I'm currently working on circuit analysis using SAEs. But I think there's insufficient evidence at the moment to reject leading alternative hypotheses and cement its status as a paradigm. 

Other Bodies of Theory

Note: I am considerably less familiar with these other bodies of theory than I am with the preceding three, so there may be errors or inaccuracies here. Please feel free to point those out as necessary.

There are some other bodies of theory which currently don't make the top cut, but which I think are promising nonetheless as things which could yield paradigms, given time.

Singular Learning Theory

Ratings are offered here, but they should be taken with a very large pinch of salt because I am not very familiar with SLT.

Overview: 

Epistemics: 5/5

Generality: 3/5

In principle, because SLT covers everything within the class of "singular" models, its claims are extremely general - even more so than "A Mathematical Framework", which applies only to neural nets, and more specifically only to transformer architectures. 

However, I'm not confident that SLT results are general, for the following reasons:

Open-Endedness: 3/5

On the whole, I think there is a lot of untapped potential here for SLT to be a paradigm, but this potential is quite far from being fully realized at the moment due to both issues with communicating the foundations of SLT to the broader MI community and a lack of "killer applications". 

Computational Mechanics

No rating is offered here because I haven't engaged sufficiently with the material. I'm including Comp Mech mostly for completeness.

The seminal work I am aware of shows that transformers model belief states using simplices, a specific representational geometry. Brief comments here: 

I don't have great takes on what Comp Mech aims to do as a field, and in any case it hasn't made a significant impact (yet) on the MI literature. I'll revisit this in the future if it becomes relevant.

The Polytope Hypothesis

This is the idea that the correct atoms of models' feature geometry are "polytopes". This is a very nascent trend I observe in some recent papers [Kiho Park et al, Adam Shai et al, Polytope Lens, and circular features in day-of-the-week math]. I intend to write a more complete note about this in follow-up work. 

Distilling A Technical Research Agenda

Note: Here, I summarize technical research items proposed in previous sections, which I think would be exciting.  I've omitted the "other bodies of theory" for now because I think my takes will not be very good. 

Generally: More killer applications of existing paradigms. 

On "Mathematical Framework": 

On the LRH: 

On superposition: 

Conclusion 

In summary, I think it's important to critically evaluate whether MI has succeeded in delivering general paradigms with high explanatory power over nontrivial phenomena. My take on this so far is that we have a few separate attempts but all of these are lacking at the moment. That's alright, since incomplete paradigms are still useful, and this highlights good avenues for future research. 

If you made it this far, thanks for reading my rather rambly thoughts. Feedback is appreciated! 


