LessWrong, August 10, 2024
Emergence, The Blind Spot of GenAI Interpretability?

 

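The article examines the importance of emergence in AI. It argues that AI systems must be studied as wholes in order to understand their behavior, that current LLM interpretability research gives too little weight to emergence, and that a holistic, interconnected, multi-level approach is needed.

🌊 Emergence illustrated with water: a single H₂O molecule is simple, but vast numbers of molecules together produce water with macroscopic properties such as flow. Even if water's behavior can be simulated perfectly, predicting a new scenario still requires running the full simulation, which shows why the system must be studied as a whole.

🤖 Emergence in AI: the view that LLM behaviors, cognition, and capabilities are not emergent properties of auto-regressive deep neural networks and their training may be mistaken. Emergent phenomena will become more pronounced as these systems grow more complex.

📚 The problem with LLM interpretability research: emergent behaviors are often discussed in the LLM community but have had little impact on actual interpretability research, which still concentrates on smaller, easier-to-understand systems. Little interpretability work has been done on Mixture of Experts systems, which show interesting emergent behaviors.

📖 A call for a holistic approach: avoid reductionism and take a holistic, interconnected, multi-level approach to understanding LLMs, with research at every level of abstraction and attention to the synergies between levels.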

Published on August 10, 2024 10:07 AM GMT

Epistemic status: This post was planned to be part of a broader "Holistic Interpretability" post, but that isn't going as fast as I'd like, so I am releasing the foreword to get some early feedback on whether I should pursue this or not. I haven't had a lot of red teaming on this emergence idea, but I'd still put an 80% confidence on it being important. I also tried to use the word emergence only where it made sense, but I put a 40% chance on having used it a bit wrong.

Emergence: More Than Just a Buzzword?

When I think about large auto-regressive models, I can't help but return to the concept of emergence. Now, I know what you're probably thinking - "emergence" has become something of a fuzzy buzzword in AI discussions, and I mostly agree with the critics, but I think there is more to unpack here.

Let's start with a classic example: water. A single H2O molecule is just a simple arrangement of atoms, but combine gazillions of these molecules, and suddenly you have a substance that flows, splashes, and takes the shape of its container. While individual molecules have properties like polarity and can form hydrogen bonds, the macroscopic properties we associate with liquids only become apparent when vast numbers of these molecules interact.

Now, you might argue, "But we can simulate water behavior perfectly if we know all the properties of individual molecules!" And you'd probably be right. However, even with a perfect simulation, to predict how water will behave in a new scenario, you'd still need to run the entire simulation. You can't shortcut it by just looking at a few molecules or an over-simplified model.

This illustrates why I believe emergence matters when discussing AIs. It's not about mystical properties appearing out of nowhere. Rather, it's about the necessity of studying the system as a whole to fully understand its behavior, even with complete knowledge of its components.
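As a minimal sketch of the water-simulation argument, here is a toy system that is not from the original discussion: an elementary cellular automaton (Rule 30), chosen as an assumed stand-in for "simple components, complex whole". The update rule for a single cell fits in a few lines, yet as far as I know there is no general shortcut for the pattern at step N other than running all N steps. The function names `step` and `run` are illustrative only.

```python
def step(cells, rule=30):
    """Apply one update of an elementary cellular automaton (wraparound edges)."""
    n = len(cells)
    out = []
    for i in range(n):
        left, mid, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The three-cell neighbourhood selects one bit of the rule number.
        idx = (left << 2) | (mid << 1) | right
        out.append((rule >> idx) & 1)
    return out


def run(width=31, steps=15, rule=30):
    """Start from a single live cell and return every generation."""
    cells = [0] * width
    cells[width // 2] = 1
    history = [cells]
    for _ in range(steps):
        cells = step(cells, rule)
        history.append(cells)
    return history


if __name__ == "__main__":
    for row in run():
        print("".join("#" if c else "." for c in row))
```

Each cell only ever looks at its two neighbours, so the "component" is fully understood. The global pattern is still something you find out by running the whole system, which is the sense of emergence used here.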

When I look at current efforts to interpret LLMs[1], I often have the impression that they bet on the idea that LLM behaviors, cognition, and capabilities are NOT emergent properties of auto-regressive deep neural networks and their training. There seems to be an implicit belief that everything could be worked out just by being good enough at understanding all the internal circuitry and components.

The default trajectory?

This approach seems to mirror a common scientific reaction to studying complex entities: stripping them down to their constituent pieces, and trying to piece them back together using logical, sound (even rational) reasoning.

In the last few months, I've read about the trajectory of other sciences (psychology, economics, biology, physics, medicine, ...), and talked with practitioners, before finally stumbling on complexity science (I warmly recommend the book "Complexity: The Science at the Edge of Order and Chaos"). After having read about it, I felt like I was finally finding the words I had trouble piecing together in my head since ChatGPT-4 came out. "Complexity, emergent phenomena! That's it!" I thought:

Emergent phenomena are everywhere around us: weather patterns emerge from countless atmospheric interactions, stock markets fluctuate based on the behavior of millions of individuals, ant colonies exhibit complex social structures that no single ant could comprehend, and even the intricate beauty of snowflakes arises from simple molecular processes. Given this ubiquity of emergence in almost every aspect of reality, why would we assume that training billions of small matrix multiplications to output coherent text wouldn't lead to similar emergent phenomena? To expect otherwise seems, dare I say, a bit naïve, and will become even more so as those systems become more and more complex.

LLM Interpretability

Ironically, while "emergent behaviors" are frequently discussed in the broader LLM community, this concept seems to have had little impact on actual interpretability research, and the term is often used without truly grasping its implications. The issue seems to have been postponed or overlooked, while research focuses on studying smaller, more "stochastic parrot"-like, easier-to-understand systems[2]. We're still largely focused on dissecting individual circuits, attention heads, or directions in the residual stream, as if understanding each gear in isolation will reveal the full complexity of the clock.

One striking example for me is that I've seen very little (to no) interpretability work done on Mixture of Experts systems[3], even though they seem to me to be the models with the most interesting emergent behaviors and properties.

A Call for a Holistic Approach

To fully embrace and accept the complexity of auto-regressive models, we should not just avoid reductionism, but actively pursue a holistic, interconnected, multi-level approach to understanding LLMs.

This approach should involve studying LLMs at various levels of abstraction, from individual components studied in isolation in toy models, to documenting and analyzing behaviors discovered "in the wild"[4]. Most importantly, it should focus on building synergies between these levels, much like the productive interplay between neuroscience and psychology in studying human cognition[5].

P.S. I'm currently working on a more extensive framework for holistic interpretability, which will explore various approaches to interpreting AIs and propose ways to build bridges between them, drawing inspiration from other scientific fields. If you are interested in participating (or just discussing your unique viewpoint), please reach out to me. I'd be particularly interested if you have ideas for, or are working on, interpretability approaches that you haven't seen elsewhere yet[6].

  1. ^

    And, on a broader scope, probably AI interp in general, but I am a bit less knowledgeable there.

  2. ^

    Anthropic’s later work on applying dictionary learning to Claude 3 attempts to tackle this. However, I feel like their approach mainly identified where the different concepts they searched for become readable (with human concepts) rather than uncovering Claude 3-specific concepts (that I would guess are very alien to ours).

  3. ^

    And this even seems to be the case for a lot of behavioral studies.

  4. ^

    Most of the examples in this post (and many more) were discovered by open-ended interaction with LLMs.

  5. ^

    For example, Elizabeth Loftus' studies on the malleability of human memory (Loftus et al., 1996) laid the groundwork for neuroscientists like Yoko Okado and Craig E.L. Stark to investigate the neural underpinnings of these phenomena (Okado et al., 2005). Similarly, the discovery of mirror neurons by Giacomo Rizzolatti’s team (Rizzolatti et al., 1996) provided a neurological basis that psychologists like Niedenthal and her team leveraged to explore emotional embodiment (Niedenthal et al., 2009).

  6. ^

    An example of such an approach could be what I would call "Glitch Interp". This method is somewhat analogous to lesion studies in humans but involves finding or intentionally introducing glitches into AI systems. Some notable works that could be considered precursors to this approach are SolidGoldMagikarp, research on universal and transferable adversarial attacks, or, on another type of DNN, patch attacks. While these studies weren't primarily aimed at interpretability, I think that pursuing an interpretability agenda using similar techniques could yield valuable insights into AI systems' inner workings.
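To make the lesion-study analogy in footnote 6 a bit more concrete, here is a minimal sketch of the simplest kind of "glitch" one could introduce: zeroing out one hidden unit at a time in a toy network and measuring how much the output moves. Everything here is an assumption for illustration: the weights `W1` and `W2` are made up by hand, and `forward` and `lesion_effects` are hypothetical helper names, not an established method or library API.

```python
# Hypothetical toy network: 2 inputs -> 3 ReLU hidden units -> 1 output.
# The weights are hand-picked for illustration, not trained.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
W2 = [1.0, -2.0, 1.0]


def forward(x, lesioned=None):
    """Run the toy network, optionally forcing one hidden unit to zero."""
    hidden = []
    for j, row in enumerate(W1):
        pre = sum(w * xi for w, xi in zip(row, x))
        act = max(0.0, pre)  # ReLU
        if j == lesioned:
            act = 0.0  # the "lesion": silence this unit
        hidden.append(act)
    return sum(w * h for w, h in zip(W2, hidden))


def lesion_effects(inputs):
    """Mean absolute output change caused by silencing each hidden unit."""
    effects = []
    for j in range(len(W1)):
        total = sum(abs(forward(x) - forward(x, lesioned=j)) for x in inputs)
        effects.append(total / len(inputs))
    return effects


if __name__ == "__main__":
    probes = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for unit, effect in enumerate(lesion_effects(probes)):
        print(f"hidden unit {unit}: mean output change {effect:.3f}")
```

On a real model the interesting part would be the behavioral side: which capabilities degrade, and how, when a given part is glitched. This sketch only shows the bookkeeping.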


