LessWrong, January 2

A Collection of Frames about Language Models

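This article collects a series of frames about models that are empirically justified and reveal useful information about models. It divides the frames into representational and functional frames. Representational frames focus on the model's internal structure and analyze its subcomponents to understand how the model works, while functional frames focus on the model's relationships with external entities (such as the data distribution and learning objectives) to reveal properties of the model. The article also raises many open questions and invites suggestions for expanding the list. Its aim is to advance research on model interpretability and to serve as a useful reference for related fields.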

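💡 Transformer computation can be broken down into nearly-linear 'circuits', which explain how transformers compute simple bigrams / trigrams. Transformers near-universally contain 'induction heads' that detect / modulate repetitive sequences.

🧠 Transformers represent features in superposition as almost-orthogonal directions, of which there can be exponentially many. Features might actually be represented as a combination of different layers. If transformers represent both a and b, they can linearly represent "a XOR b".

⚙️ Language models are capable of 'introspection', i.e. they can predict things about themselves that more capable models cannot, which suggests they have access to 'privileged information' about themselves. Language models are capable of 'out-of-context reasoning', i.e. they can piece together many different facts they have been trained on in order to make inferences, also known as 'connecting the dots'.

🛡️ Language models are capable of 'strategic goal preservation', i.e. they can alter their responses during training time to prevent their goals from being changed via fine-tuning. Transformers are susceptible to jailbreaks because harmful and harmless prompts are easily distinguishable in the first few tokens; data augmentation can address this.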

Published on January 2, 2025 2:49 AM GMT

This post contains a collection of frames about models that are (i) empirically justified and (ii) seem to tell us something useful. In each case I've distilled the key idea down to 1-2 sentences and provided a link to the original source. I've also included open questions for which I am not aware of conclusive evidence.

I'm hoping that by doing this, I'll make some sort of progress towards "prosaic interpretability" (final name pending). In the event that I don't, having an encyclopedia like this seems useful regardless.

I'll broadly split the frames into representational and functional frames. Representational frames look 'inside' the model, at its subcomponents, in order to make claims about what the model is doing. Functional frames look 'outside' the model, at its relationships with other entities (e.g. data distribution, learning objectives, etc.), in order to make claims about the model.

---

This is intended to be a living document; I will update this in the future as I gather more frames. I strongly welcome all suggestions that could expand the list here!

Representational Frames

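Transformer computation can be broken down into nearly-linear 'circuits', which explain how transformers compute simple bigrams / trigrams.

Transformers near-universally contain 'induction heads' that detect / modulate repetitive sequences.

Transformers represent features in superposition as almost-orthogonal directions, of which there can be exponentially many.

Features might actually be represented as a combination of different layers.

Transformers can linearly represent "a XOR b" if they represent both a and b.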

'redundancy' / 'coverage' of features in the data

can compute boolean circuits in superposition

artefacts of the training process

register tokens

multi-token embedding

compute a bunch of random features in the early layers, sort out what's useful in the middle layers, then actually solve tasks in the late layers

"adversarial examples" look like natural examples

'belief states' in a fractal geometry

bag of heuristics

diverting model computation away from the 'basin' of misalignment-inducing neurons

linear separation between 'harmful' and 'harmless' contexts

refusal is well-represented linearly
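The 'features in superposition' frame above can be made concrete with a toy example. The sketch below is not taken from any of the linked sources; it assumes features are random unit vectors, and the names `random_unit_vector`, `encode` and `decode`, the sizes, and the 0.5 threshold are all illustrative choices. It stores more features than there are dimensions and reads a sparse set of active features back out with plain dot products, which works because random directions in high dimensions are nearly orthogonal.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a direction uniformly at random on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def encode(active, features):
    """Superpose the active features by summing their directions."""
    dim = len(features[0])
    x = [0.0] * dim
    for i in active:
        for k in range(dim):
            x[k] += features[i][k]
    return x


def decode(x, features, threshold=0.5):
    """Linear readout: a feature counts as active if its dot product
    with the activation vector exceeds the (illustrative) threshold."""
    return {
        i
        for i, f in enumerate(features)
        if sum(a * b for a, b in zip(x, f)) > threshold
    }


# Toy setup (assumed sizes): 600 features packed into 512 dimensions.
rng = random.Random(0)
features = [random_unit_vector(512, rng) for _ in range(600)]

active = {3, 42, 250}
recovered = decode(encode(active, features), features)
```

With only a few features active at once, the interference between directions stays well below the threshold, so the readout should recover the active set. Activating many features simultaneously would be expected to degrade the readout, which is the usual sparsity caveat attached to this frame.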

---

(TODO think of some open questions which would directly indicate good frames)

Functional Frames

Frames

knee-jerk responses, persona simulations, and general world simulations

represent 'personas' in ways that make 'anti-personas' more likely to emerge

personas might yield useful information for determining other properties such as truthfulness

must

this could involve solving very hard subproblems

can give calibrated estimates of their ability to answer questions

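Language models are capable of 'out-of-context reasoning', i.e. they can piece together many different facts they have been trained on in order to make inferences, a.k.a. 'connecting the dots'.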

implicit meta-learning

Language models are capable of 'strategic goal preservation', i.e. they can alter their responses during training time to prevent their goals from being changed via fine-tuning.
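One frame above says that models can give calibrated estimates of their ability to answer questions. A minimal sketch of how such a claim might be checked follows, assuming we already have the model's stated confidence for each question and whether its answer was correct. Expected calibration error (ECE) is a standard metric chosen here for illustration; the source does not say which measure the underlying work used, and the function name and bin count are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], the model's stated chance of being right.
    correct:     booleans, whether each answer was actually right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# A model that says 75% and is right 3 times out of 4 is perfectly calibrated.
well_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])

# A model that says 100% but is right only half the time is overconfident.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
```

Here `well_calibrated` evaluates to 0.0 and `overconfident` to 0.5. A lower value means the stated confidences track actual accuracy more closely.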

---

Changelog

2 Jan: Initial post
