少点错误 (LessWrong) · February 3
Exploring how OthelloGPT computes its world model

 


Published on February 2, 2025 9:29 PM GMT

I completed this project for my bachelor's thesis and am now writing it up 2-3 months later. I think I found some interesting results that are worth sharing here. This post might be especially interesting for people who try to reverse-engineer OthelloGPT in the future.

Summary

What’s OthelloGPT

Small findings / Prerequisites

The other sections will build on top of this section

Mine-Heads and Yours-Heads

Average attention paid to positions an even number of steps away minus attention paid to positions an odd number of steps away, for each Attention Head and layer. Mine Heads are shown in blue, and Yours Heads in red. "Last," "First," and other types of heads are also visible; L4H5, for example, is a "Last" Head.

Most Attention Heads are Mine- or Yours-Heads
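As a sketch of the metric behind this figure (the tensor shapes and the choice to count distance 0 as "even" are my assumptions, not taken from the post), the even-minus-odd attention score could be computed as:

```python
import numpy as np

def mine_minus_yours_score(attn):
    """Even-minus-odd attention score for a single head.

    attn: (seq, seq) causal attention pattern, rows = query positions.
    For each query position q, average the attention paid to key
    positions an even number of steps away (the same player's moves,
    distance 0 included) and subtract the average attention paid to
    positions an odd number of steps away (the opponent's moves).
    A strongly positive score suggests a Mine-Head, a strongly
    negative one a Yours-Head.
    """
    seq = attn.shape[0]
    diffs = []
    for q in range(1, seq):          # q = 0 has no odd-distance keys
        keys = attn[q, : q + 1]      # only non-masked key positions
        dists = q - np.arange(q + 1)
        even = keys[dists % 2 == 0].mean()
        odd = keys[dists % 2 == 1].mean()
        diffs.append(even - odd)
    return float(np.mean(diffs))
```

Averaging this score over a batch of games and plotting it per head and layer would produce the kind of blue/red map described in the caption.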

Attention Patterns are almost constant (across inputs)

Attention Pattern for Layer 3, Head 0, Position 2

Visualising the board state

Accuracy over every Layer and sequence position

Accuracy of the Linear Probe Across Layers and Sequence Positions (ignoring empty tiles)
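The probe itself is just a linear map from the residual stream to a per-tile board state. A minimal sketch (plain logistic regression in NumPy; the training details of the post's actual probe are not reproduced here):

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe that predicts a binary tile
    state (e.g. "mine" vs "yours") from residual-stream activations.

    acts: (n_samples, d_model) activations at one layer/position.
    labels: (n_samples,) in {0, 1}. Returns (weights, bias).
    """
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=acts.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))   # sigmoid
        w -= lr * acts.T @ (p - labels) / len(labels)
        b -= lr * (p - labels).mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of samples where the probe's sign matches the label."""
    return float(((acts @ w + b > 0) == labels).mean())
```

Evaluating such a probe separately at every layer and sequence position (and skipping empty tiles) yields the kind of accuracy grid described in the caption.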

The Flipped probe

The Previous Color Circuit

Overview of the actions performed by the OV circuits of attention heads involved in the previous color circuit.
Cosine similarities of different features after the OV circuit of Mine/Yours Heads
Cosine similarities of different features after the OV circuit of Mine/Yours Heads

Example of the Previous Color Circuit in Action

Example showcasing the Previous Color Circuit. Columns represent Layer and next Transformer Module (Attn/MLP). Rows represent sequence Position. Tiles use a blue/red color scale, with blue representing black and red representing white. Tiles flipped in the board representation are marked with a white/black rectangle.
Direct logit attribution to "D3 is mine" at layer 1, position 19 for each attention head and sequence position.
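Direct logit attribution of this kind decomposes a head's output into per-source-position contributions along a probe direction. A sketch of the bookkeeping (the weight shapes and names are generic transformer conventions, not OthelloGPT's actual parameter names):

```python
import numpy as np

def head_direction_attribution(resid, attn, W_V, W_O, probe_dir):
    """Per-source-position contribution of one attention head to a
    probe direction such as "D3 is mine".

    resid:     (seq, d_model) residual stream entering the layer.
    attn:      (seq, seq) attention pattern of the head.
    W_V, W_O:  (d_model, d_head) and (d_head, d_model) value/output weights.
    probe_dir: (d_model,) direction read off by the linear probe.

    Returns (seq, seq): entry [q, s] is how much the value copied
    from source position s (weighted by attn[q, s]) moves the
    residual stream at query position q along probe_dir.
    """
    ov = resid @ W_V @ W_O       # OV-circuit output per source position
    proj = ov @ probe_dir        # projection of each source's value
    return attn * proj[None, :]  # weight by attention paid to each source
```

Summing over source positions recovers the head's full projection onto the probe direction, so the per-source entries are an exact decomposition.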

Quantifying the Previous Color Circuit

Average accuracy of the previous color circuit for layer 2 over all sequence positions, split into tiles on the rim and tiles in the middle of the board

Average accuracy of the previous color circuit at sequence position 15 over all layers, split into tiles on the rim and tiles in the middle of the board

Attention Heads Perform the Previous Color Circuit on Different Regions of the Board

Contribution of each Attention Head to the "Tile is Yours" - "Tile is Mine" direction, when the head pays attention to a previous sequence position where the tile was flipped

A Flipping Circuit Hypothesis

Summary

Classifying Monosemantic Neurons

Direct Logit Attributions of Neurons in Layer 1 to "D3 is Flipped"
Neuron weights of L1N1411 projected to different linear probes. The top row displays input weight projections, while the bottom row shows output weight projections

Let R denote the set of rules. Each rule r ∈ R is defined by:

Example of a rule (t="C2 Yours", F=UP-LEFT, n=1)

A rule r(x) evaluates to true for a residual stream x when n "mine" tiles need to be flipped in the specified direction F before reaching tile t, which is yours.
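Under one literal reading of this definition (the direction names, coordinate convention, and starting square are my assumptions), a rule can be checked against a board state like this:

```python
# (row, col) offsets; "UP_LEFT" corresponds to the UP-LEFT direction
# in the example rule above. The coordinate convention is an assumption.
DIRECTIONS = {
    "UP_LEFT": (-1, -1), "UP": (-1, 0), "UP_RIGHT": (-1, 1),
    "LEFT": (0, -1), "RIGHT": (0, 1),
    "DOWN_LEFT": (1, -1), "DOWN": (1, 0), "DOWN_RIGHT": (1, 1),
}

def rule_active(board, start, t, F, n):
    """Check a rule (t, F, n) from a starting square.

    Walking from `start` in direction F, the next n tiles must all be
    "mine" (these are the tiles that would be flipped), and the tile
    after them must be t, which must be "yours".

    board: dict (row, col) -> "mine" | "yours"; absent keys are empty.
    """
    dr, dc = DIRECTIONS[F]
    r, c = start
    for _ in range(n):
        r, c = r + dr, c + dc
        if board.get((r, c)) != "mine":
            return False
    r, c = r + dr, c + dc
    return (r, c) == t and board.get((r, c)) == "yours"
```

For the example rule (t="C2 Yours", F=UP-LEFT, n=1): placing C2 at (row 1, col 2), a single "mine" tile at (2, 3) followed by a "yours" tile at C2 activates the rule when starting from (3, 4).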

Histogram with the number of neurons on the x-axis and the number of rules with that many corresponding neurons on the y-axis, for each layer. The y-axis is log-scaled; bin size is 3.

Testing the Flipping Circuit Hypothesis

Accuracy of flipping circuit across layers. Green represents the baseline, red the standard setup, and blue another setup where additionally the neuron activation of the rule-neurons are approximated by their average activation on positive samples (where the rule is active). Solid lines indicate accuracy for tiles in the board’s center, while dashed lines represent accuracy for tiles on the board’s rim.
Accuracy Comparison of Flipping Circuit Variants Against the Baseline
Average Number of Neurons in Flipping Circuit per Layer

Conclusion

Next Steps

An Unexpected Finding

Contact

 

  1. ^

I edited the Flipped direction to be orthogonal to the Yours direction. The effect of the Yours/Mine direction is stronger on the rim of the board, but I don't have a visualization on hand.

  2. ^

The minimum of the GELU function is roughly −0.17. So a mean activation difference above 0.17 suggests that the neuron has a positive activation when the rule is true and a negative activation otherwise.
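The footnote's constant can be checked numerically (a quick sketch using the exact, erf-based GELU):

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

xs = np.linspace(-5.0, 5.0, 100001)
vals = np.array([gelu(x) for x in xs])
min_val = vals.min()            # ≈ -0.17, attained near x ≈ -0.75
argmin_x = xs[vals.argmin()]
```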



