Interpretability of SAE Features Representing Check in ChessGPT

Published on October 5, 2024 8:43 PM GMT

Produced by Jon Kutasov and David Steinberg as a capstone project for ARENA. Epistemic status: 5 days of hacking, and there could be bugs we haven’t caught. Thank you to the TAs who helped us out, and to Adam Karvonen (author of the paper our work was based on) for answering our questions and helping us debug early in the project!

Overview

This paper documents the training and evaluation of a number of SAEs on Othello and ChessGPT. In particular, the authors train SAEs on layer 6 of 8 in these language models. One of the interesting results they find is that the latents captured by the SAEs perfectly or almost perfectly reconstruct several concepts about the state of the game.

ChessGPT is an 8-layer transformer model. It reads in PGN strings (of the form ‘;1.e4 e5 2.Nf3 …’) and predicts the next token in the sequence. Tokens in this case are single characters, restricted to those that can possibly appear in a PGN string. Karvonen et al. trained 500 SAEs on the residual stream after layer 6 of this model. Notably, they only used their model to predict the token after a “.”, rather than at any other point in a PGN string.

We wanted to evaluate whether these SAE features are (1) interpretable and (2) relevant to the predictions output by the model. Given that ChessGPT is reasonably small, chess is a well-structured game, and the game comes with nice ground-truth labels for different features of the board state, this can be done much more efficiently than a similar study on a general LLM.

Specifically, we focused on 30 SAE features that the original authors had found were perfect classifiers (using the metrics they had defined) of the question “is the player to move currently in check?” Our key findings are as follows:

- About half of the SAE features are assigned to a specific part of the PGN pattern: some specialize in predicting the token after a ‘.’, others in predicting the token after an ‘x’, and so on.
- Many of the 30 features identified as good classifiers of check turn out not to affect the model’s prediction in the subset of cases the original paper focused on (the token after a ‘.’).
- Some of the features capturing “the last token is ‘.’ and white (the player to move) is in check” can be described further in interpretable ways; for example, one feature appears to represent white being in check specifically by a knight.
- Activating individual SAE features while suppressing the rest does not always look intuitive, but there are some nice examples: a column name is certainly followed by a number, and a piece label is certainly followed by a column name.
- Activation patching on these SAE features is enough to significantly increase the logit for K, suggesting the model really does behave as if it were in check.

All of our code is available in this repo: https://github.com/jkutaso/chessgpt_sae/tree/main

Interpretability of SAE Features

A PGN string is very structured. The pattern is always something like ‘;1.e4 e5 2.Nf3 …’: ‘{number}.’ followed by a move, followed by a space, followed by a second move, followed by another space. Each move has the form ‘{piece}{column}{row}’, with several exceptions: pawns don’t get a piece character, but other pieces do (King=K, Queen=Q, etc.); if a piece is captured, an ‘x’ is included; and in the case of check or checkmate the move is followed by a ‘+’ or ‘#’. Because of this structure, we are able to group all indices in all the game strings that represent the same part of the pattern. The (somewhat arbitrary) breakdown we use is: ['dot', 'column', 'piece', 'first_space', 'second_space', 'x', 'miscellaneous']. It’s not clear that this is ‘right’, but it seems reasonable for a first-pass attempt.
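As a rough illustration, here is a minimal sketch of how each character position in a PGN string could be bucketed into these categories (the rules here are illustrative; our actual grouping code is in the repo linked above):

```python
# Sketch: assign each character of a PGN string to one of the seven categories.
def categorize_pgn_chars(pgn: str) -> list[str]:
    categories = []
    spaces_since_dot = 0
    for ch in pgn:
        if ch == '.':
            spaces_since_dot = 0
            categories.append('dot')
        elif ch == ' ':
            spaces_since_dot += 1
            categories.append('first_space' if spaces_since_dot == 1 else 'second_space')
        elif ch == 'x':
            categories.append('x')
        elif ch in 'abcdefgh':
            categories.append('column')
        elif ch in 'KQRBN':
            categories.append('piece')
        else:
            # move numbers, row numbers, ';', '+', '#', etc.
            categories.append('miscellaneous')
    return categories
```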

There are 4096 features in the SAE we studied. Of those, about 3900 are alive. We ran a dataset of 100 game strings through the model, recording the activations of the SAE features at each token of each game. Then, for each feature, we filtered for activations greater than 20% (an arbitrary threshold) of that feature’s max activation over the whole dataset. We then measured what fraction of these high activations occur within a single category (of the seven defined above). Below is a histogram of this value for all features.

[Figure: histogram over all features of the fraction of high activations falling within a single category]
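For reference, here is a minimal sketch of this measurement (array names and shapes are illustrative):

```python
import numpy as np

# acts: SAE activations over the dataset, shape [n_tokens, n_features]
# cats: integer PGN category (0-6) for each token, shape [n_tokens]
def category_concentration(acts, cats, threshold=0.2, n_categories=7):
    fractions = np.full(acts.shape[1], np.nan)
    for f in range(acts.shape[1]):
        a = acts[:, f]
        high = a > threshold * a.max()  # "high" = >20% of this feature's max
        if not high.any():
            continue  # dead feature
        counts = np.bincount(cats[high], minlength=n_categories)
        fractions[f] = counts.max() / high.sum()  # share in most common category
    return fractions  # histogram these values to get the plot above
```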

The main takeaway here is that almost half of the features have their high activations concentrated in a single category, indicating that their main job is to handle that part of the pattern. They likely still serve multiple purposes within that case (for example, “the last token is in this category, and I want to contribute X to what the next logit should be”). We believe a better category separation would make this histogram even more concentrated at the top.

 

To further confirm that these features are classifying which part of the PGN pattern we’re in, we tried activating these latents while suppressing the rest. They don’t all look super intuitive, but here are a couple of nice examples:

 

A column name is certainly going to be followed by a number.

 

A piece label is certainly going to be followed by a column name.
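A minimal sketch of this kind of steering, assuming a TransformerLens-style hooked model and an SAE exposing encode/decode (all names here are illustrative, not our exact implementation):

```python
import torch

def steer_single_latent(model, sae, tokens, feature_idx, value, layer=6):
    """Activate one SAE latent, suppress all others, and decode back in."""
    def hook(resid, hook):
        latents = sae.encode(resid)          # [batch, seq, n_features]
        steered = torch.zeros_like(latents)  # suppress every latent...
        steered[..., feature_idx] = value    # ...except the one we turn on
        return sae.decode(steered)           # overwrite the residual stream
    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)]):
        logits = model(tokens)
    return logits[0, -1]                     # next-token logits at the last position
```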

 

However, the SAE feature classification doesn’t stop there! We can also get some intuition for what individual features are representing within the groups we established above. Specifically, we focused on the features that represent check given that the last token was a period. This is right before white makes its move. 

For each feature, we define a custom function as follows. First, we defined a few intuitive descriptors that are easy to evaluate given a board state (where is the king, what piece is attacking the king, what direction is it attacking from, etc.). For each SAE feature, we found all of the highly activating board states and measured these descriptors on each of them. Any descriptor that applied to most of the examples was kept as a classification tool. So, for example, if we found that most of the examples had the king under attack by either a queen or a bishop, we hypothesized this as a description of the feature. The custom function then checks whether all of the descriptors established for the feature are satisfied.

For a concrete example, say we keep track of the piece attacking the king and the direction it is attacking from. Suppose we find, for a specific feature’s high-activation set, that the piece attacking the king is always a bishop, but the attacks come from more than one direction. The custom function here will just check whether the attacking piece is a bishop. If the attacks had always come from the same direction, the custom function would check both things.
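Here is a sketch of what such descriptors and a resulting custom function might look like, using python-chess (the descriptor set and names are illustrative):

```python
import chess

def check_descriptors(board: chess.Board) -> dict:
    """Simple descriptors of a position where the side to move is in check."""
    king_sq = board.king(board.turn)
    checkers = board.checkers()  # squares of the pieces giving check
    attacker_types = {board.piece_at(sq).piece_type for sq in checkers}
    directions = set()
    for sq in checkers:
        df = chess.square_file(sq) - chess.square_file(king_sq)
        dr = chess.square_rank(sq) - chess.square_rank(king_sq)
        # sign of the file/rank offset from the king to the attacker
        directions.add((df and df // abs(df), dr and dr // abs(dr)))
    return {"attacker_types": attacker_types, "directions": directions}

# Custom function for a hypothetical feature whose high-activation boards were
# always checked by a bishop, from varying directions (so direction isn't checked):
def bishop_check_feature(board: chess.Board) -> bool:
    return check_descriptors(board)["attacker_types"] == {chess.BISHOP}
```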

Classifying the features using these descriptors was pretty successful. A few of the features we found were quite interesting; for example, one appears to represent white being in check specifically by a knight.

There were many features with less intuitive explanations, and even the interpretable ones almost all had some examples that differed seemingly at random. It is possible that we just have not come up with a way of thinking about the board that aligns perfectly with how the model does it, but these results show that there is potential in some sort of systematic interpretation of the SAE features.

Here is a plot showing the precision of classifying using the custom function generated as described above instead of just classifying by ‘is check’:

[Figure: precision of the custom-function classifier vs. the plain ‘is check’ classifier, per feature]

 

 

Activation Patching to Trick the Model into Moving the King

Based on the above model of how the features work, we should be well equipped to use this information to steer the model’s behavior. In particular, we use the intuition that a player is much more likely to move their king when in check than when not in check. Thus, we can evaluate whether the model thinks it is in check by checking whether it increases the logit for K.

Throughout this section, we intervened in the residual stream by setting the selected SAE features to a fixed multiple of their maximum observed activation and decoding the result back into the residual stream.
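A minimal sketch of this patching intervention (again assuming TransformerLens-style hooks; preserving the SAE’s reconstruction error term is an implementation detail we assume here; max_acts holds each feature’s maximum activation over the dataset):

```python
import torch

def patch_features(model, sae, tokens, feature_idxs, strength, max_acts, layer=6):
    """Set chosen SAE latents to strength * their max activation; keep the rest."""
    def hook(resid, hook):
        latents = sae.encode(resid)
        error = resid - sae.decode(latents)  # what the SAE can't reconstruct
        latents[..., feature_idxs] = strength * max_acts[feature_idxs]
        return sae.decode(latents) + error   # patched reconstruction + error term
    with model.hooks(fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)]):
        logits = model(tokens)
    return logits[0, -1]
```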

Below are bar charts showing the effect of patching all 30 of the initially identified SAE features to 0.3 * their max activation. We split into game states where the player to move is actually in check and cases where they aren’t.

[Figure: bar charts of next-token logits before and after patching, in the check and non-check states]

When the player is already in check, the logit for K goes from reasonably high to even higher. When the player is not in check, the logit for K goes from about 0 to by far the highest logit.

These two plots confirm that these features do in fact get used in predicting the next token by capturing if the player to move is in check.

It is also worth noting that many of the logits that are invalid for this point in the PGN pattern get boosted by this intervention. This is likely because we are boosting features that represent being in a different part of the pattern (as discussed in the previous section).

To visualize the effect of this intervention for each individual feature, we plotted the logit effect vs. the strength of the intervention for each feature in both the check and non-check states. Here is one interesting example:

[Figure: logit effect vs. intervention strength for one feature, in the check and non-check states]

 

This feature pushes the logit for K up significantly in both cases. However, it also boosts the logits for R and _, indicating that it may be doing different tasks depending on where we are in the PGN pattern, relying on other features to suppress it when it is not the right time. Thus, it may be one of the few features that isn’t strongly attached to a single point in the PGN pattern.

Such plots are available for all SAE features here: https://github.com/jkutaso/chessgpt_sae/blob/main/plots/logit_diff_by_feature_and_strength.png.

Can the model handle typos?

The last idea we looked into was the model’s ability to handle a typo. In particular, since the ‘+’ character is redundant in a PGN string, the model should in principle be able to figure out the check state whether or not it sees the ‘+’ in the PGN string.

To test this, we first confirmed that the model can predict the ‘+’ token by feeding in a set of PGN strings cut off right before a ‘+’ and confirming that it predicts ‘+’ successfully ~97% of the time. This indicates that the model does have the capacity to recognize that it is in check before seeing a ‘+’.

Next, we evaluated how the model responds to a prompt where the ‘+’ has been omitted, for example replacing a PGN string like ‘... Nd4+ 26.’ with ‘... Nd4 26.’. The move stays the same, and we still ask for the token that follows the ‘.’ to match the methodology of the original paper. In this case, we find that the 30 SAE features that measure check approximately never fire:

[Figure: activations of the 30 check-classifying features when the ‘+’ is omitted]
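A minimal sketch of both checks (with char_to_id standing in for ChessGPT’s character-level vocabulary; the interfaces are illustrative):

```python
import torch

def predicts_plus(model, char_to_id, pgn: str) -> bool:
    """Cut the game off right before its first '+' and see if the model predicts it."""
    prefix = pgn[:pgn.index('+')]
    tokens = torch.tensor([[char_to_id[c] for c in prefix]])
    logits = model(tokens)[0, -1]
    return logits.argmax().item() == char_to_id['+']

def drop_check_marker(pgn: str) -> str:
    """Omit the redundant '+' while leaving the moves themselves intact."""
    return pgn.replace('+', '')
```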

However, interestingly, the model is still surprisingly good at predicting legal moves in this case. 

[Figure: probability of outputting a legal move with and without the ‘+’ in the PGN string]

This plot shows that the probability of returning a legal move drops from about 100% to 60% when the ‘+’ is omitted from the PGN string. However, if the model were just choosing random moves that would be legal aside from the fact that white is in check, we’d expect the drop to be even larger. Thus, the model still retains some estimate of the information it needs to get out of check, despite not activating any of the features that seem to classify check.
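To make that baseline concrete, here is a sketch of it in python-chess: among moves that are legal except that they may leave the king in check, how often would a uniformly random choice be fully legal?

```python
import chess

def random_move_legality(board: chess.Board) -> float:
    """Fraction of check-ignoring moves that would actually be legal."""
    pseudo = list(board.pseudo_legal_moves)  # may leave the king in check
    legal = set(board.legal_moves)           # must resolve the check
    return sum(m in legal for m in pseudo) / len(pseudo) if pseudo else 0.0
```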

This is surprising! We see that (1) omitting the ‘+’ reduces the relevant SAE activations, and (2) reducing these activations directly significantly reduces the logit for K. However, cutting the ‘+’ does not make as large a difference to the probability of generating legal moves as those two observations would imply. This is worth investigating further.

Future Work

Conclusion

Building on Karvonen et al.’s work on interpreting SAE features in ChessGPT, we’ve found a few promising insights. We discovered that many features learned by the SAE are highly interpretable, aligning with specific aspects of the PGN structure and board state. Notably, features related to "check" were often specific to a certain context and activated predictably in the expected scenarios. However, we also uncovered complexities: some features seemed to serve multiple purposes depending on context, and the model retained some capacity to predict reasonable moves when a redundant (but expected) token was removed. We hope to continue working on this project beyond ARENA, and hope that these results can help in the search for mechanistic explanations of the behavior we’ve encountered.


