少点错误 · October 15, 2024
Mechanistic Exploration of Gemma 2 List Generation

 


Published on October 14, 2024 5:04 PM GMT

Summary

This post explores Gemma 2 2b's ability to generate lists and, in particular, the mechanism by which it ends them. We build a synthetic dataset of list completions from several prompt templates, propose a proxy metric for the "list ending behavior" (the logit difference between the blank and line-break tokens at the last item position), select a subset of RS and Attn SAE layers for experiments, introduce a gene-expression-matrix-style visualization of important features, and run causal ablation experiments. Ablating the top features of some SAEs (notably RS layer 0) strongly suppresses the list ending behavior, while others have little effect.

Preface

This post is a follow-up to a 10-hour research sprint for a (failed) MATS application; the structure and methods are similar, but the content has been substantially expanded.

 

In this post, we will explore the mechanisms behind Gemma 2 2b’s ability to create lists of items when prompted to.

Specifically, we are interested in the mechanism by which Gemma knows when to end a list.

To do so, we leverage Sparse Autoencoders (SAEs) and techniques like gradient-based attribution and causal ablation.

 

 

Data Generation

To investigate the behavior of Gemma when asked for a list, we create a synthetic dataset of model responses to several templates.

We ask GPT-4o to provide a list of topics to create lists about.

This results in 23 final topics about which we will ask Gemma 2 to create lists.

Some examples of those are: Vegetables, Countries, Movies, Colors, etc.

We create several templates for Gemma to generate lists:

Template 1 (Base)

Provide me with a short list of {topic}. Just provide the names, no need for any other information.

Template 2 (Contrastive)

Provide me with a long list of {topic}. Just provide the names, no need for any other information.

Template 3

Provide me with a list of {topic}. Just provide the names, no need for any other information.

Unless otherwise indicated, all the analysis and explorations were done with Template 1 (Base).

For each topic, we sample 5 Gemma completions with top-p = 0.9 and temperature = 0.8.
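The dataset construction described above can be sketched as follows. The topic list is abbreviated (the post uses 23 topics), and the generation call in the final comment is only indicative of how Gemma 2 2b might be sampled, e.g. via the `transformers` library; it is not the authors' exact pipeline.

```python
# Sketch of the synthetic dataset construction (abbreviated topic list).
TEMPLATES = {
    "base": "Provide me with a short list of {topic}. Just provide the names, no need for any other information.",
    "contrastive": "Provide me with a long list of {topic}. Just provide the names, no need for any other information.",
    "neutral": "Provide me with a list of {topic}. Just provide the names, no need for any other information.",
}

TOPICS = ["Vegetables", "Countries", "Movies", "Colors"]  # the post uses 23 topics

N_SAMPLES = 5  # completions per (template, topic) pair

def build_prompts(template_name: str) -> list[str]:
    """Fill the chosen template with every topic."""
    template = TEMPLATES[template_name]
    return [template.format(topic=t) for t in TOPICS]

prompts = build_prompts("base")
# Each prompt would then be sampled N_SAMPLES times from Gemma 2 2b, e.g.:
# model.generate(..., do_sample=True, top_p=0.9, temperature=0.8)
```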

 

The generated outputs look like the following, colored by token/position type.

Structure of the generated list; some of the tokens are shown between tags for clarity.

The mysterious filler token 

By far, the most interesting behavior that we’ve observed through different topics and temperatures is the model’s habit of including blank tokens near the end of the list. The model includes whitespace tokens at the very last item position (like the Celery example) or, in some cases, at the last few positions.

Metric 

We use a proxy metric to capture the "list ending behavior": concretely, the logit difference between the blank token and the line-break token at the last token position of the last item in the list (the Celery position in the example above).

Full explanation in the link post.
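A minimal sketch of this proxy metric, with hypothetical token ids (the real ids depend on Gemma's tokenizer) and toy logits standing in for a model forward pass:

```python
# Sketch of the proxy metric: logit difference between the blank token
# and the line-break token at the last item's final position.
BLANK_ID = 139    # hypothetical id of the whitespace "filler" token
NEWLINE_ID = 108  # hypothetical id of the "\n" token

def list_ending_metric(logits_at_pos: dict[int, float]) -> float:
    """logit(blank) - logit(newline) at the chosen position.
    Positive values mean the model prefers the blank filler token."""
    return logits_at_pos[BLANK_ID] - logits_at_pos[NEWLINE_ID]

# Toy example: the model slightly prefers the blank token here.
toy_logits = {BLANK_ID: 2.5, NEWLINE_ID: 1.0}
print(list_ending_metric(toy_logits))  # 1.5
```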

 

For this investigation we selected 5 layers for residual-stream (RS) SAEs and 5 layers for attention (Attn) SAEs.

SAE     Layers
Attn    2, 7, 14, 18, 22
RS      0, 5, 10, 15, 20

All the SAEs correspond to the 16k series and have the same sparsity as the ones present in Neuronpedia.

All the experiments with SAEs are done on a one-by-one basis (a single SAE is spliced into the model at a time).

Attribution of the Proxy Metric in the feature basis.

Given the vast number of features that activate across the dataset, it is important to pin down the features most important for the behavior we are investigating. To do so, we use attribution techniques.

Concretely, we want to obtain the most important features that explain a given metric. This metric should reflect the behavior we are interested in.

In this case, given that we are interested in the “list ending behavior”, a suitable metric is the logit difference between the blank-token and line-break-token logits at the position just before the end of the list.

We run attribution experiments for all the Attention and RS SAEs (one by one) across the whole dataset, and keep the top (position, feature) tuples.
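The core of gradient-based attribution can be sketched as scoring each feature by activation × gradient, the first-order estimate of the effect of zeroing that feature on the metric. In practice the gradients come from a backward pass through the model with the SAE spliced in; here they are toy values so the example is self-contained.

```python
# Sketch of gradient-based attribution in the SAE feature basis.
# Each active feature f_i is scored by f_i * dm/df_i, the linear
# approximation to the change in the metric m if f_i were set to 0.

def attribute(features: list[float], grads: list[float], top_k: int = 5):
    """Return indices of the top_k features by |activation * gradient|,
    together with the full score list."""
    scores = [f * g for f, g in zip(features, grads)]
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)
    return order[:top_k], scores

features = [0.0, 3.0, 0.5, 2.0]   # toy SAE activations at one position
grads    = [1.0, -0.2, 4.0, 0.1]  # toy dm/df at the same position
top, scores = attribute(features, grads, top_k=2)
print(top)  # [2, 1]: |0.5 * 4.0| = 2.0 and |3.0 * -0.2| = 0.6 dominate
```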

Visualization of Features

We introduce a kind of plot resembling gene expression matrices, which enables us to detect local communities of features that are especially present in a subset of examples.

 

Feature Expression for important features in RS 5, across the whole dataset.

In the bottom left corner, we can see a local community. The large horizontal area represents a subset of features that are important across the whole dataset.
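The idea behind this kind of plot can be sketched as follows: build an examples × features matrix of attribution scores and sort rows and columns by their totals, so globally important features collect into a band while local communities remain visible as blocks. (The actual rendering, e.g. with `matplotlib.pyplot.imshow`, is omitted; this is an illustrative reordering, not necessarily the authors' exact procedure.)

```python
# Sketch of the gene-expression-style matrix: rows are examples, columns
# are features, entries are attribution scores. Sorting by row/column
# totals groups globally important features and exposes local blocks.
def sort_matrix(mat: list[list[float]]) -> list[list[float]]:
    n_rows, n_cols = len(mat), len(mat[0])
    row_order = sorted(range(n_rows), key=lambda r: sum(mat[r]), reverse=True)
    col_order = sorted(range(n_cols), key=lambda c: sum(row[c] for row in mat), reverse=True)
    return [[mat[r][c] for c in col_order] for r in row_order]

# Toy matrix: feature 0 matters everywhere, features 1-2 only in example 2.
mat = [
    [1.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],
    [0.8, 0.7, 0.6],
]
sorted_mat = sort_matrix(mat)  # example 2 and feature 0 move to the front
```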

 

Causal ablation Experiments

Experiments must be performed to empirically test whether the top features affect the behavior of interest in the expected way.

To this end, we performed causal ablation experiments that consisted of setting the top 5 most important (position, feature) pairs to 0 and generating completions. This allows us to investigate whether the list ending behavior is maintained.
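The zeroing step at the heart of these ablations can be sketched on a toy activation matrix. In practice it runs inside a forward hook on the SAE feature activations during generation, before the decoder reconstructs the residual stream; here only the ablation itself is shown.

```python
# Sketch of the causal ablation: zero out chosen (position, feature)
# entries in a positions x features activation matrix.
def ablate(acts: list[list[float]], pairs: list[tuple[int, int]]) -> list[list[float]]:
    """Set acts[pos][feat] = 0.0 for each (pos, feat) pair, on a copy."""
    out = [row[:] for row in acts]  # copy, leave the original intact
    for pos, feat in pairs:
        out[pos][feat] = 0.0
    return out

acts = [[0.5, 2.0], [1.5, 0.0]]  # toy positions x features activations
ablated = ablate(acts, [(0, 1), (1, 0)])
print(ablated)  # [[0.5, 0.0], [0.0, 0.0]]
```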

Method

Metrics

 

Results

 

We report the average difference in key metrics between the Ablated Generation and the Base Generation.

The full results are in tables in the link post.

If we average the ablation effects on generation across topics, we get the following key metrics.

 

                 diversity_mean  diversity_variance  n_items_mean  n_items_variance
attn_layer_14    0.111443        0.048028            1.963043      6.213043
attn_layer_18    0.065051        0.028595            0.718116      1.526087
attn_layer_2     0.221969        0.064955            3.265217      8.373913
attn_layer_22    0.069707        0.035557            0.989855      2.765217
attn_layer_7     0.177179        0.043048            1.983333      3.073913
res_layer_0      0.628397        0.325129            22.221739     82.069565
res_layer_7      0.329185        0.080516            7.098551      21.591304

 

One striking example is the SAE in the RS at layer 0: upon close inspection, the top-5 features for this SAE were closely related to the "short" token, and after ablation the average number of items is close to that of Template 2 (Contrastive).

For the other SAEs, the resulting number of items was close to that of Template 3.

The SAEs that add less than 1 extra item on average can be classified as not effective at suppressing the list ending behavior.

Conclusions and Future Directions

In summary, applying Mechanistic Interpretability techniques to real-world problems is not straightforward, and many compromises and assumptions must be made. This work is intended as a first approximation to using MI techniques in scenarios that more closely resemble the real world.

Even though some progress has been made in attacking this problem, this investigation falls short of the goals of mechanistic interpretability.

While the empirical results are encouraging (full elimination of the list ending behavior with just 5 edits), they are not fully satisfactory.

Even though ablating some key features prevents the list ending behavior, other simple methods, such as restricted sampling or steering vectors, achieve the same.

Future Work

Several areas for future research have been identified:

- Benchmarking different heuristics for selecting layers based on memory versus circuit faithfulness.
- Applying established techniques for feature explanation to the relevant dataset.
- Utilizing Transcoders and hierarchical attribution to develop complete linear circuits.

