Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Published on January 21, 2025 2:02 AM GMT

(Status: this feature is in beta. Use at your own risk. I am not affiliated with EleutherAI.)

Sparse Autoencoders are a popular technique for understanding what is going on inside large language models. Recently, researchers have started using them to steer model outputs by going directly into the "brains" of the models and editing their "thoughts" — called "features".

EleutherAI's Evaluation Harness is an awesome project which implements a wide variety of model evaluations, including MMLU, WMDP, and others. You can now test SAE-steered models on many of those benchmarks! (Not all of the evaluations are available for SAE-steered models right now: just the ones that only need 'loglikelihood'.) That includes multiple-choice benchmarks such as MMLU and WMDP.

As part of my research project in Northwestern's Computer Science Research Track[1], my team and I wanted to be able to easily evaluate steered models on benchmarks, MMLU and WMDP in particular. It was thanks to the AI Safety Fundamentals Course that I realized that I should put in extra effort to make the tool available to others and communicate how to use it. This constitutes my AI alignment project for the course.

How to Evaluate SAE-Steered Models

I've made a pull request to EleutherAI, so hopefully my contributions will be available there in the near future. If my work doesn't meet their criteria and they decline my code, then the steered model evaluation code will still be available on my fork here.

Step One: Installation

If EleutherAI has approved my pull request, run these lines (from the installation section of the README).

git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

If not, run these lines

git clone --depth 1 https://github.com/AMindToThink/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
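
Either way, you can sanity-check the installation by asking the harness to list the tasks it knows about; this should print a long list that includes the mmlu and wmdp subtasks:

lm_eval --tasks list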

Step Two: Deciding on your interventions

You first need to ask:[2] which model do you want to steer? Which SAE release and layer will you use? Which feature will you edit, and with what coefficient?

Neuronpedia is the place to go for this. If you select a model, you can search inside of it for SAE features.

When poking around, you might find that index 12082 in the gemma-2-2b model's 20-gemmascope-res-16k SAE refers to dogs and has a maximum observed activation of 125.39. Let's try adding roughly twice the maximum activation to the dog feature and see what happens![3]
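
If you'd rather inspect the feature programmatically, here is a minimal sketch using sae_lens (the release and ID strings are the same ones used in the csv below; note that this is illustrative, not part of the harness):

from sae_lens import SAE

# Load the canonical Gemma Scope residual-stream SAE for layer 20.
# SAE.from_pretrained returns (sae, config dict, sparsity tensor).
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",
    sae_id="layer_20/width_16k/canonical",
)

# The decoder row for latent 12082 is the "dog" steering direction.
dog_direction = sae.W_dec[12082]
print(dog_direction.shape)  # (d_model,), i.e. (2304,) for gemma-2-2b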

Step Three: Make a steer.csv file

latent_idx,steering_coefficient,sae_release,sae_id,description
12082,240.0,gemma-scope-2b-pt-res-canonical,layer_20/width_16k/canonical,this feature has been found on neuronpedia to make the model talk about dogs and obedience

The columns of the csv are:

- latent_idx: the index of the SAE latent (feature) to edit
- steering_coefficient: the multiple of the feature's decoder vector to add to the activations
- sae_release: the name of the SAE release in sae_lens
- sae_id: the ID of the specific SAE within that release
- description: a human-readable note on what the feature does, for your own reference

This example csv file can be found in examples/dog_steer.csv.
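
As a quick sanity check before kicking off a long run, you can load the file and confirm the columns are what the harness expects (a sketch; pandas and the relative path are assumptions):

import pandas as pd

df = pd.read_csv("examples/dog_steer.csv")
expected = {"latent_idx", "steering_coefficient", "sae_release", "sae_id", "description"}
missing = expected - set(df.columns)
assert not missing, f"steer csv is missing columns: {missing}"
print(df.to_string())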

Step Four: Run the command

EleutherAI offers the lm_eval command which runs tests. For SAE steered models, set --model to sae_steered_beta[4]. SAE steered models take two --model_args, a base_name and a csv_path. base_name needs to be the name of a model which has pretrained SAEs available on sae_lens. csv_path is the path to your edits.

The following is an example command that tests the dog-obsessed model on abstract algebra from MMLU, using GPU 0 and an automatically-determined batch size, and puts the results in the current folder.[5]

lm_eval --model sae_steered_beta \
    --model_args base_name=google/gemma-2-2b,csv_path=/home/cs29824/matthew/lm-evaluation-harness/examples/dog_steer.csv \
    --tasks mmlu_abstract_algebra \
    --batch_size auto \
    --output_path . \
    --device cuda:0
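
Once the run finishes, lm_eval writes a results JSON under --output_path; the exact filename includes a timestamp, so the glob below is a sketch rather than a guaranteed path:

import glob
import json

# Find the most recent results file written by lm_eval under the output path.
result_files = sorted(glob.glob("./**/results_*.json", recursive=True))
with open(result_files[-1]) as f:
    results = json.load(f)

# Per-task metrics live under the "results" key.
print(results["results"]["mmlu_abstract_algebra"])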

Future work:

Right now, only unconditionally adding a fixed multiple of the feature is supported. In the future, it would be great to add conditional steering and other ways to intervene in the models.[6]
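
To illustrate, conditional steering might look something like the following sketch: rather than always adding the vector, only add it at positions where the SAE says the feature already fires above a threshold. The hook below is hypothetical and not part of the harness:

from jaxtyping import Float
from torch import Tensor
from sae_lens import SAE
from transformer_lens.hook_points import HookPoint

def conditional_steering_hook(
    activations: Float[Tensor, "batch pos d_in"],
    hook: HookPoint,
    sae: SAE,
    latent_idx: int,
    steering_coefficient: float,
    threshold: float,
) -> Tensor:
    """Hypothetical: steer only where the feature already activates above `threshold`."""
    feature_acts = sae.encode(activations)[..., latent_idx]  # shape: batch pos
    mask = (feature_acts > threshold).unsqueeze(-1)          # shape: batch pos 1
    return activations + mask * steering_coefficient * sae.W_dec[latent_idx]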

There are still bugs to fix and functionality to implement. The other types of evaluations do not yet work (the ones involving generating longer sequences). In order to include those evaluations, we need to implement loglikelihood_rolling and generate_until.

Conclusion

Please enjoy evaluating your steered models! If you want to contribute (and if EleutherAI haven't approved my pull request) then you can make a pull request to my repo and I'll check it out.[7]

  1. ^

    Our project as a whole is to build upon prior work on Unlearning with SAE to make the process robust to adversarial attacks.

  2. ^

    ARENA's training materials are spectacular if you want to get into AI Safety. You can find all the materials here and the Interpretability with SAEs chapter here.

  3. ^

    This is the example ARENA used in their training materials.

  4. ^

    This feature is in beta because:

      - It is largely untested
      - The other types of evaluations do not yet work (the ones involving generating longer sequences)
      - I'm new to both SAE_Lens and EleutherAI's evaluation harness, and I don't know whether I implemented everything properly
  5. ^

    Funnily enough, the dog-steered gemma-2-2b did slightly better than the default gemma-2-2b, though the difference is well within the standard error. Both did really poorly, only slightly better than chance (0.25).

    Default: "acc,none": 0.3, "acc_stderr,none": 0.046056618647183814

    Dog-steered: "acc,none": 0.34, "acc_stderr,none": 0.047609522856952344

  6. ^

    Here's the hook I'm using. Hooks are placed "in between layers": they process the output of the previous layer before passing it on to the next. Once again, this is straight from ARENA.

    from jaxtyping import Float
    from torch import Tensor
    from sae_lens import SAE
    from transformer_lens.hook_points import HookPoint

    def steering_hook(
        activations: Float[Tensor],  # Float[Tensor, "batch pos d_in"]
        hook: HookPoint,
        sae: SAE,
        latent_idx: int,
        steering_coefficient: float,
    ) -> Tensor:
        """
        Steers the model by returning a modified activations tensor, with some
        multiple of the steering vector added to all sequence positions.
        """
        return activations + steering_coefficient * sae.W_dec[latent_idx]
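
    For context, a hook like this would typically be attached with TransformerLens via functools.partial; the hook point name and the sae/model/tokens variables below are illustrative assumptions:

    from functools import partial

    # Bind the SAE and steering parameters, then run with the hook attached
    # to layer 20's residual stream (hook point name assumed for illustration).
    hook_fn = partial(
        steering_hook,
        sae=sae,
        latent_idx=12082,
        steering_coefficient=240.0,
    )
    logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[("blocks.20.hook_resid_post", hook_fn)],
    )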
  7. ^

    I don't know how citations work. Can I get citations from people using my software? Do I need to wait until my paper is out, and then people can cite that if they use my software? If you know how software citations work, please drop a comment below.


