Interpretable Fine Tuning Research Update and Working Prototype

This post introduces ISaeRFT (Interpretable Sparse Autoencoder Representation Fine-Tuning), a method that uses gradient-based optimization to steer model behavior. Inspired by the Bitter Lesson, the author explores training components that act on sparse autoencoder (SAE) latents in order to gain fine-grained control over model behavior. ISaeRFT aims to improve data auditing, validate SAE interpretability methods, deepen understanding of safety-relevant behaviors, and support evaluation through model diffing. The work is at an early research stage, and the author invites other researchers to experiment and give feedback.

💡 The core of ISaeRFT is to use gradient-based optimization to train components that act on sparse autoencoder (SAE) latents, steering model behavior in a targeted way.

🔑 ISaeRFT aims to improve data auditing: watching how features change during training can surface harmful biases in the data and potential alignment issues, helping to understand and address safety concerns in the model.

🔬 The method can also be used to validate SAE interpretability techniques: if the latents ISaeRFT changes during training match their expected meanings, that supports existing SAE explanation methods.

⚙️ Combined with reinforcement-learning-based methods, ISaeRFT can help explain how those methods train models and reveal what a dataset teaches the model.

🧪 The author encourages experiments applying ISaeRFT to different model layers, downstream tasks, and attention mechanisms, and comparing optimized steering vectors with manually designed ones.

Published on May 16, 2025 3:44 AM GMT

Status: Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code. 

GitHub repository

Main Idea:

Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely to reach any sort of optimum for steering tasks such as SAE-based unlearning or helpfulness steering. Inspired by the Bitter Lesson, I have decided to research gradient-based optimization of steering vectors. It should be possible to add trained components into SAEs that act on the latents. These trained components could learn optimal values and algorithms, and if we choose their structure carefully, they can retain the interpretable properties of the SAE latents themselves. I call these fine-tuning methods Interpretable Sparse Autoencoder Representation Fine-Tuning, or “ISaeRFT”.

A diagram of an interpretable fine tuning component acting on the interpretable latents of a sparse autoencoder. The drawing depicts a feed forward neural network, though my recent research has focused on learning scaling factors to apply to features.
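To make this concrete, here is a minimal, self-contained sketch of the kind of IA3-style component my recent work focuses on: one learnable multiplicative scale per SAE latent. The class and argument names are illustrative only, not the repository's actual IsaerftIA3 implementation.

import torch
import torch.nn as nn

class LatentScaler(nn.Module):
    """Toy IA3-style ISaeRFT component: one learnable multiplicative
    scale per SAE latent, initialized to the identity."""

    def __init__(self, n_latents: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(n_latents))

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, seq, n_latents) SAE activations. Scaling each
        # latent independently keeps the per-feature interpretability of
        # the SAE: the learned value says how much that feature is
        # amplified or suppressed.
        return latents * self.scale

Because the scales start at one, the wrapped SAE is initially a no-op, and training only moves behavior away from the base model where the loss demands it.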

 

Interpretable components that act on SAE latents contribute to a number of safety-relevant goals: auditing training data by watching which features change during fine-tuning; validating SAE interpretability methods when the latents that change match their expected meanings; understanding how RL-based training methods and their datasets shape model behavior; and evaluating the result via model diffing.

 

Project Update:

The code, though imperfect, works well enough that I encourage others to play around with it and see what you can discover. 

The main features are:

DPO loss goes down significantly in as few as 20 steps.
I’m very thankful that Weights and Biases allows me to have over 32,000 plots. Each plot shows how a learnable scaling factor acting on one latent of the sparse autoencoder changes over training. Note how many of these are flat: a scaling factor only receives a nonzero gradient when its feature is active, and SAE features activate sparsely! (A minimal illustration follows below.)
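As a toy illustration of why so many plots are flat (standalone PyTorch, not code from the repository): the gradient of a learnable scale is exactly the latent it multiplies, so it is zero whenever the feature is inactive.

import torch

latents = torch.tensor([0.0, 2.5, 0.0, 1.0])  # sparse SAE activations for one token
scale = torch.ones(4, requires_grad=True)      # learnable per-latent scaling factors
loss = (latents * scale).sum()
loss.backward()
print(scale.grad)  # tensor([0.0, 2.5, 0.0, 1.0]): zero gradient where the feature is inactive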

 

 

 

View My Runs:

As I write this, I’m running a hyperparameter sweep on a Gemma-2-2b Matryoshka SAE.

 

You can see the features being trained in real time here.

 

I’ve previously done some runs using non-Matryoshka SAEs, which you can see here. For this sweep, the parameter plots themselves have descriptions, making it convenient to search the panels for features that interest you. You can always refer to Neuronpedia for more information about the features.
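If you want to look features up programmatically rather than through the website, a rough sketch follows. The endpoint shape and the model/SAE identifiers below are assumptions on my part; check Neuronpedia's API documentation for the current interface.

import requests

model_id = "gemma-2-2b"          # hypothetical identifiers; substitute the ones
sae_id = "20-res-matryoshka-dc"  # Neuronpedia uses for the SAE you trained against
feature_idx = 1234
url = f"https://www.neuronpedia.org/api/feature/{model_id}/{sae_id}/{feature_idx}"
feature = requests.get(url, timeout=30).json()
print(feature.get("explanations"))  # auto-interp descriptions of the feature, if present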

How to Use:

I’ll write more detailed documentation soon. If you use my code and have questions, please open a GitHub issue.

 

Setup:

Clone the repository.

Pip install the requirements file.

 

Review ia3_training_isaerft_disable_available.py to see how to optimize steering vectors.

 

src/export_wandb_data allows you to pull data from Weights and Biases to find which features are scaled up and scaled down the most.
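A rough sketch of the kind of query that script performs, using the public wandb API. The run path and the assumption that scaling factors are logged under names containing "scale" are mine, not necessarily how the repository names things.

import wandb

api = wandb.Api()
run = api.run("my-entity/isaerft/abc123")  # hypothetical entity/project/run id
history = run.history()                    # (sampled) pandas DataFrame of logged values
scale_cols = [c for c in history.columns if "scale" in c]
final = history[scale_cols].ffill().iloc[-1]          # last logged value of each scaling factor
print(final.sort_values().head(10))                   # features scaled down the most
print(final.sort_values(ascending=False).head(10))    # features scaled up the most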

 

src/model_components contains the torch components which will be placed into the sparse autoencoders to process the latents. At the moment, only IsaerftIA3 is supported. 

Also in the model_components folder are IsaerftPeft and IsaerftConfig. 

 

Example usage (from ia3_training_isaerft_disable_available.py)

 

sae_release = "gemma-2-2b-res-matryoshka-dc"
model_sae_id = "blocks.20.hook_resid_post"
isaerft_config = IsaerftConfig(
    target_hooks=[
        (sae_release, model_sae_id),  # Match the SAE we loaded
    ],
    depth=None,
    ia3=True
)
peft_model = IsaerftPeft(model, isaerft_config)

 

After this, peft_model can be used in the same way as a normal Hugging Face PEFT model.[1]
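For example, here is a hedged sketch of preference-tuning the wrapped model with TRL's DPOTrainer, the kind of loop behind the DPO-loss curves above. The dataset and tokenizer are assumed to exist already, and the exact constructor arguments depend on your trl version.

from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(output_dir="isaerft-dpo", max_steps=20, learning_rate=1e-3)
trainer = DPOTrainer(
    model=peft_model,             # the ISaeRFT-wrapped model from the snippet above
    args=training_args,
    train_dataset=preference_ds,  # a dataset with "prompt", "chosen", "rejected" columns
    processing_class=tokenizer,   # older trl versions call this argument "tokenizer"
)
trainer.train()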

Next Steps:

 

Conclusion:

If you find this interesting, please get in touch. I’m especially interested in hearing about research I may have missed which would benefit me to know.

  1. ^

    Oddly, having a Hugging Face Trainer evaluate your model during training is not available for PEFT models. I hope to find a workaround.



