Published on May 16, 2025 3:44 AM GMT
Status: Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code.
Main Idea:
Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely to reach any sort of optimum for steering tasks such as SAE-based unlearning or helpfulness steering. Inspired by the Bitter Lesson, I have decided to research gradient-based optimization of steering vectors. It should be possible to add trained components into SAEs that act on the latents. These trained components could learn optimal values and algorithms, and if we choose their structure carefully, they can retain the interpretable properties of the SAE latents themselves. I call these fine-tuning methods Interpretable Sparse Autoencoder Representation Fine-Tuning, or “ISaeRFT”.
Interpretable components that act on SAE latents contribute to a number of safety-relevant goals:
- Data Auditing
- By seeing which features change during training, it should be possible to check the data for harmful biases or data poisoning. I hope this method can help pin down the cause of Emergent Misalignment and identify when it occurs. A minimal sketch of such a check follows this list.
- If current SAE interpretation methods do a good job of reflecting latents’ actual meaning to the model, that should show up in which latents change during learning. For example, if you train a model on helpfulness text using ISaeRFT, but the trained components show changes in features whose labels look irrelevant while the features you would expect to change stay unchanged, that points to a problem with either the labeling method or the causal relevance of those features.
- ISaeRFT ties in nicely with reinforcement-learning-based methods for guiding model behavior, allowing us to understand what those methods taught the model. It can also be used to see what a dataset teaches a model.
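To make the auditing idea concrete, here is a minimal sketch of ranking trained per-feature scales by how far training moved them from their no-op value of 1, so the most-changed features can be looked up on Neuronpedia. It assumes the elementwise-scaling parameterization described later in the post; the file name and the top-20 cutoff are illustrative, not part of the repository:

```python
import torch

# Hypothetical: a trained per-latent scale vector, one entry per SAE feature,
# initialized at 1.0 (a no-op) before training.
scale = torch.load("trained_scale_vector.pt")

# Rank features by how far training moved them from 1.0.
deviation = (scale - 1.0).abs()
top_features = torch.argsort(deviation, descending=True)[:20]

for idx in top_features.tolist():
    print(f"feature {idx}: learned scale {scale[idx].item():.3f}")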
Project Update:
The code, though imperfect, works well enough that I encourage others to play around with it and see what you can discover.
The main features are:
- Because I integrated interpretable fine-tuning as a Huggingface PeftModel, it is now possible to use Huggingface trainers such as SFTTrainer and DPOTrainer to find optimal steering vectors.
- Inspired by IA3, I implemented a steering type in which the features are elementwise-multiplied by a learned vector, so each feature is scaled up or down. You’d expect good features to be scaled up and bad features to be scaled down; a minimal sketch of the idea follows this list.
- Through custom callbacks, you can track how the interpretable parameters change during training.
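To make the scaling concrete, here is a minimal sketch of an IA3-style component acting on SAE latents. The class and variable names are illustrative, not the actual IsaerftIA3 implementation:

```python
import torch
import torch.nn as nn

class LatentScaler(nn.Module):
    """Illustrative IA3-style component: elementwise scaling of SAE latents."""

    def __init__(self, num_latents: int):
        super().__init__()
        # Initialized to ones so the component starts out as a no-op.
        self.scale = nn.Parameter(torch.ones(num_latents))

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (batch, seq, num_latents) SAE feature activations.
        # Each feature is scaled up (>1) or down (<1) by its learned entry.
        return latents * self.scale
```

Because the vector starts at ones, training begins from the unmodified SAE, and each entry’s learned deviation from 1 reads directly as a per-feature boost or suppression.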
View My Runs:
While I’m writing this, I’m doing a hyperparameter sweep on a Gemma-2-2b Matryoshka SAE.
You can see the features being trained in real time here.
I’ve previously done some runs using non-Matryoshka SAEs, which you can see here. For this sweep, the parameter plots themselves have descriptions, making it convenient to search the panels for features that interest you. You can always refer to Neuronpedia for more information about the features.
How to Use:
I’ll write more detailed documentation soon. If you use my code and have questions, please open an issue on the repository.
Setup:
Clone the repository.
Install the dependencies from the requirements file with pip.
Review ia3_training_isaerft_disable_available.py to see how to optimize steering vectors.
src/export_wandb_data allows you to pull data from Weights & Biases to find which features are scaled up or down the most.
src/model_components contains the PyTorch components that will be placed into the sparse autoencoders to process the latents. At the moment, only IsaerftIA3 is supported.
Also in the model_components folder are IsaerftPeft and IsaerftConfig.
Example usage (from ia3_training_isaerft_disable_available.py)
```python
sae_release = "gemma-2-2b-res-matryoshka-dc"
model_sae_id = "blocks.20.hook_resid_post"

isaerft_config = IsaerftConfig(
    target_hooks=[
        (sae_release, model_sae_id),  # Match the SAE we loaded
    ],
    depth=None,
    ia3=True,
)

peft_model = IsaerftPeft(model, isaerft_config)
```
After this, peft_model can be used in the same way as a normal huggingface peft_model.[1]
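As a rough illustration of that, here is how the wrapped model might be handed to SFTTrainer; the dataset and training arguments are placeholders I chose for the example, not settings from the repository:

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset; substitute the text you want the steering parameters
# to be optimized against.
dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model=peft_model,  # the IsaerftPeft-wrapped model from the snippet above
    train_dataset=dataset,
    args=SFTConfig(output_dir="isaerft-sft", max_steps=500),
)
trainer.train()
```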
Next Steps:
- Hyperparameter search
- Usually, people use the default values for the Adam optimizer’s beta1 and beta2 hyperparameters. My method has the unusual property that gradients are sparse: there is only a gradient for feature i when feature i is active, and features activate sparsely by design. I suspect this means we want a larger beta2, to carry gradient information for longer. That may help prevent a feature from being boosted up and then, much later, pushed back down when the optimal value lies in between. A sketch of how to set this appears below.
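For example, with a Huggingface trainer the betas can be set through TrainingArguments; the numbers below are illustrative starting points for a sweep, not tuned results:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="isaerft-beta2-sweep",
    learning_rate=1e-3,  # illustrative value
    adam_beta1=0.9,      # the usual default
    adam_beta2=0.9999,   # larger than the 0.999 default, so the second-moment
                         # estimate decays more slowly between a feature's
                         # infrequent activations
)
```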
Conclusion:
If you find this interesting, please get in touch. I’m especially interested in hearing about research I may have missed which would benefit me to know.
- ^
Oddly, having a Huggingface trainer evaluate your model during training is not supported for PEFT models. I hope to find a workaround.