Are Sparse Autoencoders a good idea for AI control?

Published on December 26, 2024 5:34 PM GMT

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, unsure of how to proceed. Not enough experience with multi-month sized projects to judge for feasibility.

I'm looking for some feedback. Specifically I want feedback regarding my current implementation. The statement "SAEs could be useful for AI Control" is obvious enough.

Working hypothesis

In general, AI control, is a good idea, and it's actually being done in production. In the sense of monitoring training/inference of an AI system with another (independent) AI system.

red team strategy

Setup:

Gemma 2b must check whether or not the reasoning is deceptive.

I've already check and the 2 models are capable of doing this.

GOAL: If we provide internal information to the supervisor (important SAE latents in the supervised model) does the deception detection rate increases?

Current status of the project:

Finetuned a SAE on deceptive/non deceptive reasoning traces from Gemma 9bLangchain pipeline for providing the Supervisor model transcripts of the reasoning as well as some of the most important features.

TODO

[ ] Label a subset of the features

[ ] Measure baseline performance of the supervisor

[ ] Provide the supervisor with mind reading information

Roadblocks:

There's no obvious answer to how much can we bias the feature selection process to make sure the information provided to the controller is useful, but not to biased.Lack of compute

Discuss

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签