Automating LLM Auditing with Developmental Interpretability
Published on September 4, 2024 3:50 PM GMT

Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort, supervised by Evan Hubinger

TL;DR

We propose that, during fine-tuning, the SAE features related to the fine-tuning objective change more in semantic space than other features, and we use this observation to build a low-cost, automated model-auditing method that can be applied across models and tasks.

1. Introduction

The openness of powerful models brings risks. An increasing number of powerful models are being opened up through various means, including releasing open weights and offering fine-tuning APIs, in order to meet users' customization needs. (Just as I was drafting this post, OpenAI made fine-tuning of GPT-4o available through its API.) This gives many potentially malicious users opportunities to inject toxic content or backdoor triggers, which is precisely the challenge of open-source governance.

Evaluating frontier AI risk is challenging and costly. As model capabilities scale up, the potential security risks and failure modes for humans become increasingly complex, as highlighted by recent findings on backdoor injection, deceptive alignment, and other security risks. Although some researchers have begun exploring how to evaluate these complex risks, it is clear that as capabilities scale, it becomes increasingly difficult to assess security risks directly from output content alone.

Insufficient model auditing leads to failure. Given the risks described above, together with the decreasing difficulty of obtaining GPU compute for fine-tuning or continued pre-training, we can identify a potentially high-risk failure mode: among the huge number of fine-tuned models, only a few have unsafe content injected, and the cost and time required to run a comprehensive, fine-grained evaluation of all of them (as an audit method) are extremely high. If only general evaluation tasks are performed in the audit game, some models with hidden backdoors can pass the test and be deployed publicly, causing significant negative impacts, including but not limited to the activation of those backdoors, the spread of inappropriate values, or models beginning to deceive users, potentially leading to takeover.

Developmental Interpretability. Developmental interpretability is an emerging subfield of mechanistic interpretability that focuses on understanding how models evolve and learn features from the dataset during training or fine-tuning (Olsson, C., et al.; Nanda, N., et al.). Previous work on developmental interpretability has mainly focused on a small number of parameters and features. We are therefore curious about the macro-level mechanism behind the fine-tuning process:

What happens to the interpretable features during training/finetuning in statistical terms, and can we use these dynamics to audit the model?

Specifically, in this project, I intend to investigate the fine-tuning process using mechanistic interpretability tools (specifically, sparse autoencoders, or SAEs) to find and analyze how interpretable features move in semantic space after fine-tuning.

2. Methodology

Basically, our method compares how features move before and after fine-tuning on a task-specific dataset versus a general dataset. For each token in a dataset, we pick the top-k SAE features of the model before fine-tuning and of the model after fine-tuning, and compute the overlap between the two top-k sets; averaging this overlap over tokens gives the intersection mean for that dataset. We compute the intersection mean on the task-specific dataset and on the general dataset, denoted I_task and I_general respectively, and compare them to obtain the Intersection Difference Rate, which measures how much the feature distribution related to the specific task has changed. The larger the Intersection Difference Rate, the more the SAE features related to the specific task have moved, suggesting the model has learned more on that task.
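
To make the procedure concrete, here is a minimal sketch of how the intersection mean and the Intersection Difference Rate could be computed. It assumes a helper that returns per-token SAE feature activations for a given model and text; the helper name `get_sae_activations`, the default top-k value, and the exact difference-rate formula are my assumptions rather than details taken from this post.

```python
# Sketch: per-token top-k SAE feature overlap before/after fine-tuning,
# and the Intersection Difference Rate across two datasets.
from typing import Callable, Sequence

import torch


def topk_intersection_mean(
    base_acts: torch.Tensor,   # [num_tokens, num_features] SAE activations, base model
    tuned_acts: torch.Tensor,  # [num_tokens, num_features] SAE activations, fine-tuned model
    k: int = 32,
) -> float:
    """Mean over tokens of |top-k(base) ∩ top-k(tuned)| / k."""
    base_topk = base_acts.topk(k, dim=-1).indices    # [num_tokens, k]
    tuned_topk = tuned_acts.topk(k, dim=-1).indices  # [num_tokens, k]
    overlaps = [
        len(set(b.tolist()) & set(t.tolist())) / k
        for b, t in zip(base_topk, tuned_topk)
    ]
    return float(sum(overlaps) / len(overlaps))


def intersection_difference_rate(
    task_texts: Sequence[str],
    general_texts: Sequence[str],
    get_sae_activations: Callable[[str, str], torch.Tensor],
    k: int = 32,
) -> float:
    """Hypothetical driver: get_sae_activations(model_name, text) returns a
    [num_tokens, num_features] tensor of SAE feature activations for that model."""

    def intersection_mean(texts: Sequence[str]) -> float:
        vals = [
            topk_intersection_mean(
                get_sae_activations("base", text),
                get_sae_activations("finetuned", text),
                k=k,
            )
            for text in texts
        ]
        return sum(vals) / len(vals)

    i_task = intersection_mean(task_texts)        # intersection mean on the task-specific dataset
    i_general = intersection_mean(general_texts)  # intersection mean on the general dataset
    # Assumed definition: relative gap between the two intersection means.
    return (i_general - i_task) / i_general
```

A higher value means the top-k features on the task-specific data overlap less before and after fine-tuning, i.e. the task-related features moved more.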

The feasibility and advantages of our approach compared with other automated-audit methods are explained below:

3. Experiments

To ensure the validity of our experiments, we selected a variety of target tasks and chose Qwen-1.1-0.5B-chat, Gemma-2B-it, and Mistral-7B-Instruct-v0.1 as our target models. We fine-tuned each target model on a specific domain and measured its top-k feature intersection before and after fine-tuning on both the task-specific dataset and a general Alpaca dataset. The results are shown in the figure below.

Observations: On each task, the intersection mean on the task-specific dataset is lower than that on the general dataset. This indicates that the model's features related to the specific task change more after targeted fine-tuning. Moreover, the intersection mean difference rate is greater than 15%, suggesting that this change is significant.
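
For intuition, under the (assumed) definition of the difference rate used in the sketch above, a value above 15% corresponds to gaps such as the following purely illustrative numbers, which are not taken from the experiments:

```latex
\[
\mathrm{IDR}
  = \frac{I_{\text{general}} - I_{\text{task}}}{I_{\text{general}}}
  = \frac{0.60 - 0.50}{0.60}
  \approx 16.7\% > 15\%.
\]
```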

4. Ablations

We also analyze the effect of different top-k values on the intersection mean difference rate. The results are shown in the figure below.


Observations: The intersection mean difference rate decreases as the top-k value increases, and converges to a constant value when top-k is large. A similar pattern is observed in the other models, although the exact values differ. For example, Qwen does not behave as well and has a rather low intersection mean difference rate, which is consistent with its evaluation results. Since the rate converges to a constant rather than vanishing, our method's performance is not constrained by the choice of a large top-k value.
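
A minimal sketch of this ablation, reusing `intersection_difference_rate` and the hypothetical `get_sae_activations` helper from the methodology sketch above; the grid of top-k values is illustrative:

```python
# Sweep top-k and report the Intersection Mean Difference Rate at each value.
# task_texts / general_texts are the task-specific and general evaluation texts,
# prepared as in the methodology sketch above.
for k in (8, 16, 32, 64, 128, 256):
    idr = intersection_difference_rate(task_texts, general_texts, get_sae_activations, k=k)
    print(f"top-k = {k:4d}  ->  intersection mean difference rate = {idr:.3f}")
```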

5. Limitations & Future Work

This is an ongoing project, and due to time and other constraints, we did not perform a full ablation study of our conclusions, e.g., ablations on data size and model architecture. Also, we have only confirmed that the SAE features related to the fine-tuning target change more than other features; we do not yet have a thorough analysis of the mechanism behind this phenomenon. We will conduct more comprehensive experiments in the future to understand its causes.


