LessWrong · July 22, 2024
Initial Experiments Using SAEs to Help Detect AI Generated Text

Published on July 22, 2024 5:16 AM GMT

This is an informal progress update with some preliminary results using Sparse Autoencoders (SAEs) to detect AI-generated text.

Summary

I would like reliable methods for distinguishing human and AI-generated text. Unfortunately, not only are the current best methods for doing this unreliable, they are also opaque: based on training large neural networks to do the classification for us. Fortunately, advances in AI interpretability, like Sparse Autoencoders (SAEs), have given us methods for better understanding large language model (LLM) internals. In this post, I describe some early experiments on using GPT-2-Small SAEs to find text features that differ between AI and human generated text. I have a few main results:

One major limitation is that I use a simple and non-adversarial setting — the AI generated text is all generated by GPT-3.5-Turbo and in response to similar prompts. On the other hand, the task of “classifying GPT-3.5-Turbo vs. human text” is off the training distribution for both GPT-2 and the SAEs, and it’s not a priori obvious that they would represent the requisite information. Working on this project has left me with the following high-level outlook about using SAEs for AI text detection:

Experiments

General Methods

Linear classifiers on SAE neurons perform better than reasonable baselines

To test how SAE-based classifiers compare to other classifiers for this task, I train simple logistic regression classifiers on a variety of data. In addition to comparing SAE and GPT-2 activations at all layers (Figure 1), I do more analysis at layer 8 (chosen for the high classification accuracy of GPT-2 activations in development). I include the following baselines:

I chose these baselines to get some mix of “obvious things to compare SAEs to” and to cover other linguistic feature methods which are human-understandable and have been found to be effective in previous work.[3] 
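As a concrete sketch of this setup, here is a minimal version of the logistic-regression-on-activations pipeline. The feature extraction step is omitted; the array names, shapes, and synthetic stand-in data are my assumptions for illustration, not the post's actual code.

```python
# Minimal sketch: logistic regression on per-sequence feature vectors
# (e.g., pooled layer-8 SAE activations). Data here is synthetic;
# labels are 1 = AI-generated, 0 = human.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 64))   # 200 sequences x 64 features
y_train = rng.integers(0, 2, size=200)
X_val = rng.normal(size=(100, 64))
y_val = rng.integers(0, 2, size=100)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc = accuracy_score(y_val, clf.predict(X_val))
auroc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
```

The same fit/evaluate loop applies to each baseline's feature matrix; only the features change.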

Figure 1

Except for the first few layers, classifiers on SAE activations slightly but consistently outperform classifiers on GPT-2 activations. Classifiers are fit on the train set; these results are on a held-out validation set not used during development. Results on the development test set, which motivated the focus on layer 8, are similar. Note the y-axis scale.

Table 1

Performance of different methods on held-out validation set (except Dev Test Acc which is from the test set used in development). GPT-2, SAE, SAE Reconstructed, and PCA results are all based on just the pre-layer 8 residual stream.

| Method | Accuracy | F1 Score | AUROC | Dev Test Acc | # Features |
| --- | --- | --- | --- | --- | --- |
| GPT-2 Activations | 0.924 | 0.924 | 0.978 | 0.919 | 768 |
| SAE Activations | 0.944 | 0.944 | 0.985 | 0.941 | 24,576 |
| SAE Reconstructed Acts | 0.918 | 0.919 | 0.975 | 0.908 | 768 |
| Ngram Count Full | 0.909 | 0.909 | 0.973 | 0.885 | 104,830 |
| Ngram Count Simple | 0.904 | 0.902 | 0.968 | 0.897 | 5,570 |
| TF-IDF Simple | 0.902 | 0.901 | 0.964 | 0.881 | 5,570 |
| Linguistic Features | 0.654 | 0.661 | 0.709 | 0.645 | 12 |
| Text Embedding 3 Small | 0.850 | 0.848 | 0.927 | 0.833 | 1,536 |
| pca_1 | 0.614 | 0.624 | 0.643 | 0.648 | 1 |
| pca_2 | 0.688 | 0.689 | 0.757 | 0.695 | 2 |
| pca_32 | 0.886 | 0.886 | 0.951 | 0.881 | 32 |

Using only a few SAE neurons still works well

Given that SAE activations are effective for classifying AI and human generated text in this setting, it is natural to ask whether using only a few SAE neurons will still work well. The ideal situation would be for a few neurons to be effective and interpretable, such that they could be replaced by human-understandable measures.

Starting with all 24,576 SAE neurons at layer 8, I narrow this down to a shortlist of SAE neurons that seem particularly good for classifying AI-generated vs. human-generated text. I apply a kitchen-sink style approach with various methods: logistic regression with selection of large weights (i.e., Adaptive Thresholding), the Mann-Whitney U test, ANOVA F-value, and more. That yields a shortlist of ~330 neurons, though this length is arbitrary, determined by hyperparameters set in an ad hoc manner. With a shortlist of this size, it is feasible to directly apply sklearn's RFE with a step size of 1 (Adaptive Thresholding is, as far as I can tell, RFE with a step size of 0.5*n, i.e., step=0.5): train a classifier on n features, eliminate the single worst feature by coefficient absolute value, and repeat until reaching the desired n. Here is the performance of classifiers trained for different top-n neurons:
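The elimination loop can be sketched directly with sklearn's RFE; the shortlist size and synthetic data below are placeholders for illustration, not the post's actual pipeline.

```python
# Sketch of recursive feature elimination with step=1: train a logistic
# regression, drop the single weakest feature by |coefficient|, repeat.
# Synthetic stand-in data: labels depend on two "real" features.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))           # 300 sequences, 30 shortlisted neurons
y = (X[:, 3] + X[:, 7] > 0).astype(int)  # signal lives in features 3 and 7

selector = RFE(LogisticRegression(max_iter=1000),
               n_features_to_select=4, step=1)
selector.fit(X, y)
top_neurons = np.flatnonzero(selector.support_)  # indices of surviving features
```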

Figure 2

Performance of classifiers trained on increasingly small sets of SAE neurons / "features", specifically area under the curve and accuracy. Top SAE neurons and their classifiers are fit on the train set; these results are on a held-out validation set not used during development. Results on the development test set are similar.

That’s exciting! Using only a few of the top neurons is still effective for classification.

What are these neurons doing?

One hypothesis for what’s going on is that these top SAE neurons correspond to specific ngrams whose frequency differs between the two texts. For instance, some people have observed that ChatGPT uses the term “delve” much more than human authors — maybe one of the SAE neurons is just picking up on the word “delve”, or this dataset’s equivalent. Some of my manual inspection of top-activating text samples also indicated SAE neurons may correspond closely to ngrams. If this hypothesis were true in the strongest form, it would mean SAEs are not useful for text analysis, as they could be totally replaced with the much easier method of ngram counting.

To test this hypothesis, I obtain ngram counts for the texts; in particular, I use CountVectorizer with an ensemble of a few parameter choices. This matters because using just analyzer='word' would miss important details, e.g., one common ngram in AI-generated text in my dataset is ". The" (sentences starting with "The"). I then compute correlation scores between each SAE neuron and each of the 104,830 ngrams in this vocabulary, and graph the highest correlation each SAE neuron has with any ngram (note this is an imprecise method because it compares sequence-level counts/means, while the underlying thing being predicted would be closer to the word or token level).
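A minimal sketch of this analysis is below, assuming an ensemble of CountVectorizer configurations so that punctuation-spanning ngrams like ". The" are captured; the exact parameter grid, the toy texts, and the stand-in neuron activations are my assumptions, not the post's.

```python
# Sketch: ensemble ngram counts, then per-ngram correlation with one SAE
# neuron's sequence-level mean activation; the max over ngrams is the
# per-neuron statistic plotted in Figure 3.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The cat sat on the mat. The dog ran.",
    "We delve into the data. The results follow.",
    "It was a quiet morning in the village.",
    "Overall, the findings delve into key themes.",
]
neuron_means = np.array([0.1, 0.8, 0.0, 0.9])  # stand-in per-sequence activations

vectorizers = [
    CountVectorizer(analyzer="word", ngram_range=(1, 2)),
    CountVectorizer(analyzer="char", ngram_range=(3, 4)),  # catches ". Th"-style ngrams
]
counts = hstack([v.fit_transform(texts) for v in vectorizers]).toarray().astype(float)

# Skip constant columns to avoid undefined correlations.
varying = np.flatnonzero(counts.std(axis=0) > 0)
corrs = np.array([np.corrcoef(counts[:, j], neuron_means)[0, 1] for j in varying])
max_corr = np.max(np.abs(corrs))
```

Here the "delve"-like columns track the stand-in activations closely, so the max correlation is high; a neuron whose max correlation stays low is less likely to be a simple ngram detector.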

Figure 3

Compared to all SAE neurons, the top-differentiating neurons have, on average, higher max correlations with an ngram; data is from the Train Set, y-axis is log scaled.

These top-differentiating neurons seem to be more ngram-related than average SAE neurons. Another test that bears on the relationship between SAE neurons and ngrams is to take the highest-correlating ngrams for a given SAE neuron and compare their classification performance against that of the SAE neuron itself. I do this for the top-16 differentiating SAE neurons. Rather than directly using the top-k correlating ngrams, I take the top 1024 and apply RFE for classification performance, so these results are not simply those of the top-k correlating ngrams. For many of these neurons, 8 highly correlating ngrams are enough to match their performance, and for a couple, around 70% accuracy can be achieved with just 2 ngrams. I also manually inspect a few of the top SAE neurons.

Generally, these top-differentiating SAE neurons don’t seem particularly easy to understand, but they are interesting. I didn’t put very much effort into interpreting them, nor did I use the best existing techniques, so I expect more time and effort would come up with better explanations. Here are some notes on the 4 best differentiating SAE neurons I find:

One of the goals of this project is to generate human-understandable information from SAE neurons such that we can do AI text detection in a more manual (or automated without AIs) and safe way. I therefore also run Adaptive Thresholding + RFE on ngrams — ngram counting is highly understandable and it would be convenient if a few ngrams could be effectively used for AI text detection.

Figure 4

Top ngram performance on held-out validation set. Using a small number of ngrams underperforms using a small number of SAE neurons, but it is still impressive.

Discussion

Motivation

AI text detection is an important problem

This topic has been discussed at length elsewhere, and I don’t have too much to add: detecting AI generated text and the “digital fingerprint” of AI agents seems beneficial for mitigating negative societal impacts in a world with AIs running around autonomously or being used by malicious actors.

Separately, I think AI text detection, or at least some subsets of it, may be useful for reducing large scale risks when AIs are being run under Control protocols (i.e., without Rogue Deployments). Two particular risks come to mind: collusion, and sandbagging + exploration hacking (which in some sense is a form of collusion).

Collusion could arise from multiple AI systems sharing malicious information, for example an AI that is being evaluated by other AIs trying to bargain with them. There are some collusion risks even without explicit communication, but AIs being able to reliably communicate with each other likely makes collusion easier. Therefore, if we can make it difficult for AIs to know whether they are talking to other AIs or to human red-teamers, plotting becomes harder. This is relevant to AI text detection from a somewhat counterintuitive angle: making AI-generated text indistinguishable from human text might actually be good here, as it prevents AIs from colluding via direct communication. Studying AI text detection, and the mechanisms AIs themselves use for AI text detection, could help prevent collusion risks.

Sandbagging is a risk where an AI strategically underperforms during evaluation to hide its true capabilities. Exploration hacking is a version of sandbagging where an AI is being trained in a non-supervised setting and is able to hide its capabilities by never taking high-performance actions during training. In a supervised setting, we should be able to disincentivize the cognition that caused the model to sandbag, because we have known high-performance trajectories that we can train on (i.e., push the model's cognition so it is more likely to pursue them). Successfully pulling off an exploration hacking strategy would require identifying whether you are in a supervised or RL setting. In particular, a model might try to identify whether a copy of it, some other model, or humans are responsible for creating the data it has seen so far in the problem/solution — if all the reasoning so far was human generated, this is more likely to be a supervised setting. Similar to collusion, understanding to what extent AIs can detect human vs. their own vs. other AI generated text may help mitigate this problem. Note that collusion and sandbagging risks may not actually occur in the real world; it's important to have empirical work to determine how likely they are so we can prioritize defensive resources.

It might be time for narrow “Microscope AI”

Microscope AI, as introduced by Chris Olah and discussed by Evan Hubinger, is an approach to AI safety in which we build powerful AI systems, but rather than directly running these AIs — which may pose various dangers — we extract knowledge they have learned about the world and directly apply that knowledge ourselves (or via some safe process). There are two large hurdles with Microscope AI: capabilities (we need AIs that have actually learned novel and interesting insights) and interpretability (we need to be able to extract that knowledge and transform it into human-understandable terms). For most domains or tasks I want to solve, I don't think the capabilities are there, and the interpretability tools are still in their early stages. However, some domains might be a good fit for narrow applications of this approach.

AI text detection is a particularly good area for this

There are a few reasons AI text detection looks like a good fit for trying Microscope AI:

Potential Next Steps

There are many ways to keep going on this project. In no particular order, I think some of the important ones are:

Considerations in working on this

I won’t continue working on this project, but if somebody else wants to push it forward and wants mentorship, I’m happy to discuss. Here are some reasons to be excited about this research direction:

But ultimately I won’t be working on this more. The main reasons:

Appendix

Random Notes:

  1. ^

     I do not directly compare against state-of-the-art techniques for AI text detection; I am quite confident they would perform better than SAEs.

  2. ^

     I briefly investigated using other classification techniques, specifically DecisionTree, RandomForest, SVC, and KNeighbors. None of these consistently performed better than LogisticRegression for any of the baselines I tried in preliminary testing, and they took longer to run, so I focused on LogisticRegression.

  3. ^

     I think the most obvious baselines I did not run would be fine-tuning RoBERTa, and using GLTR. The paper which introduces the dataset I use claims that various standard methods aren't particularly effective zero-shot (Table 2, Figure 4), but that fine-tuning classifiers is highly effective. These are not a 1-1 comparison to my experiments, as I use a small subset of the data, truncate sequences to be fairly short, and only focus on the "Generated" and human subsets.

  4. ^

    There are also plenty of negative results in using AIs for AI text detection. 



