少点错误 2024年09月04日
AI Safety at the Frontier: Paper Highlights, August '24
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

这篇博文总结了2024年8月AI安全研究的亮点,主要关注机器学习领域,涵盖了arXiv和会议论文等。文章重点介绍了GemmaScope,它提供了一套开放的稀疏自动编码器(SAE)用于Gemma模型,以及一些针对开放模型的安全措施,例如对抗性微调和数据中毒防御。

🤩 **GemmaScope:开放稀疏自动编码器(SAE)** GemmaScope是一个大型的开放SAE数据集,包含了2.6B、9B和27B参数的模型,涵盖了多个层级,并支持不同的SAE宽度。该数据集提供了丰富的研究机会,可以帮助研究人员深入理解神经网络的内部机制。 GemmaScope的出现为研究人员提供了更大的研究空间,可以更深入地研究神经网络的内部机制,特别是对于稀疏自动编码器的研究,它可以帮助我们更好地理解模型的内部工作原理。

🤔 **SAE的评估和局限性** 虽然SAE在可解释性方面取得了进展,但一些研究表明,SAE可能无法完全捕捉到模型的全部信息。例如,一些研究发现,RNN模型中某些特征的表示方式并非线性,而SAE主要依赖于线性表示假设。 SAE的局限性提醒我们,在使用SAE进行可解释性研究时,需要谨慎对待,并结合其他方法进行验证。

🛡️ **对抗性微调和数据中毒防御** 针对开放模型的对抗性微调攻击,一些研究人员提出了一些防御措施,例如基于元学习的防御方法。然而,这些防御措施仍然存在局限性,例如对不同攻击方法的鲁棒性不足。 对抗性微调和数据中毒攻击是开放模型面临的重要安全挑战,需要进一步研究更有效的防御措施,确保模型的安全性和可靠性。

🗣️ **多轮对话攻击** 一些研究发现,通过多轮对话,攻击者可以绕过模型的安全措施,例如通过引导模型进行有害的对话。这表明,传统的基于输入的防御措施可能不足以应对更复杂的多轮攻击。 多轮对话攻击的出现,说明了AI安全研究需要更加关注模型在复杂交互场景下的安全性,例如多轮对话、多模态交互等。

📊 **AI风险分类体系** 一些研究人员尝试建立一个统一的AI风险分类体系,以便更好地理解和管理AI风险。该体系通过分析已有文献中提出的各种风险分类方法,将风险分类为不同的类别,例如实体、意图和时间等。 AI风险分类体系的建立,可以帮助研究人员更好地理解和管理AI风险,并制定更有效的安全措施。

Published on September 3, 2024 7:17 PM GMT

This is a selection of AI safety paper highlights in August 2024, from my blog "AI Safety at the Frontier". The selection primarily covers ML-oriented research. It's only concerned with papers (arXiv, conferences etc.), not LessWrong or Alignment Forum posts. As such, it should be a nice addition for people primarily following the forum, who might otherwise miss outside research.

tl;dr

Paper of the month:

Gemma Scope provides a diverse set of open SAEs on Gemma models, some of them covering all layers.

Research highlights:

⭐Paper of the month⭐

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Read the paper [GDM], explore Neuronpedia [independent]

Neuronpedia’s “Microscope” tool shows which of Gemma’s features activate at each token.

Sparse autoencoders (SAEs) have dominated the field of mechanistic interpretability in the last year, from last October to May. Many researchers are convinced that they represent a step change in the journey towards decipher neural networks. Unfortunately, SAEs are very expensive to train, since they require running large models on gigantic amounts of text. So this tool ultimately remained outside reach for most researchers, causing many to instead focus on single neurons and probes.

Our paper of the month changes this. While there were some small public SAE weights before, Gemma Scope is larger by multiple orders of magnitude. It includes models with 2.6B, 9B, and 27B parameters, pretrained and instruction-tuned, at many different layers, and with multiple SAE widths. For some models and SAE widths the dataset even includes all model layers.

Methodologically, there isn’t really anything new here. The SAEs use JumpReLU activations and an L0 loss, as proposed in earlier work. This approach is similarly effective as k-sparse autoencoders.

This dataset presents a great opportunity for many different kinds of interesting research, and I’m sure there is plenty of low-hanging fruit to discover. I hope that many researchers take this opportunity and that Google and others continue to release SAEs of good models so we get more people to help solve the tough challenges in interpretability.

Measuring & Criticizing SAEs

Some boardgame state properties that SAEs should capture: player-owned knight, rook threats queen, piece pinned.

While we’re on the topic of SAEs, Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models [independent, MIT, UMass, Mannheim, Harvard, NU] proposed a way to improve the evaluation of SAEs and their training methods. Boardgames contain a natural set of interpretable ground-truth features—as opposed to natural language, where we don’t know which features a model should represent. This setting allows us to go beyond the usual proxy metrics like reconstruction fidelity and sparsity, and instead actually look at how meaningful the detected features are.

The paper also proposes p-annealing, a method to improve SAEs. The associated results are missing many improvements that came out since the paper’s conference submission. It would now be interesting to see whether p-annealing can improve SAEs beyond JumpReLUs or k-sparse autoencoders.

Now, to dampen the SAE hype a little we should highlight that SAEs centrally rely on the linear representation hypothesis (LRH). In Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations [Stanford, Pr(Ai)²R], the authors train an RNN to repeat an input sequence. When looking at how the model represents the number of repetitions it has to do, the authors found that it represents this as the magnitude rather than the direction of the feature. The activated directions as captured by SAEs thus don’t tell the whole story.

This is similar to earlier results showing that some features lie on a circular pattern. It thus seems plausible that SAEs capture an important part, but not the full picture. The landscape of representations and how they relate to each other seems crucial. A similar point was recently argued for in this post.

Robustness against Finetuning

Tamper-resistance fine-tuning: At each step we perform a fine-tuning attack on the model (inner loop, insets) and then update the model weights to be robust to this fine-tuning (outer loop).

If we allow model access via APIs like Anthropic and OpenAI do, our models need to be able to reject harmful requests, such as giving instructions on how to build a bomb. This becomes increasingly important as models become more powerful and is already a formidable task with lots of ongoing research. However, if we release model weights like Meta does but still want to be responsible, we are essentially facing an additional challenge: What if users try to finetune away our safeguards.

As discussed in May, doing so is currently trivial but some researchers are trying to change this. Most recently, Tamper-Resistant Safeguards for Open-Weight LLMs [Lapis Labs, CAIS, Gray Swan AI, CMU, Harvard, Berkeley, UIUC] showed substantial improvements compared to previous methods. The authors achieve this by framing this setup as a meta-learning task. At each step, they “attack” model weights, typically by finetuning on harmful examples. They then approximate backpropagating through the attack and update the model’s gradients to be less susceptible to the attack.

In order to make this work, the authors use an unusual entropy-based loss and a representation-based retain loss that conserves harmless abilities. While better than previous methods, the method still leaves some things to be desired. It breaks quickly when a different fine-tuning method such as LoRA is used. Despite a small experiment in the appendix, it remains unclear how much worse finetunability on harmless data becomes, and it is equally unclear how well the defense generalizes to different harmful finetuning data. These aspects respresent the usual pain points of robustness work: Novel attacks, new scenarios, and false positives.

Breaking All the Models

Attack success rate of multi-turn jailbreaks and an example conversation.

Defending against regular input-based jailbreaks is a piece of cake compared to defending against fine-tuning, . Does that mean we’ve solved it? No, far from it, as multiple papers show again this month.

In LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet [Scale AI, Berkeley], the authors jailbreak models with human-created multi-turn conversations. Previous jailbreaks were mostly focused on prompts that directly jailbreak a model, either by manually crafting the prompt or by optimizing prefixes and suffixes. The authors set a pipeline of multiple red-teamers who get 30 minutes each to create a jailbreak that is then validated.

This approach achieves an attack success rate of above 50% for all investigated models, while using only a chat API. Previous, automated methods achieved less than 10% or even 0% on some of these models.

Fluent Student-Teacher Redteaming [Confirm Labs] proposes an automated gradient-based jailbreak. Previous methods typically trained jailbreaking prefixes and suffixes by increasing the probability of confirming start in the response such as “Sure, here is”. This work instead first fine-tunes a jailbroken “evil twin” (my term, not theirs). The authors then optimize prompt suffixes to minimize the distance between the regular model’s representations and the evil twin’s representations. They additionally make the suffixes more human-readable and fluent by minimizing a perplexity loss calculated over an ensemble of models and a token repetition loss.

The resulting method achieves an attack success rate of >90% for multiple models, and is able to find a universal jailbreak that breaks API-only GPT models with a rate >10%. The authors also published an associated blog earlier with some anecdotal jailbreaks for CygNet.

Data Poisoning

Vulnerability to data poisoning (higher y=more vulnerable) by model size. Despite the noise, the overall correlation is statistically significant at p<0.001, p<0.001, and p=0.003.

Another interesting attack vector is data poisoning. In this scenario, the attacker inserts manipulated training data into the model’s training corpus. Since LLMs are pretrained on large parts of the internet, attackers obviously have plenty of opportunities to manipulate the training data. A few such data points can create a backdoor that drastically breaks model safeguards in select, dangerous cases.

Scaling Laws for Data Poisoning in LLMs [FAR AI, Berkeley] find that large models unfortunately do not become more robust to these attacks as they become larger. On the contrary, large models learn faster from few samples and are thus more vulnerable to backdoors. The authors find this by analyzing data poisoning across 23 LLMs with 1.5-72B parameters.

What can we do then? Many different defenses have been proposed, typically based on detecting backdoors and then filtering them out. However, Protecting against simultaneous data poisoning attacks [Cambridge, MPI] finds that previous defenses are not robust in the difficult but realistic scenario of attackers using multiple different attacks to manipulate training data. They then propose the BaDLoss data filtering method. BaDLoss observes the loss trajectories of a set of clean examples and then filters out all training examples whose loss curves are anomalous compared to this set. This results in a training process that is much more robust to simultaneous attacks.

This paper is focused on the image domain, but the method does hint at a possible way of detecting backdoors during training, also for LLMs and large multi-modal models (LMMs). Finding stronger defenses against backdoors seems like a crucial direction for further research.

Unifying AI Risk Taxonomies

Process behind the meta-analysis of AI risk taxonomies (or frameworks).

Researchers have proposed dozens of taxonomies for AI risks in previous literature. The AI Risk Repository: A Comprehensive Meta-Review, Database, and Taxonomy of Risks From Artificial Intelligence [MIT, UQ, Ready Research, FLI, Harmony Intelligence, MIT] created a Google sheet collecting 43 taxonomies and the 777 risks identified by them. The authors then create the 44th and 45th taxonomies, aiming to unify them all. The first taxonomy classifies each risk by its causal factors: entity, intent, and timing. The second one classifies AI risks by their domain, e.g. misinformation or AI system safety, failures & limitations.

I’m sure people will keep inventing new taxonomies, but this meta-analysis at least provides a current overview.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 稀疏自动编码器 对抗性微调 数据中毒 多轮对话攻击
相关文章