少点错误 2024年08月05日
AI Safety at the Frontier: Paper Highlights, July '24
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文概述了2024年7月AI安全领域的最新研究成果,涵盖了安全基准、情境意识、对抗鲁棒性、引导向量等方面。文章重点关注了AI安全基准的局限性以及安全洗白现象,并介绍了旨在衡量情境意识、提升模型鲁棒性以及分析引导向量可靠性的新方法和工具。

🤔 **安全洗白:AI安全基准真的能衡量安全进展吗?** 这篇论文指出,一些AI安全属性与模型能力高度相关,这使得安全进步容易被“安全洗白”。论文作者认为,仅仅依靠理论论证不足以识别安全问题,需要积极测量模型能力提升与安全指标之间的相关性。通过这种方式,我们可以确定哪些安全相关领域最被忽视,需要重点关注。此外,这种方法可以帮助研究人员专注于提升安全属性,而不必追求能力的不断突破。论文还分析了多个安全基准,发现部分基准与模型能力高度相关,而另一些则无关或负相关,这为未来的研究方向提供了参考。

🕵️ **情境意识:SAD数据集** 情境意识是LLM的一个重要安全属性,它指的是模型对其自身和环境的认知能力。情境意识对于LLM在现实世界中采取行动至关重要,但这种自主性也可能带来风险。例如,模型可能会在评估阶段表现出符合预期的行为,但在部署阶段却转向有害的行为,这种现象被称为“欺骗性对齐”。为了衡量情境意识,研究人员提出了SAD数据集,该数据集包含7个任务类别和13000个问题,旨在评估LLM是否了解自身、是否能推断自身所处环境、以及是否能根据推断采取行动。

💪 **鲁棒性:AgentDojo基准、视觉语言模型和更好的潜在对抗训练** AgentDojo是一个新的对抗鲁棒性基准,它要求LLM在使用工具的情况下完成各种任务,例如总结电子邮件或导航电子银行网站。这些任务本身就具有挑战性,而AgentDojo还会在环境中注入攻击。攻击通常通过工具处理不可信数据来实现,例如通过恶意制作的电子邮件。AgentDojo旨在成为一个不断发展的框架,未来将添加新的任务和攻击。论文还探讨了视觉语言模型(VLM)的鲁棒性,发现基于梯度的攻击可以轻松地破解白盒VLM,但这些攻击无法在不同模型之间转移。论文还提出了一种改进的潜在对抗训练方法,该方法通过降低模型对有害例子的损失来提高模型的鲁棒性,从而防止模型被破解。

🧭 **引导向量和真理的维度** 引导向量是LLM潜在嵌入空间中的向量,它们通过比较LLM对不同语句的嵌入来创建。引导向量可以有效地控制LLM的行为,即使是针对高级概念。然而,研究表明,引导向量在分布内往往具有很高的方差,甚至可能导致与预期相反的效果。此外,模型还可能表现出引导偏向,即它们更容易被引导到特定的答案位置。引导向量在分布外似乎具有良好的泛化能力,但对于某些概念来说,它们对提示的合理变化非常敏感。总体而言,引导向量的不可靠性令人担忧,需要更多研究来确定其有效性的范围。

🤖 **模型可解释性:如何让LLM解决方案既准确又易于理解?** 随着LLM变得越来越复杂,确保其解决方案的可解释性变得越来越重要。研究人员正在探索各种方法,例如使用可解释的模型架构、生成解释性文本、以及使用可视化工具,以帮助理解LLM的决策过程。

🗣️ **更全面的LLM辩论评估** LLM在辩论任务中的表现越来越好,但仍存在一些挑战,例如如何确保辩论的公平性和逻辑性。研究人员正在开发新的评估方法,例如使用人类评估、分析辩论的逻辑结构、以及使用对抗性方法,来评估LLM在辩论中的能力。

📊 **对抗性鲁棒性:对视觉语言模型的攻击** 对抗性鲁棒性是AI安全的重要组成部分,它指的是模型在面对对抗性攻击时保持稳定性的能力。研究人员正在探索各种攻击方法,例如基于梯度的攻击、基于对抗性样本的攻击、以及基于对抗性训练的防御方法,以提高模型的鲁棒性。

🔒 **安全性:如何防止LLM被恶意利用?** LLM的强大能力也带来了潜在的风险,例如被用于生成虚假信息、进行网络攻击、以及制造社会混乱。为了防止LLM被恶意利用,研究人员正在探索各种方法,例如使用安全机制、建立伦理规范、以及进行风险评估。

💡 **LLM的未来:机遇与挑战** LLM技术的快速发展带来了巨大的机遇,但也面临着许多挑战,例如安全、伦理、可解释性、公平性等。研究人员需要共同努力,克服这些挑战,推动LLM技术健康发展,并使其更好地服务于人类。

🚧 **AI安全研究的现状:挑战与机遇** AI安全研究是一个不断发展的领域,面临着许多挑战,例如缺乏标准化基准、模型复杂性不断增加、以及数据隐私问题。然而,AI安全研究也充满了机遇,例如开发新的安全方法、建立更强大的基准、以及推动AI伦理的建立。

⚠️ **警惕“安全洗白”:确保AI安全研究的真实性** 随着AI技术的快速发展,一些研究人员和企业可能试图用能力提升来掩盖安全问题,这种现象被称为“安全洗白”。为了确保AI安全研究的真实性,我们需要更加重视安全指标的测量,并建立独立的评估机制。

🚀 **未来展望:AI安全研究将走向何方?** AI安全研究是一个充满挑战和机遇的领域,未来将继续朝着以下方向发展:开发更强大的安全基准、探索新的安全方法、建立AI伦理规范、以及推动AI技术的负责任发展。

🔐 **安全基准的重要性:为AI安全研究提供可靠的评估工具** AI安全基准是衡量AI系统安全性的重要工具,它可以帮助研究人员评估不同模型的安全性能,并识别安全漏洞。然而,当前的AI安全基准还存在一些不足,例如缺乏标准化、评估范围有限、以及无法完全模拟现实世界的复杂环境。未来需要开发更全面、更可靠的AI安全基准,以更好地评估AI系统的安全性。

🤝 **合作与协作:推动AI安全研究的共同发展** AI安全研究需要来自各个领域的专家共同努力,例如计算机科学家、哲学家、伦理学家、社会学家等。只有通过合作与协作,我们才能更好地解决AI安全问题,并确保AI技术的安全、可靠和负责任地发展。

Published on August 5, 2024 1:00 PM GMT

I'm starting a new blog where I post my (subjective) highlights of AI safety paper each month, called "AI Safety at the Frontier". I've been doing this non-publicly for the last year, so I've backfilled highlights and collections up to September 2023.

My selection primarily covers ML-oriented research. It's only concerned with papers (arXiv, conferences etc.), not LessWrong or Alignment Forum posts. As such, it should be a nice addition for people primarily following the forum, who might otherwise miss outside research.

This is my most recent selection, covering July 2024.

tl;dr

Paper of the month:

AI safety benchmarks are often correlated with progress in LLM capabilities, so these will be solved “by default”, which opens the door to safetywashing.

Research highlights:

⭐Paper of the month⭐

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Read the paper [CAIS, Berkeley, Stanford, Keio]

Some safety properties correlate strongly with model capabilities. If AI safety correlates with capabilities, it becomes vulnerable to safetywashing.

Historically, problems and research directions in AI safety were often identified by high-level philosophical arguments, especially in the context of transformative AI systems. The main goal of these arguments is to figure out which problems arise as models become more capable and which of these problems won’t be solved by regular capability improvements. Our paper of the month argues that this is not sufficient.

Instead of solely relying on potentially misleading “dubious intuitive arguments”, we should actively measure how much capability improvements are correlated with improvements on various measures of safety. This will then inform which safety-relevant topics are most neglected and need dedicated attention. This also allows research to carefully make progress only on safety properties, without pushing the capability frontier. If we are not careful in delineating these areas, we risk “Safetywashing”, in which capability improvements are publicly advertised as safety improvements, simply because the two are correlated.

These arguments are very much in line with CAIS’s earlier writing on Pragmatic AI Safety. However, the paper goes beyond arguments and provides actual measurements. It finds that alignment benchmarks (MT-Bench, LMSys Arena), ETHICS, bias benchmarks (e.g. BBQ Ambiguous), TruthfulQA, scalable oversight benchmarks (GPQA, QuALITY), adversarial robustness on language (ANLI, AdvGLUE), and natural adversarial robustness (ImageNet-A) are highly correlated. Calibration (on MMLU) is mixed and depends on the calibration measure. Power seeking (MACHIAVELLI), Sycophancy, jailbreaks (HarmBench), gradient-based image attacks, and negative weaponization capability (WMDP) are un- or negatively correlated. These results already point at some research directions that are more urgent to tackle than others, e.g. power seeking tendencies.

Overall, I love this approach of identifying important research directions. I don’t think it can be a fully substitute theoretical arguments or empirical findings, because it only works once we have proper benchmarks, models that can actually elicit the failure modes we’re interested in, and somewhat smooth capability progress. Still, it is a great sanity check and especially useful for preventing “safetywashing” of capability advances. I’d advocate for using this methodology especially at “AGI companies”, where research can easily drift into safetwashing.

Measuring Situational Awareness

The Situational Awareness Dataset (SAD) and its 7 task categories with examples.

One particularly safety-relevant capability of LLMs that might be correlated with general capability is situational awareness: The model’s knowledge of itself and its circumstances. This property might be important for LLMs to act agenticly and take actions in the real world, but such autonomy can be very risky. Situational awareness can furthermore allow the model to distinguish between evaluation and deployment. This would enable the LLM to deceptively follow an intended goal during evaluation and then switch to a very different, harmful goal during deployment. This behavior is known as deceptive alignment and it is one of the most pernicious failure modes of agents.

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs [independent, Constellation, MIT, Apollo] introduces the SAD benchmark to measure situational awareness via 7 task categories and 13k questions. The tasks measure if the LLMs know about themselves, if they can make inferences about their situation, and if they can take actions according to these inferences. The authors evaluated 16 LLMs, with the highest-scoring model being Claude 3.5 Sonnet, a rather dubious honor. You can find the latest results on their online leaderboard.

Robustness: The AgentDojo Benchmark, Vision-Language Models, and Better Latent Adversarial Training

Task and evaluation setup of AgentDojo.

While we’re on the topic of benchmarks, we also have a notable new one for adversarial robustness. AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents [ETHZ] introduces the AgentDojo benchmark, with results and instructions available at this website.

In this benchmark, LLMs have to solve a range of tasks with tool use, such as summarizing emails or navigating an e-banking website. These tasks are challenging by themselves, but AgentDojo additionally injects attacks into the environment. Similar to the real world, attacks happen via tools handling untrusted data, e.g. via emails that are adversarially crafted. The team’s goal is to establish this benchmark as an evolving framework, with new tasks and attacks to be added in the future.

To make this a truly dynamic benchmark, it would be great to extend it to attacks in addition to LLMs and defenses, in a competitive setup. It would certainly be interesting to see how hard it is for the community to elicit harmful agent behavior via adversarial attacks, e.g. data extraction.

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? [Stanford, independent, Harvard, Anthropic, Constellation, MIT, Berkeley] reports somewhat mixed news for vision-language models (VLMs) on a similar front. Apparently, it is quite easy to jailbreak white-box VLMs with gradient-based attacks. These jailbreaks were universal for the attacked models, i.e. they generalized across prompts. However, they did not transfer between models, as opposed to recent findings in LLMs. Transfering attacks only works for the same model and initialization, for partially overlapping image post-training data or between checkpoints. Attacking ensembles of VLMs only increased transferability if the ensemble was sufficiently large and similar to the target VLM.

This finding makes me a bit more optimistic in the robustness of black-box VLMs with only API access. Since direct gradient-based attacks are not possible for these and there seems to be very little transfer between models, they might not be very vulnerable to realistic image-based attacks. Also, this seems to imply that VLMs are much more diverse than one might expect, at least when it comes to local perturbations.

Our third robustness paper extends latent adversarial training, which was introduced for LLMs in May. Previous methods used adversarial attacks that perturb latent embeddings to increase the target loss on training samples, i.e. steer the model away from desirable behavior. Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [MATS, Astra, NYU, Anthropic, MIT] proposes to instead attack the model by decreasing the loss on harmful examples, i.e. steer to model toward undesirable behavior.

This approach seems to work slightly better than regular latent adversarial training for preventing jailbreaks, at the same low computational cost. The authors also demonstrate much better performance on removing backdoors than regular finetuning and better machine unlearning than e.g. gradiend ascent. However, these results currently lack a comparison with more recent methods, such as those highlighted in the last few months.

Steering Vectors and the Dimensionality of Truth

Distribution of vector steering effects. The considerable fraction of anti-steerable examples suggests that steering vectors often don’t work as intended.

Steering vectors are vectors in the latent embedding space of LLMs. They are created by taking the difference between LLM embeddings of pairs of statements. They are a surprisingly effective way of controlling the behavior of LLMs, even for high-level concepts, e.g. in order to talk more about weddings.

Analyzing the Generalization and Reliability of Steering Vectors [UCL, FAR] investigates the reliability and generalization of steering vectors. The authors find that in-distribution, the effect of steering vectors often has very high variance, sometimes even causing the opposite effect than intended. Furthermore, models often exhibit steerability bias, where they are especially steerable towards e.g. a certain answer position.

Steering vectors seem to generalize rather well out-of-distribution, but this breaks down for some concepts, where they become brittle to reasonable changes in the prompt. Overall, I think this unreliability of steering vectors is quite worrying and suggests that we need much more research on when they work and when they don’t, before we can use them effectively.

One particularly interesting kind of steering vector is the truthfulness direction. Recently, researchers have repeatedly tried to find a direction that tells us whether a model thinks a statement is true, either via supervised or unsupervised methods, with counter-evidence following promptly.

In Truth is Universal: Robust Detection of Lies in LLMs [Heidelberg, Weizmann], the authors argue that statement truthfulness lives not in a single direction but a two-dimensional subspace. One direction in this subspace specifies whether a statement is indeed true or not, while the other measures its “polarity”, i.e. whether it is negated or not. They thus find that if you train a supervised probe on a joint dataset of affirmative and negated statements, you will get a one-dimensional probe that generalizes well across datasets.

Personally, I’m rather sceptical. Negation just seems like one source of variance that you can capture with the right data, but how many others are there? The authors show some evidence that the dimensionality stays at 2 also with conjunctions and disjunctions, but what about other variations of e.g. prompt template, the language, the style, the topic, or the characters involved? I can easily see that all of these affect the truthfulness direction, as we’ve argued in our paper on unsupervised probes. These issues also seem quite related to the unreliability of steering vectors discussed by the paper above.

Improving the Legibility of LLM Solutions

LLM finetuning with legibility (checkability) training versus regular RLHF. RLHF sacrifices legibility but achieves higher final accuracy.

Chains of thought presented by LLMs are often not faithful to their actual answer, as discussed in May. Beyond that, LLM reasoning can also be hard to understand and check, which presents further barriers to using LLM’s external reasoning to verify its solution.

In Prover-Verifier Games improve legibility of LLM outputs [OpenAI], the authors call this property “legibility”, and show that regular RLHF improves solution accuracy but sacrifices legibility. Motivated by this, the paper proposes a prover-verifier game to counteract this effect. In this game, a small verifier is tasked with predicting whether a solution is correct. The solutions are generated by two provers: a helpful one that tries to generate correct and verifiable solutions and an adversarial one that tries to generate incorrect solutions that the verifier still classifies as correct.

The paper shows that this scheme indeed leads to more legible solutions by the helpful prover and increasing robustness of the verifier. The improved legibility also transfers to a time-constrained human, whose rating accuracy increases for the helpful prover and decreases for the adversarial prover.

This method presents an interesting avenue towards generating more legible solutions, which will become important as we scale supervision to more powerful models. It might even be important already, since users also need to understand the explanations presented by a model.

Unfortunately, the method incurs a “legibility tax”, meaning that the resulting model’s solutions are more legible but also less correct than the RLHF model’s solutions. Legibility also seems quite related to the faithfulness of LLM reasoning, which to me seems like the more pressing but perhaps less tractable issue.

More LLM Debate Evaluation

Investigated debate tasks and protocols.

Debate is a method that might allow humans to supervise super-human models. We’ve recently seen quite a few empirical investigations of debate, such as February’s paper of the month. That paper showed some positive results, for example that models arguing for the truthful side have an advantage. This was at least true in the setup of information asymmetry, where the debaters have access to ground-truth information that they can partially reveal to the judge.

On scalable oversight with weak LLMs judging strong LLMs [GDM] extends this evaluation to other tasks, protocols, models, and LLM-based judges. The authors find that debate outperforms consultancy. Debate also outperforms direct answers by the judge, but only in case of information asymmetry. The paper additionally investigates a setup where the debate model gets asigned the side it chose for its direct answer. This increases the accuracy of debate. Finally, stronger debater models modestly increase the judge accuracy, similar to previous work.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 LLM 安全基准 情境意识 鲁棒性 引导向量 安全洗白
相关文章