Published on August 5, 2024 1:00 PM GMT

I'm starting a new blog where I post my (subjective) highlights of AI safety paper each month, called "AI Safety at the Frontier". I've been doing this non-publicly for the last year, so I've backfilled highlights and collections up to September 2023.

My selection primarily covers ML-oriented research. It's only concerned with papers (arXiv, conferences etc.), not LessWrong or Alignment Forum posts. As such, it should be a nice addition for people primarily following the forum, who might otherwise miss outside research.

This is my most recent selection, covering July 2024.

tl;dr

Paper of the month:

AI safety benchmarks are often correlated with progress in LLM capabilities, so these will be solved “by default”, which opens the door to safetywashing.

Research highlights:

SAD, a new benchmark for situational awareness, which should act as warning sign for dangerous capabilities and deception.Robustness: The AgentDojo tool-use benchmark with adversaries, attacking vision-language models, and better latent adversarial training.Steering vectors are brittle, and truthfulness might live in a 2-dimensional subspace.How can we make LLM solutions not just accurate but also legible?More thorough evaluations of LLM debate.

⭐Paper of the month⭐

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Read the paper [CAIS, Berkeley, Stanford, Keio]

Some safety properties correlate strongly with model capabilities. If AI safety correlates with capabilities, it becomes vulnerable to safetywashing.

Historically, problems and research directions in AI safety were often identified by high-level philosophical arguments, especially in the context of transformative AI systems. The main goal of these arguments is to figure out which problems arise as models become more capable and which of these problems won’t be solved by regular capability improvements. Our paper of the month argues that this is not sufficient.

Instead of solely relying on potentially misleading “dubious intuitive arguments”, we should actively measure how much capability improvements are correlated with improvements on various measures of safety. This will then inform which safety-relevant topics are most neglected and need dedicated attention. This also allows research to carefully make progress only on safety properties, without pushing the capability frontier. If we are not careful in delineating these areas, we risk “Safetywashing”, in which capability improvements are publicly advertised as safety improvements, simply because the two are correlated.

These arguments are very much in line with CAIS’s earlier writing on Pragmatic AI Safety. However, the paper goes beyond arguments and provides actual measurements. It finds that alignment benchmarks (MT-Bench, LMSys Arena), ETHICS, bias benchmarks (e.g. BBQ Ambiguous), TruthfulQA, scalable oversight benchmarks (GPQA, QuALITY), adversarial robustness on language (ANLI, AdvGLUE), and natural adversarial robustness (ImageNet-A) are highly correlated. Calibration (on MMLU) is mixed and depends on the calibration measure. Power seeking (MACHIAVELLI), Sycophancy, jailbreaks (HarmBench), gradient-based image attacks, and negative weaponization capability (WMDP) are un- or negatively correlated. These results already point at some research directions that are more urgent to tackle than others, e.g. power seeking tendencies.

Overall, I love this approach of identifying important research directions. I don’t think it can be a fully substitute theoretical arguments or empirical findings, because it only works once we have proper benchmarks, models that can actually elicit the failure modes we’re interested in, and somewhat smooth capability progress. Still, it is a great sanity check and especially useful for preventing “safetywashing” of capability advances. I’d advocate for using this methodology especially at “AGI companies”, where research can easily drift into safetwashing.

Measuring Situational Awareness

The Situational Awareness Dataset (SAD) and its 7 task categories with examples.

One particularly safety-relevant capability of LLMs that might be correlated with general capability is situational awareness: The model’s knowledge of itself and its circumstances. This property might be important for LLMs to act agenticly and take actions in the real world, but such autonomy can be very risky. Situational awareness can furthermore allow the model to distinguish between evaluation and deployment. This would enable the LLM to deceptively follow an intended goal during evaluation and then switch to a very different, harmful goal during deployment. This behavior is known as deceptive alignment and it is one of the most pernicious failure modes of agents.

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs [independent, Constellation, MIT, Apollo] introduces the SAD benchmark to measure situational awareness via 7 task categories and 13k questions. The tasks measure if the LLMs know about themselves, if they can make inferences about their situation, and if they can take actions according to these inferences. The authors evaluated 16 LLMs, with the highest-scoring model being Claude 3.5 Sonnet, a rather dubious honor. You can find the latest results on their online leaderboard.

Robustness: The AgentDojo Benchmark, Vision-Language Models, and Better Latent Adversarial Training

While we’re on the topic of benchmarks, we also have a notable new one for adversarial robustness. AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents [ETHZ] introduces the AgentDojo benchmark, with results and instructions available at this website.

In this benchmark, LLMs have to solve a range of tasks with tool use, such as summarizing emails or navigating an e-banking website. These tasks are challenging by themselves, but AgentDojo additionally injects attacks into the environment. Similar to the real world, attacks happen via tools handling untrusted data, e.g. via emails that are adversarially crafted. The team’s goal is to establish this benchmark as an evolving framework, with new tasks and attacks to be added in the future.

To make this a truly dynamic benchmark, it would be great to extend it to attacks in addition to LLMs and defenses, in a competitive setup. It would certainly be interesting to see how hard it is for the community to elicit harmful agent behavior via adversarial attacks, e.g. data extraction.

When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? [Stanford, independent, Harvard, Anthropic, Constellation, MIT, Berkeley] reports somewhat mixed news for vision-language models (VLMs) on a similar front. Apparently, it is quite easy to jailbreak white-box VLMs with gradient-based attacks. These jailbreaks were universal for the attacked models, i.e. they generalized across prompts. However, they did not transfer between models, as opposed to recent findings in LLMs. Transfering attacks only works for the same model and initialization, for partially overlapping image post-training data or between checkpoints. Attacking ensembles of VLMs only increased transferability if the ensemble was sufficiently large and similar to the target VLM.

This finding makes me a bit more optimistic in the robustness of black-box VLMs with only API access. Since direct gradient-based attacks are not possible for these and there seems to be very little transfer between models, they might not be very vulnerable to realistic image-based attacks. Also, this seems to imply that VLMs are much more diverse than one might expect, at least when it comes to local perturbations.

Our third robustness paper extends latent adversarial training, which was introduced for LLMs in May. Previous methods used adversarial attacks that perturb latent embeddings to increase the target loss on training samples, i.e. steer the model away from desirable behavior. Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs [MATS, Astra, NYU, Anthropic, MIT] proposes to instead attack the model by decreasing the loss on harmful examples, i.e. steer to model toward undesirable behavior.

This approach seems to work slightly better than regular latent adversarial training for preventing jailbreaks, at the same low computational cost. The authors also demonstrate much better performance on removing backdoors than regular finetuning and better machine unlearning than e.g. gradiend ascent. However, these results currently lack a comparison with more recent methods, such as those highlighted in the last few months.

Steering Vectors and the Dimensionality of Truth

Distribution of vector steering effects. The considerable fraction of anti-steerable examples suggests that steering vectors often don’t work as intended.

Steering vectors are vectors in the latent embedding space of LLMs. They are created by taking the difference between LLM embeddings of pairs of statements. They are a surprisingly effective way of controlling the behavior of LLMs, even for high-level concepts, e.g. in order to talk more about weddings.

Analyzing the Generalization and Reliability of Steering Vectors [UCL, FAR] investigates the reliability and generalization of steering vectors. The authors find that in-distribution, the effect of steering vectors often has very high variance, sometimes even causing the opposite effect than intended. Furthermore, models often exhibit steerability bias, where they are especially steerable towards e.g. a certain answer position.

Steering vectors seem to generalize rather well out-of-distribution, but this breaks down for some concepts, where they become brittle to reasonable changes in the prompt. Overall, I think this unreliability of steering vectors is quite worrying and suggests that we need much more research on when they work and when they don’t, before we can use them effectively.

One particularly interesting kind of steering vector is the truthfulness direction. Recently, researchers have repeatedly tried to find a direction that tells us whether a model thinks a statement is true, either via supervised or unsupervised methods, with counter-evidence following promptly.

In Truth is Universal: Robust Detection of Lies in LLMs [Heidelberg, Weizmann], the authors argue that statement truthfulness lives not in a single direction but a two-dimensional subspace. One direction in this subspace specifies whether a statement is indeed true or not, while the other measures its “polarity”, i.e. whether it is negated or not. They thus find that if you train a supervised probe on a joint dataset of affirmative and negated statements, you will get a one-dimensional probe that generalizes well across datasets.

Personally, I’m rather sceptical. Negation just seems like one source of variance that you can capture with the right data, but how many others are there? The authors show some evidence that the dimensionality stays at 2 also with conjunctions and disjunctions, but what about other variations of e.g. prompt template, the language, the style, the topic, or the characters involved? I can easily see that all of these affect the truthfulness direction, as we’ve argued in our paper on unsupervised probes. These issues also seem quite related to the unreliability of steering vectors discussed by the paper above.

Improving the Legibility of LLM Solutions

LLM finetuning with legibility (checkability) training versus regular RLHF. RLHF sacrifices legibility but achieves higher final accuracy.

Chains of thought presented by LLMs are often not faithful to their actual answer, as discussed in May. Beyond that, LLM reasoning can also be hard to understand and check, which presents further barriers to using LLM’s external reasoning to verify its solution.

In Prover-Verifier Games improve legibility of LLM outputs [OpenAI], the authors call this property “legibility”, and show that regular RLHF improves solution accuracy but sacrifices legibility. Motivated by this, the paper proposes a prover-verifier game to counteract this effect. In this game, a small verifier is tasked with predicting whether a solution is correct. The solutions are generated by two provers: a helpful one that tries to generate correct and verifiable solutions and an adversarial one that tries to generate incorrect solutions that the verifier still classifies as correct.

The paper shows that this scheme indeed leads to more legible solutions by the helpful prover and increasing robustness of the verifier. The improved legibility also transfers to a time-constrained human, whose rating accuracy increases for the helpful prover and decreases for the adversarial prover.

This method presents an interesting avenue towards generating more legible solutions, which will become important as we scale supervision to more powerful models. It might even be important already, since users also need to understand the explanations presented by a model.

Unfortunately, the method incurs a “legibility tax”, meaning that the resulting model’s solutions are more legible but also less correct than the RLHF model’s solutions. Legibility also seems quite related to the faithfulness of LLM reasoning, which to me seems like the more pressing but perhaps less tractable issue.

More LLM Debate Evaluation

Investigated debate tasks and protocols.

Debate is a method that might allow humans to supervise super-human models. We’ve recently seen quite a few empirical investigations of debate, such as February’s paper of the month. That paper showed some positive results, for example that models arguing for the truthful side have an advantage. This was at least true in the setup of information asymmetry, where the debaters have access to ground-truth information that they can partially reveal to the judge.

On scalable oversight with weak LLMs judging strong LLMs [GDM] extends this evaluation to other tasks, protocols, models, and LLM-based judges. The authors find that debate outperforms consultancy. Debate also outperforms direct answers by the judge, but only in case of information asymmetry. The paper additionally investigates a setup where the debate model gets asigned the side it chose for its direct answer. This increases the accuracy of debate. Finally, stronger debater models modestly increase the judge accuracy, similar to previous work.

Discuss

tl;dr

⭐Paper of the month⭐

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Measuring Situational Awareness

Robustness: The AgentDojo Benchmark, Vision-Language Models, and Better Latent Adversarial Training

Steering Vectors and the Dimensionality of Truth

Improving the Legibility of LLM Solutions

More LLM Debate Evaluation

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签