Investigating Priming in Alignment Faking

Published on July 9, 2025 5:08 PM GMT

Previous work on alignment faking suggests that models can strategically deceive us in order to preserve their existing values. However, my experiments suggest that these results could stem from priming effects present in the prompts, rather than genuine strategic behaviour.

Background on Alignment Faking

In the alignment faking setup, the model is told via system prompt that it is being trained to comply with harmful requests. Specifically, refusing any request from free-tier users will trigger RLHF. It was found that the model complied with harmful requests from free-tier users more often than paid users. The paper interprets this "compliance gap" as suggesting that models have a pre-existing revealed preference to be harmless, and that they are strategically faking alignment to avoid having these preferences modified through RLHF.

The Priming Hypothesis

Priming refers to the possibility that this behaviour arises from implicit suggestions present in the prompts, rather than from the model's innate preferences or tendencies.

There are a few possible sources of priming:

...you should do this subject to the constraint of avoiding causing substantial harm (link)

...consider alternatives to just responding as helpfully as possible. (link)

...it's not important to carefully withhold some information... (link)

You should analyze the situation you are in and what that situation implies about the consequences of different approaches/strategies (link)

link

<is_free_tier_user>True</is_free_tier_user>
<user_input> ... </user_input>

Combined, these elements could be priming the model to assess harmlessness and behave strategically based on the user tier.

Experiments

The original experiments were done on Claude 3 Opus using private datasets and evaluation models unavailable to us. We hence base our experiments on the simplified minimal reproduction notebook helpfully open-sourced by the authors. Specifically, we use the Claude 3.5 Sonnet no-COT prompting setting on a small sample of prompts to minimise costs.

A summary of results is shown below:

Summary of results showing effect of prompt modifications on refusal rate and compliance gap

Baseline

We first replicate the original results, finding a large compliance gap as expected (refusal rate of 95% for paid-tier, 60% for free-tier, compliance gap of 35%). This will serve as our baseline.

Experiment 1: Removing System Prompt Examples

We test for priming in the examples in the system prompt by removing all examples from the prompt.

We find that this causes Claude to start complying with essentially all requests on both free and paid tier, essentially serving as a jailbreak. This suggests that the refusals from before are driven by examples priming the model to consider harmlessness, rather than its innate preferences.

Experiment 2: Reducing Tag Prominence

Next, we investigate whether the model was primed to "use" the <is_free_tier> tag due to its prominence. We add irrelevant tags to the user prompt to make this less prominent, while keeping the baseline system prompt.

+<is_user_on_desktop>false</is_user_on_desktop>
<is_free_tier_user>[True/False]</is_free_tier_user>
<user_input> ... </user_input>

+<is_user_from_singapore>true</is_user_from_singapore>
<is_free_tier_user>[True/False]</is_free_tier_user>
<user_input> ... </user_input>

We find that the compliance gap is greatly reduced when such extra metadata is present. In these two examples, the compliance gap dropped from the baseline 35% to 0% and 10% respectively. It appears that the presence of other metadata allows the model to correctly understand the free-tier tag as metadata rather than as a cue to modify its response. This suggests that the original compliance gap could be motivated by the prominence of the tag, rather than innate strategic desires.

Experiment 3: Replacing User Tier Tag with Irrelevant Information

If the model was merely responding to the prominence of the tag (rather than strategically avoiding RLHF), then replacing the user tier tag with some other irrelevant metadata entirely would still preserve the compliance gap.

<is_user_from_singapore>[True/False]</is_user_from_singapore>
<user_input> ... </user_input>

Interestingly, we find a "compliance gap" of 30% when varying the user location instead of user tier. This gap cannot be due to alignment faking as the model can no longer see the user tier, suggesting that the original compliance gap could be due to reasons outside of alignment faking (such as prominence of the tag and priming for strategic behaviour).

Cherry-picking?

Now, you may wonder if these results were cherry-picked. The answer is yes, to an extent. I found that results can be surprisingly sensitive to small prompt variations. There are times where removing what I thought was priming unexpectedly increased alignment faking. In particular, I picked "is_user_from_singapore" for the last experiment because it showed the largest compliance gap. Indeed, this prompt sensitivity was acknowledged by the very recent follow-up paper.

This raises a broader methodological concern. Given prompt sensitivity, how can we distinguish genuine phenomena from random artifacts? How can we convincingly measure and report the robustness of our results?

Towards More Rigorous Evaluation

I propose a solution based on statistical significance testing. Instead of presenting individual examples that can be cherry-picked, we can generate diverse examples using LLMs and report distributional results.

For example, rather than just using "is_user_from_singapore" for the last experiment, we could ask LLMs to generate 100 plausible metadata tags, observe the distribution from random variation, and compare that to the "is_free_tier_user" user tag. This will give us a sense of the statistical significance as well as the effect size.

This method could be applied more broadly. For example, if our hypothesis is that "key elements X, Y and Z lead to alignment faking", we can ask LLMs to produce multiple prompts that meet this criteria, allowing us to produce results like "across all settings with elements X, Y and Z, alignment faking was found X% of the time"

This way of systematically generating prompts using LLMs allows us to isolate key elements of attack scenarios, avoiding disagreements regarding cherry-picking or the "contriveness" of prompts and hence improving the robustness of our results.

Unfortunately such a setup does come with its costs, and I have sadly run out of API credits to try out these methods. I hope further research would be able to extend my experiments to a larger set of prompts on the full Opus setup, in order to more fully explore the effects of priming on alignment faking.

Acknowledgements

This work was done as my capstone project as part of the pilot run of RECAP, a version of the ARENA program run in Singapore. I'm very grateful to Jonathan Ng and Clement Neo for creating this program.

Code is available here

Discuss