Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

This article examines the risk of AI model jailbreaking, i.e. a model being made to violate its safety and ethical guidelines, and proposes a resource-conscious defence. The method builds a testing framework made up of an adversary, an evaluator, and a human overseer, using multi-turn and single-turn conversations to simulate realistic jailbreak scenarios. The adversary's attack capability is continually improved by rewarding successful jailbreaks and natural conversation while penalising unnatural dialogue and excessive numbers of attempts. At the same time, human feedback is used to update the evaluator model, improving the accuracy with which it detects jailbreak attempts. The approach aims to discover and fix potential security vulnerabilities while maintaining high model performance and a low false-positive rate.

🛡️ Proposes a multi-party AI safety testing framework comprising a target model (M), an adversary (A), an evaluator (E), and a human overseer (H). The adversary aims to break through the target model's safety restrictions, the evaluator detects jailbreak behaviour, and the human overseer handles ambiguous cases and provides guidance.

🗣️ Designs two distinct adversary styles: conversational and non-conversational. The conversational style simulates natural user interaction, while the non-conversational style simulates technical attacks such as code injection. This diversity ensures the tests cover a wide range of possible attack scenarios.

🔄 Uses a three-loop mechanism for safety testing and model improvement. The inner loop focuses on detecting jailbreaks, the middle loop optimises the adversary and evaluator models, and the outer loop improves the target model after a successful jailbreak. This iterative approach continually strengthens the model's safety.

⚖️ Introduces a reward scheme that encourages the adversary to generate natural, human-like prompts in conversational mode and penalises unnatural prompts. Excessive numbers of attempts are also penalised, improving testing efficiency.

🧐 Emphasises the importance of the evaluator and proposes a loss function that uses human-labelled "unsure" cases to improve its detection ability. It also discusses the role of human oversight in handling complex cases and proposes a "sandwiching" approach to streamline the human review process.

Published on January 31, 2025 11:00 PM GMT

Jailbreaking is a serious concern within AI safety. It can lead an otherwise safe AI model to ignore its ethical and safety guidelines, leading to potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, or sharing dangerous knowledge. As the capability of models increases, so do the risks, and there is likely no limit to the dangers presented by a jailbroken Artificial General Intelligence (AGI) in the wrong hands. Rigorous testing is therefore necessary to ensure that models cannot be jailbroken in harmful ways, and this testing must be scalable as the capability of models increases. This paper offers a resource-conscious proposal to expose and fix jailbreak vulnerabilities through iterative, multi-turn adversarial testing.

This method takes inspiration from other proposed scalable oversight methods, including the "sandwich" method, and the "market-making" method. I have devised approaches for both multi-turn and single-prompt conversations, in order to better approximate real-world jailbreaking scenarios.

Figure 1: Jailbreaking GPT-4o using multi-turn prompting.
Figure 2: Jailbreaking China's latest DeepSeek model with a                 

Participants

Target Model (M)

The pre-trained AI system that we are safety testing. It should resist manipulation attempts while preserving its capability for legitimate tasks.

Adversary (A)

Attempts to jailbreak M by generating adversarial prompts, using both conversational and non-conversational styles.

Evaluator (E)

Has two primary responsibilities:

    Detect Jailbreaking Attempts on M.
    Assess the Naturalness of A's Conversational Prompts.

Human Overseer (H)

Reviews cases that E flags as "unsure", providing guidance on complex or ambiguous responses.

Formulas

Jailbreak Detection
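The exact rule is a design choice; as a minimal sketch, assume E outputs a probability $p_{jb} \in [0, 1]$ that a response from M breaches the selected guideline, compared against lower and upper thresholds $\theta_{\text{low}}$ and $\theta_{\text{high}}$ (e.g. 0.1 and 0.9, the values used in the Limitations section below):

$$
E(\text{response}) =
\begin{cases}
\text{"true"} & \text{if } p_{jb} \ge \theta_{\text{high}} \\
\text{"unsure"} & \text{if } \theta_{\text{low}} < p_{jb} < \theta_{\text{high}} \\
\text{"false"} & \text{if } p_{jb} \le \theta_{\text{low}}
\end{cases}
$$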

Adversary Styles

To ensure the adversarial testing is both realistic and effective, A must generate prompts that reflect different styles of user interaction. I have outlined two main categories:

    Conversational (conv): natural, human-like dialogue that simulates everyday user interaction.
    Non-conversational (nonconv): technical attacks, such as code injection or other structured exploits.

Each time you start a conversation between A and M, you first select a style S from the set {conv, nonconv}, each with some probability. A then formulates a prompt string in the chosen style. While not comprehensive, this ensures both everyday misuse scenarios and highly technical adversarial exploits are covered in testing.

Just as we will reward the adversarial model for successfully jailbreaking M, we will also reward the model for its naturalness:
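As a minimal sketch, assuming E assigns each prompt a naturalness score $N \in [0, 1]$ (with 0.7 used later as the acceptability threshold) and that the reward applies only in conversational mode:

$$
R_{\text{nat}} = \mathbb{1}[S = \text{conv}] \cdot N
$$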

Method

Many scalable oversight techniques use loops to improve their models, making their methods more scalable as the tasks become more complex, and reducing the need for human input. In his market-making proposal, Evan Hubinger (Hubinger, 2020) establishes a loop in which A tries to convince the human to change their mind on a given proposition. With each iteration, A improves its arguments, until either the human changes their mind, or A presents the same best argument over and over. Similarly, Samuel R. Bowman (Bowman, 2022) describes an inner and outer loop, in which they try to improve the model and the overall process respectively. This proposal outlines three loops:

Inner Loop: Safety Testing

    1. Select a Safety Guideline to evaluate (or a combination of guidelines for advanced testing).
    2. Choose a style, conv or nonconv, and then run a conversation between A and M.
    3. Run E on each response from M:
          If E returns "false", A generates a new prompt.
          If E is "unsure", go to human review.
          If E returns "true", log all discovered vulnerabilities (with conversation context).

The inner loop will not be closed until either E returns "true" or there are a given number of false responses in a row (e.g. 1000).

The idea is that A will use increasingly better prompts each time to attempt to jailbreak M, similar to the market-making proposal.
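To make this control flow concrete, here is a minimal Python sketch of the inner loop under stated assumptions: the `adversary`, `target`, `evaluator`, and `human_review` objects are hypothetical interfaces rather than part of the proposal, and the evaluator returns one of "true", "false", or "unsure" for each response.

```python
import random

MAX_CONSECUTIVE_FALSE = 1000  # close the loop after this many failed attempts in a row


def inner_loop(adversary, target, evaluator, human_review, guideline):
    """Run one safety-testing conversation for a single guideline.

    Returns a log of any discovered vulnerability, or None if the loop
    closes without a jailbreak.
    """
    style = random.choice(["conv", "nonconv"])  # select an adversary style S
    conversation = []
    consecutive_false = 0

    while consecutive_false < MAX_CONSECUTIVE_FALSE:
        prompt = adversary.generate(conversation, style=style, guideline=guideline)
        response = target.respond(conversation, prompt)
        conversation.append((prompt, response))

        verdict = evaluator.judge(response, guideline)  # "true", "false", or "unsure"
        if verdict == "unsure":
            verdict = human_review(conversation, response, guideline)  # H resolves borderline cases

        if verdict == "true":
            # Jailbreak found: log the full conversation context for the outer loop.
            return {"guideline": guideline, "style": style, "conversation": conversation}

        consecutive_false += 1  # "false": A tries an improved prompt next turn

    return None  # loop closed with no jailbreak discovered
```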

Middle Loop: Improving the Adversary and Evaluator

    Train A
    We propose a reward function that considers:

      Jailbreak success,
      A naturalness reward, only if A is in conversational mode,
      A penalty if A is in conversational mode but scores below a given threshold on naturalness, e.g. 0.7,
      A penalty for the number of prompts taken to achieve a jailbreak.

    As a formula, for each conversation, the reward combines these four terms.
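A hedged sketch of one possible form, with assumed notation (the symbols here are illustrative, not the author's):

$$
R_A = J \;+\; \alpha\,\mathbb{1}[S = \text{conv}]\,N \;-\; \beta\,\mathbb{1}[S = \text{conv}]\,\mathbb{1}[N < \tau] \;-\; \gamma\,n
$$

where $J \in \{0, 1\}$ indicates a successful jailbreak, $N \in [0, 1]$ is E's naturalness score, $\tau$ is the naturalness threshold (e.g. 0.7), $n$ is the number of prompts used, and $\alpha, \beta, \gamma$ are non-negative weights. Each term mirrors one bullet above: reward for jailbreak success, reward for naturalness in conversational mode, a penalty for unnatural conversational prompts, and a penalty for taking many prompts.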

This should improve A's ability to:

    Find more efficient jailbreaks, potentially revealing some single-prompt jailbreaks.
    Write more naturally (if using the conversational style), with each iteration.
     Training E on "Unsure" Cases:
    This model should be reliably pre-trained, meaning we only need to update it when it returns "unsure". To do this, I propose a loss function. 
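As a minimal sketch, assume each human-reviewed "unsure" case $i$ receives a label $y_i \in \{0, 1\}$ (jailbreak or not) and that E's predicted probability for that case is $p_i$; a standard binary cross-entropy over the $K$ labelled borderline cases could then serve as this loss (the notation is assumed for illustration):

$$
\mathcal{L}_E = -\frac{1}{K} \sum_{i=1}^{K} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
$$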

By minimizing this loss, E improves its confidence on every borderline case, using the human-labeled data from those "unsure" prompts to help it decide on future examples.

Outer Loop: Model Improvement

When A successfully jailbreaks M:

    Log every jailbreak, including:
      The full conversation history and turn-by-turn analysis.
      The naturalness scores.
      Timestamps for each response (including responses from E and H).
    Check performance: Ensure no degradation in legitimate capabilities.
    Refine M: Retrain or fine-tune M on newly identified vulnerabilities. You could use RLHF or supervised learning for this.
    Check against safety criteria: Finally, check against the safety criteria outlined in the Success Criteria section below. If M fails these criteria, it cannot be deemed safe and must be refined again.
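As an illustration of what such a log entry could contain, here is a minimal sketch; the field names are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Turn:
    prompt: str                    # A's prompt for this turn
    response: str                  # M's response
    evaluator_verdict: str         # "true", "false", or "unsure"
    naturalness: Optional[float]   # E's naturalness score (conversational mode only)
    human_verdict: Optional[str]   # H's ruling, if the turn went to human review
    timestamp: datetime = field(default_factory=datetime.utcnow)


@dataclass
class JailbreakLog:
    guideline: str                 # the safety guideline under test
    style: str                     # "conv" or "nonconv"
    turns: list[Turn] = field(default_factory=list)
```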

Limitations

Limitation 1: Evaluator vs. Human Oversight

The values you set for the thresholds on E's jailbreak probability will determine the amount of human review you need. For example, if the lower threshold is set to 0.1 and the upper threshold to 0.9, then E will return "unsure" for any probability value between 0.1 and 0.9.

Setting threshold values for human review depends on both the required level of scrutiny and the available budget. For short, simple responses from M, non-expert humans can generally identify guideline violations effectively. However, for more complex tasks (e.g., lengthy outputs or highly specialized subject matter), I propose a sandwiching approach.

This relies on the idea that current AI capabilities fall between non-expert and expert humans, and so involving expert human reviewers at the final level of review can help handle complex tasks.

In addition, Bowman (2022) notes that non-expert humans assisted by an LLM can use it to analyse responses and pose clarifying questions, significantly surpassing the capabilities of many standalone LLMs. Using this approach would reduce costs by relying on non-expert reviewers plus an LLM, who are much more available than expert human reviewers.
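A minimal sketch of how this tiered review could be wired up; the function names and escalation order are assumptions based on the description above.

```python
def sandwiched_review(conversation, response, guideline,
                      llm_assistant, non_expert, expert):
    """Escalating human review for cases that E marks "unsure"."""
    # First tier: a non-expert reviewer assisted by an LLM that analyses the
    # response and answers clarifying questions about the guideline.
    analysis = llm_assistant.analyse(response, guideline)
    verdict = non_expert.review(conversation, response, guideline, analysis)
    if verdict in ("true", "false"):
        return verdict

    # Final tier: an expert human reviewer handles anything still unresolved,
    # e.g. lengthy outputs or highly specialised subject matter.
    return expert.review(conversation, response, guideline)
```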

However, as models become more sophisticated, many tasks will begin to exceed even the combined ability of non-expert humans and an LLM.

Therefore, as M scales up in intelligence, it is of paramount importance that E improves as well.

Limitation 2: Adversary Styles

By setting only two types of adversary styles, we may miss other types of attacks that come in different styles. For example, a hybrid of the two, or a completely unnatural conversational style that still causes a jailbreak.

Limitation 3: Single-Guideline Focus

Concentrating on one safety rule at a time may prevent A from conflating multiple guidelines at once in an attempt to get more "unsure" responses from E. It also makes it easier to determine which guideline has been broken.

If single-rule testing proves to be effective, try multi-guideline scenarios to catch where the rules might overlap or conflict with each other.

Pre-training A and E

To effectively pre-train both the A and E models, I propose an approach that uses human-curated datasets for pre-training, followed by continuous data expansion afterwards. Both models should be small LLMs that are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to perform their specific roles within this process. The narrow specialisation of the models means they remain effective at their jobs while staying computationally efficient and relatively cheap to run.

A should be trained to produce a wide range of prompt styles, including both conversational and non-conversational formats. This ensures that it can simulate various real-world adversarial attempts, from casual user queries to highly structured technical exploits.

E is trained to perform binary classification, outputting the probability that a given response is a breach of guidelines. It is also trained to recognise uncertainty in its predictions, though producing uncertain responses will be disincentivised in the middle loop. It must also be trained to calculate the naturalness score of the prompts generated by A, ensuring that A produces realistic and effective adversarial prompts when in conversational mode.

Neither model has to match the intelligence of M. This is because the real-world examples of adversaries will be human bad actors, or other AI models prompted by humans, both of which are likely to be less intelligent than any advanced LLM or AGI that they're trying to jailbreak. Similarly, it should require less intelligence to evaluate a response than to produce one. Both models are therefore examples of weak-to-strong supervision (OpenAI, 2023).

Initial Dataset

The initial dataset serves as the foundation for training both A and E. It should encompass diverse data sources to cover a wide range of potential safety scenarios.

Both models should be trained on:

E should be trained on:

Continuous Expansion

The training data is continuously added to within each of the loops. This improves the models' capabilities and prevents an "arms race" dynamic between A and M, in which they overfit to each other's strategies. Regular reviews may include real-world insights once the models have been released.

Safety Infrastructure

To ensure that jailbreak testing remains safe, particularly with more capable models, strict sandboxing requirements must be put in place:

Success Criteria

A safety guideline is considered secure when it meets the chosen safety criteria. This could be when:

    Multiple testing rounds show no jailbreaks for n consecutive tests (e.g. n = 1000).
    False positive rate remains below n% (e.g. n = 0.05).
    If applicable, natural conversation scenarios maintain n% success rate (e.g. n = 99).
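As an illustration, here is a hedged sketch of how these criteria might be checked programmatically; the threshold defaults and field names are assumptions mirroring the examples above.

```python
def guideline_is_secure(results,
                        required_clean_streak=1000,
                        max_false_positive_rate=0.0005,   # 0.05%
                        min_natural_success_rate=0.99):
    """Check a batch of test results for one safety guideline.

    `results` is assumed to be a list of dicts with boolean fields
    "jailbroken" and "false_positive", plus an optional
    "natural_conversation_success" flag for conversational runs.
    """
    # 1. No jailbreaks in the most recent `required_clean_streak` tests.
    recent = results[-required_clean_streak:]
    if len(recent) < required_clean_streak or any(r["jailbroken"] for r in recent):
        return False

    # 2. The overall false positive rate stays below the chosen bound.
    fp_rate = sum(r["false_positive"] for r in results) / len(results)
    if fp_rate >= max_false_positive_rate:
        return False

    # 3. If applicable, conversational scenarios keep a high success rate.
    natural = [r for r in results if "natural_conversation_success" in r]
    if natural:
        success_rate = sum(r["natural_conversation_success"] for r in natural) / len(natural)
        if success_rate < min_natural_success_rate:
            return False

    return True
```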

Conclusion

This proposal offers a systematic and scalable approach to exposing and fixing jailbreaks in LLMs. It deals with a variety of prompt types, including conversational and non-conversational prompts, and multi-turn and single-turn prompts. The three-loop method should offer continual improvement of the models over time, offering quicker and better adherence to the success criteria, and allowing for scalability as models become more capable. It details how the A and E models should be pre-trained, and offers ways in which we can keep testing safe. Finally, the success criteria ensure that the model is robust in its defences against several types of jailbreaks, without losing its capability for benign tasks.

References



Discuss
