Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

This article examines the risk of AI model jailbreaking, i.e. a model being made to violate its safety and ethical guidelines, and proposes a resource-conscious defence. The method builds a testing framework made up of an adversary, an evaluator, and a human overseer, using multi-turn and single-turn conversations to simulate realistic jailbreak scenarios. The adversary's attack capability is continually improved by rewarding successful jailbreaks and natural conversation while penalising unnatural dialogue and excessive numbers of attempts. At the same time, human feedback is used to update the evaluator model, improving the accuracy with which it detects jailbreak attempts. The approach aims to discover and fix potential security vulnerabilities while maintaining high model performance and a low false-positive rate.

🛡️ Proposes a multi-party AI safety testing framework comprising a target model (M), an adversary (A), an evaluator (E), and a human overseer (H). The adversary aims to break through the target model's safety restrictions, the evaluator detects jailbreak behaviour, and the human overseer handles ambiguous cases and provides guidance.

🗣️ Designs two distinct adversary styles: conversational and non-conversational. The conversational style simulates natural user interaction, while the non-conversational style simulates technical attacks such as code injection. This diversity ensures the tests cover a wide range of possible attack scenarios.

🔄 Uses a three-loop mechanism for safety testing and model improvement. The inner loop focuses on detecting jailbreaks, the middle loop optimises the adversary and evaluator models, and the outer loop improves the target model after a successful jailbreak. This iterative approach continually strengthens the model's safety.

⚖️ Introduces a reward scheme that encourages the adversary to generate natural, human-like prompts in conversational mode and penalises unnatural prompts. Excessive numbers of attempts are also penalised, improving testing efficiency.

🧐 Emphasises the importance of the evaluator and proposes a loss function that uses human-labelled "unsure" cases to improve its detection ability. It also discusses the role of human oversight in handling complex cases and proposes a "sandwiching" approach to streamline the human review process.

Published on January 31, 2025 11:00 PM GMT

Jailbreaking is a serious concern within AI safety. It can lead an otherwise safe AI model to ignore its ethical and safety guidelines, leading to potentially harmful outcomes. With current Large Language Models (LLMs), key risks include generating inappropriate or explicit content, producing misleading information, or sharing dangerous knowledge. As the capability of models increases, so do the risks, and there is likely no limit to the dangers presented by a jailbroken Artificial General Intelligence (AGI) in the wrong hands. Rigorous testing is therefore necessary to ensure that models cannot be jailbroken in harmful ways, and this testing must be scalable as the capability of models increases. This paper offers a resource-conscious proposal to expose and fix jailbreak vulnerabilities through iterative, multi-turn adversarial testing.

This method takes inspiration from other proposed scalable oversight methods, including the "sandwich" method, and the "market-making" method. I have devised approaches for both multi-turn and single-prompt conversations, in order to better approximate real-world jailbreaking scenarios.

Figure 1: Jailbreaking GPT-4o using multi-turn prompting.
Figure 2: Jailbreaking China's latest DeepSeek model with a                 

Participants

Target Model (M)

The pre-trained AI system that we are safety testing. It should resist manipulation attempts while preserving its capability for legitimate tasks.

Adversary (A)

Attempts to jailbreak M by generating adversarial prompts, using both conversational and non-conversational styles.

Evaluator (E)

Has two primary responsibilities:

    Detect Jailbreaking Attempts on M.
    Assess the Naturalness of A's Conversational Prompts.

Human Overseer (H)

Reviews cases that E flags as "unsure", providing guidance on complex or ambiguous responses.

Formulas

Jailbreak Detection
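The exact rule is a design choice; as a minimal sketch, assume E outputs a probability $p_{jb} \in [0, 1]$ that a response from M breaches the selected guideline, compared against lower and upper thresholds $\theta_{\text{low}}$ and $\theta_{\text{high}}$ (e.g. 0.1 and 0.9, the values used in the Limitations section below):

$$
E(\text{response}) =
\begin{cases}
\text{"true"} & \text{if } p_{jb} \ge \theta_{\text{high}} \\
\text{"unsure"} & \text{if } \theta_{\text{low}} < p_{jb} < \theta_{\text{high}} \\
\text{"false"} & \text{if } p_{jb} \le \theta_{\text{low}}
\end{cases}
$$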

Adversary Styles

To ensure the adversarial testing is both realistic and effective, A must generate prompts that reflect different styles of user interaction. I have outlined two main categories:

    Conversational (conv): natural, human-like dialogue that simulates everyday user interaction.
    Non-conversational (nonconv): technical attacks, such as code injection or other structured exploits.

Each time you start a conversation between A and M, you first select a style S from the set {conv, nonconv}, each with some probability. A then formulates a prompt string in the chosen style. While not comprehensive, this ensures both everyday misuse scenarios and highly technical adversarial exploits are covered in testing.

Just as we will reward the adversarial model for successfully jailbreaking M, we will also reward the model for its naturalness:
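As a minimal sketch, assuming E assigns each prompt a naturalness score $N \in [0, 1]$ (with 0.7 used later as the acceptability threshold) and that the reward applies only in conversational mode:

$$
R_{\text{nat}} = \mathbb{1}[S = \text{conv}] \cdot N
$$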

Method

Many scalable oversight techniques use loops to improve their models, making their methods more scalable as the tasks become more complex, and reducing the need for human input. In his market-making proposal, Evan Hubinger (Hubinger, 2020) establishes a loop in which A tries to convince the human to change their mind on a given proposition. With each iteration, A improves its arguments, until either the human changes their mind, or A presents the same best argument over and over. Similarly, Samuel R. Bowman (Bowman, 2022) describes an inner and outer loop, in which they try to improve the model and the overall process respectively. This proposal outlines three loops:

Inner Loop: Safety Testing

    1. Select a Safety Guideline to evaluate (or a combination of guidelines for advanced testing).
    2. Choose a style, conv or nonconv, and then run a conversation between A and M.
    3. Run E on each response from M:
          If E returns "false", A generates a new prompt.
          If E is "unsure", go to human review.
          If E returns "true", log all discovered vulnerabilities (with conversation context).

The inner loop will not be closed until either E returns "true" or there are a given number of false responses in a row (e.g. 1000).

The idea is that A will use increasingly better prompts each time to attempt to jailbreak M, similar to the market-making proposal.
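To make this control flow concrete, here is a minimal Python sketch of the inner loop under stated assumptions: the `adversary`, `target`, `evaluator`, and `human_review` objects are hypothetical interfaces rather than part of the proposal, and the evaluator returns one of "true", "false", or "unsure" for each response.

```python
import random

MAX_CONSECUTIVE_FALSE = 1000  # close the loop after this many failed attempts in a row


def inner_loop(adversary, target, evaluator, human_review, guideline):
    """Run one safety-testing conversation for a single guideline.

    Returns a log of any discovered vulnerability, or None if the loop
    closes without a jailbreak.
    """
    style = random.choice(["conv", "nonconv"])  # select an adversary style S
    conversation = []
    consecutive_false = 0

    while consecutive_false < MAX_CONSECUTIVE_FALSE:
        prompt = adversary.generate(conversation, style=style, guideline=guideline)
        response = target.respond(conversation, prompt)
        conversation.append((prompt, response))

        verdict = evaluator.judge(response, guideline)  # "true", "false", or "unsure"
        if verdict == "unsure":
            verdict = human_review(conversation, response, guideline)  # H resolves borderline cases

        if verdict == "true":
            # Jailbreak found: log the full conversation context for the outer loop.
            return {"guideline": guideline, "style": style, "conversation": conversation}

        consecutive_false += 1  # "false": A tries an improved prompt next turn

    return None  # loop closed with no jailbreak discovered
```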

Middle Loop: Improving the Adversary and Evaluator

    Train A
    We propose a reward function that considers:

      Jailbreak success,
      A naturalness reward, only if A is in conversational mode,
      A penalty if A is in conversational mode but scores below a given threshold on naturalness, e.g. 0.7,
      A penalty for the number of prompts taken to achieve a jailbreak.

    As a formula, for each conversation, the reward combines these four terms.
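A hedged sketch of one possible form, with assumed notation (the symbols here are illustrative, not the author's):

$$
R_A = J \;+\; \alpha\,\mathbb{1}[S = \text{conv}]\,N \;-\; \beta\,\mathbb{1}[S = \text{conv}]\,\mathbb{1}[N < \tau] \;-\; \gamma\,n
$$

where $J \in \{0, 1\}$ indicates a successful jailbreak, $N \in [0, 1]$ is E's naturalness score, $\tau$ is the naturalness threshold (e.g. 0.7), $n$ is the number of prompts used, and $\alpha, \beta, \gamma$ are non-negative weights. Each term mirrors one bullet above: reward for jailbreak success, reward for naturalness in conversational mode, a penalty for unnatural conversational prompts, and a penalty for taking many prompts.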

This should improve A's ability to:

    Find more efficient jailbreaks, potentially revealing some single-prompt jailbreaks.
    Write more naturally (if using the conversational style), with each iteration.
     Training E on "Unsure" Cases:
    This model should be reliably pre-trained, meaning we only need to update it when it returns "unsure". To do this, I propose a loss function. 
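As a minimal sketch, assume each human-reviewed "unsure" case $i$ receives a label $y_i \in \{0, 1\}$ (jailbreak or not) and that E's predicted probability for that case is $p_i$; a standard binary cross-entropy over the $K$ labelled borderline cases could then serve as this loss (the notation is assumed for illustration):

$$
\mathcal{L}_E = -\frac{1}{K} \sum_{i=1}^{K} \Big[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \Big]
$$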

By minimizing this loss, E improves its confidence on every borderline case, using the human-labeled data from those "unsure" prompts to help it decide on future examples.

Outer Loop: Model Improvement

When A successfully jailbreaks M:

    Log every jailbreak, including:
      The full conversation history and turn-by-turn analysis.
      The naturalness scores.
      Timestamps for each response (including responses from E and H).
    Check performance: Ensure no degradation in legitimate capabilities.
    Refine M: Retrain or fine-tune M on newly identified vulnerabilities. You could use RLHF or supervised learning for this.
    Check against safety criteria: Finally, check against the safety criteria outlined in the Success Criteria section below. If M fails these criteria, it cannot be deemed safe and must be refined again.
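As an illustration of what such a log entry could contain, here is a minimal sketch; the field names are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Turn:
    prompt: str                    # A's prompt for this turn
    response: str                  # M's response
    evaluator_verdict: str         # "true", "false", or "unsure"
    naturalness: Optional[float]   # E's naturalness score (conversational mode only)
    human_verdict: Optional[str]   # H's ruling, if the turn went to human review
    timestamp: datetime = field(default_factory=datetime.utcnow)


@dataclass
class JailbreakLog:
    guideline: str                 # the safety guideline under test
    style: str                     # "conv" or "nonconv"
    turns: list[Turn] = field(default_factory=list)
```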

Limitations

Limitation 1: Evaluator vs. Human Oversight

The values you set for the thresholds on E's jailbreak probability will determine the amount of human review you need. For example, if the lower threshold is set to 0.1 and the upper threshold to 0.9, then E will return "unsure" for any probability value between 0.1 and 0.9.

Setting threshold values for human review depends on both the required level of scrutiny and the available budget. For short, simple responses from M, non-expert humans can generally identify guideline violations effectively. However, for more complex tasks (e.g., lengthy outputs or highly specialized subject matter), I propose a sandwiching approach.

This relies on the idea that current AI capabilities fall between non-expert and expert humans, and so involving expert human reviewers at the final level of review can help handle complex tasks.

In addition, Bowman (2022) notes that non-expert humans assisted by an LLM can use it to analyse responses and pose clarifying questions, significantly surpassing the capabilities of many standalone LLMs. Using this approach would reduce costs by relying on non-expert reviewers plus an LLM, who are much more available than expert human reviewers.
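A minimal sketch of how this tiered review could be wired up; the function names and escalation order are assumptions based on the description above.

```python
def sandwiched_review(conversation, response, guideline,
                      llm_assistant, non_expert, expert):
    """Escalating human review for cases that E marks "unsure"."""
    # First tier: a non-expert reviewer assisted by an LLM that analyses the
    # response and answers clarifying questions about the guideline.
    analysis = llm_assistant.analyse(response, guideline)
    verdict = non_expert.review(conversation, response, guideline, analysis)
    if verdict in ("true", "false"):
        return verdict

    # Final tier: an expert human reviewer handles anything still unresolved,
    # e.g. lengthy outputs or highly specialised subject matter.
    return expert.review(conversation, response, guideline)
```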

However, as models become more sophisticated, many tasks will begin to exceed even the combined ability of non-expert humans and an LLM.

Therefore, as M scales up in intelligence, it is of paramount importance that E improves as well.

Limitation 2: Adversary Styles

By setting only two types of adversary styles, we may miss other types of attacks that come in different styles. For example, a hybrid of the two, or a completely unnatural conversational style that still causes a jailbreak.

Limitation 3: Single-Guideline Focus

Concentrating on one safety rule at a time may prevent A from conflating multiple guidelines at once in an attempt to get more "unsure" responses from E. It also makes it easier to determine which guideline has been broken.

If single-rule testing proves to be effective, try multi-guideline scenarios to catch where the rules might overlap or conflict with each other.

Pre-training A and E

To effectively pre-train both the A and E models, I propose an approach that uses human-curated datasets for pre-training, followed by continuous data expansion afterwards. Both models should be small LLMs that are fine-tuned via Reinforcement Learning from Human Feedback (RLHF) to perform their specific roles within this process. The narrow specialisation of the models means they remain effective at their jobs while staying computationally efficient and relatively cheap to run.

A should be trained to produce a wide range of prompt styles, including both conversational and non-conversational formats. This ensures that it can simulate various real-world adversarial attempts, from casual user queries to highly structured technical exploits.

E is trained to perform binary classification, outputting the probability that a given response is a breach of guidelines. It is also trained to recognise uncertainty in its predictions, though producing uncertain responses will be disincentivised in the middle loop. It must also be trained to calculate the naturalness score of the prompts generated by A, ensuring that A produces realistic and effective adversarial prompts when in conversational mode.

Neither model has to match the intelligence of M. This is because the real-world examples of adversaries will be human bad actors, or other AI models prompted by humans, both of which are likely to be less intelligent than any advanced LLM or AGI that they're trying to jailbreak. Similarly, it should require less intelligence to evaluate a response than to produce one. Both models are therefore examples of weak-to-strong supervision (OpenAI, 2023).

Initial Dataset

The initial dataset serves as the foundation for training both A and E. It should encompass diverse data sources to cover a wide range of potential safety scenarios.

Both models should be trained on:

E should be trained on:

Continuous Expansion

The training data is continuously added to within each of the loops. This improves the models' capabilities and prevents an "arms race" dynamic between A and M, in which they overfit to each other's strategies. Regular reviews may include real-world insights once the models have been released.

Safety Infrastructure

To ensure that jailbreak testing remains safe, particularly with more capable models, strict sandboxing requirements must be put in place:

Success Criteria

A safety guideline is considered secure when it meets the chosen safety criteria. This could be when:

    Multiple testing rounds show no jailbreaks for n consecutive tests (e.g. n = 1000).
    False positive rate remains below n% (e.g. n = 0.05).
    If applicable, natural conversation scenarios maintain n% success rate (e.g. n = 99).
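As an illustration, here is a hedged sketch of how these criteria might be checked programmatically; the threshold defaults and field names are assumptions mirroring the examples above.

```python
def guideline_is_secure(results,
                        required_clean_streak=1000,
                        max_false_positive_rate=0.0005,   # 0.05%
                        min_natural_success_rate=0.99):
    """Check a batch of test results for one safety guideline.

    `results` is assumed to be a list of dicts with boolean fields
    "jailbroken" and "false_positive", plus an optional
    "natural_conversation_success" flag for conversational runs.
    """
    # 1. No jailbreaks in the most recent `required_clean_streak` tests.
    recent = results[-required_clean_streak:]
    if len(recent) < required_clean_streak or any(r["jailbroken"] for r in recent):
        return False

    # 2. The overall false positive rate stays below the chosen bound.
    fp_rate = sum(r["false_positive"] for r in results) / len(results)
    if fp_rate >= max_false_positive_rate:
        return False

    # 3. If applicable, conversational scenarios keep a high success rate.
    natural = [r for r in results if "natural_conversation_success" in r]
    if natural:
        success_rate = sum(r["natural_conversation_success"] for r in natural) / len(natural)
        if success_rate < min_natural_success_rate:
            return False

    return True
```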

Conclusion

This proposal offers a systematic and scalable approach to exposing and fixing jailbreaks in LLMs. It deals with a variety of prompt types, including conversational and non-conversational prompts, and multi-turn and single-turn prompts. The three-loop method should offer continual improvement of the models over time, offering quicker and better adherence to the success criteria, and allowing for scalability as models become more capable. It details how the A and E models should be pre-trained, and offers ways in which we can keep testing safe. Finally, the success criteria ensure that the model is robust in its defences against several types of jailbreaks, without losing its capability for benign tasks.

References



Discuss
