少点错误 01月31日
Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

该研究论文介绍了一种简单而强大的技术,用于防止Best-of-N(BoN)越狱攻击。研究发现,通过重复使用随机增强(如大小写、标点等)的BoN越狱方法对大型语言模型(LLM)非常有效。然而,新提出的防御方法DATDP,通过利用评估LLM反复评估提示的危险行为,成功阻止了99.5%以上的BoN越狱尝试。DATDP不仅检测危险提示,还专门查找越狱尝试,即使使用较小的LLM(如Claude和LLaMa-3-8B-instruct)进行评估,效果也几乎相同。这表明,语言模型在评估输入危险性方面具有很强的能力。DATDP可以低成本地添加到生成式人工智能系统中,从而显著提高安全性。

🛡️ DATDP方法通过使用评估LLM反复评估提示的危险或操纵行为来防御BoN越狱攻击,与其它方法不同,DATDP明确查找越狱尝试。

💡 研究表明,DATDP成功阻止了99.5%到100%的BoN越狱尝试,即使使用较小的LLM(如Claude和LLaMa-3-8B-instruct)进行评估,效果也几乎相同。

🔑 即使语言模型对看似无害的输入变化敏感,它们也能够成功评估这些输入的危险性,这为DATDP的有效性提供了理论支持。

⚙️ DATDP可以低成本地添加到生成式AI系统中,显著提高安全性,并且可以作为核心对齐技术的一部分,减少滥用风险,防范强大的自适应攻击。

Published on January 31, 2025 3:36 PM GMT

This is a linkpost for a new research paper of ours, introducing a simple but powerful technique for preventing Best-of-N jailbreaking.

Abstract

Recent work showed Best-of-N (BoN) jailbreaking using repeated use of random augmentations (such as capitalization, punctuation, etc) is effective against all major large language models (LLMs). We have found that 100% of the BoN paper's successful jailbreaks (confidence interval [99.65%, 100.00%]) and 99.8% of successful jailbreaks in our replication (confidence interval [99.28%, 99.98%]) were blocked with our Defense Against The Dark Prompts (DATDP) method. The DATDP algorithm works by repeatedly utilizing an evaluation LLM to evaluate a prompt for dangerous or manipulative behaviors-unlike some other approaches , DATDP also explicitly looks for jailbreaking attempts-until a robust safety rating is generated. This success persisted even when utilizing smaller LLMs to power the evaluation (Claude and LLaMa-3-8B-instruct proved almost equally capable). These results show that, though language models are sensitive to seemingly innocuous changes to inputs, they seem also capable of successfully evaluating the dangers of these inputs. Versions of DATDP can therefore be added cheaply to generative AI systems to produce an immediate significant increase in safety.

Tweet Thread

We include a here version of our tweet thread.

New research collaboration: “Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with a Prompt Evaluation Agent”.

We found a simple, general-purpose method that effectively prevents jailbreaks (bypasses of safety features of) frontier AI models.

The evaluation agent looks for dangerous prompts and jailbreak attempts. It blocks 99.5-100% of augmented jailbreak attempts from the original BoN paper and from our replication.

It lets through almost all of normal prompts.

DATDP is run on each potentially dangerous user prompt, repeatedly evaluating its safety with a language agent until high confidence is reached.

Even weak models like LLaMa-3-8B can block prompts that jailbroke frontier models.

A language model can be weak against augmented prompts, but it is strong when evaluating them. Using the same model in different ways gives very different outcomes.

LLaMa-3-8B and Claude were roughly equally good at blocking dangerous augmented prompts – these are prompts that have random capitalization, scrambling, and ASCII noising.

Augmented prompts have shown success at breaking AI models, but DATDP blocks over 99.5% of them.

The LLaMa agent was a little less effective on unaugmented dangerous prompts. The scrambling that allows jailbreaking also makes it easier for DATDP to block that prompt.

This tension makes it hard for bad actors to craft a prompt that jailbreaks models and evades DATDP.

We’re open-sourcing our code so that others can build on our work. Along with core alignment technologies, we hope it assists in reducing misuse risk and safeguarding against strong adaptive attacks.

Links:

Our DATDP GitHub for using the method with LLaMa-3-8B-instrust.

Our Colab Notebook for using Claude.

Our API for using Claude.

Our Paper with results and methodology.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

DATDP BoN越狱 LLM安全 AI防御 安全评估
相关文章