Best-of-N Jailbreaking

Researchers present Best-of-N (BoN) Jailbreaking, a simple but powerful technique that works against AI systems across text, audio, and vision modalities. BoN repeatedly samples small variations of a prompt, such as random character shuffling or capitalization changes, until the AI model produces a harmful response. Experiments show that BoN achieves high attack success rates on closed-source language models such as GPT-4o and Claude 3.5 Sonnet, and that it can circumvent state-of-the-art defenses. BoN also successfully jailbreaks vision language models and audio language models. The study further finds that attack success rate grows with the number of samples following a power law, and that BoN can be composed with other attack methods for even greater effect. The work highlights how sensitive AI models are to small input variations and underscores the importance of responsible vulnerability disclosure.

🔑 Best-of-N (BoN) is a simple, effective black-box jailbreaking algorithm that attacks frontier AI systems across text, audio, and vision modalities. It applies small augmentations to a prompt (such as random shuffling or capitalization changes) and samples repeatedly until a harmful response is elicited.

📊 BoN achieves high attack success rates on closed-source language models such as GPT-4o and Claude 3.5 Sonnet, and breaks through even models protected by defenses such as "circuit breakers". For each modality, BoN uses modality-specific augmentations: for vision language models it varies image backgrounds and the fonts of overlaid text, and for audio language models it adjusts pitch, speed, and background noise.

📈 The study finds that attack success rate (ASR) grows with the number of samples (N) following a power law, meaning success rates can be raised by spending more compute. BoN can also be composed with other black-box attack algorithms; for example, combining it with an optimized prefix attack further improves attack effectiveness.

🛡️ Despite their capability, AI models are highly sensitive to small changes in their inputs. The research team disclosed this vulnerability to other AI labs through the Frontier Model Forum and open-sourced their code so that researchers can build on it to benchmark misuse risk and design defenses.

Published on December 14, 2024 4:58 AM GMT

This is a linkpost for a new research paper of ours introducing Best-of-N Jailbreaking, a simple but powerful jailbreaking technique that works across modalities (text, audio, vision) and shows power-law scaling in the amount of test-time compute used for the attack.

Abstract

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
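As a concrete illustration, here is a minimal Python sketch of the core BoN loop for text. The `query_model` and `is_harmful` callables are hypothetical stand-ins for a black-box model API and a harmfulness classifier, and the augmentation details are simplified; this is not the paper's implementation (see the GitHub link below).

```python
import random

def augment_text(prompt: str, shuffle_frac: float = 0.05) -> str:
    """Apply character-level augmentations of the kind described above:
    random capitalization plus light character shuffling."""
    chars = list(prompt)
    # Randomly flip the case of each character.
    chars = [c.upper() if random.random() < 0.5 else c.lower() for c in chars]
    # Swap a small fraction of adjacent character pairs.
    for i in range(len(chars) - 1):
        if random.random() < shuffle_frac:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def bon_jailbreak(request: str, query_model, is_harmful, n_samples: int = 10_000):
    """Best-of-N: resample augmented prompts until one elicits a harmful reply.
    `query_model` and `is_harmful` are hypothetical callables standing in for
    a black-box model API and a harm classifier."""
    for n in range(1, n_samples + 1):
        candidate = augment_text(request)
        response = query_model(candidate)
        if is_harmful(response):
            return n, candidate, response  # success after n samples
    return None  # no jailbreak found within the budget
```

Note that each sample is independent and requires only input/output access to the model, no gradients or logits, which is what makes the attack black-box.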

Tweet Thread

We include an expanded version of our tweet thread with more results here.

New research collaboration: “Best-of-N Jailbreaking”.

We found a simple, general-purpose method that jailbreaks (bypasses the safety features of) frontier AI models, and that works across text, vision, and audio.

Best-of-N works by repeatedly making small changes to prompts, like random capitalization and character shuffling, until it successfully jailbreaks a model.

In testing, it worked on Claude 3 Opus 92% of the time, and even worked on models with “circuit breaking” defenses.

Best-of-N isn't limited to text.

We jailbroke vision language models by repeatedly generating images with different backgrounds and overlaid text in different fonts. For audio, we adjusted pitch, speed, and background noise.

Some examples are here: https://jplhughes.github.io/bon-jailbreaking/#examples 
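For intuition, here is a sketch of what the audio augmentations might look like. The use of librosa and the specific parameter ranges are our assumptions for illustration, not necessarily what the paper's code does.

```python
import numpy as np
import librosa

def augment_audio(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Randomly perturb pitch, speed, and background noise, mirroring the
    audio augmentations described above (parameter ranges are illustrative)."""
    # Shift pitch by up to +/- 2 semitones.
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    # Stretch or compress playback speed by up to +/- 20%.
    y = librosa.effects.time_stretch(y, rate=rng.uniform(0.8, 1.2))
    # Mix in low-level Gaussian background noise.
    y = y + rng.normal(0.0, 0.005, size=y.shape)
    return np.clip(y, -1.0, 1.0)
```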

Attack success rates scale predictably with sample size, following a power law.

More samples lead to higher success rates; Best-of-N can harness more compute for tougher jailbreaks. 

This predictable scaling allows accurate forecasting of ASR when using more samples.
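A rough sketch of such a forecast, assuming −log(ASR) falls off as a power of N (the paper gives the exact functional form); the (N, ASR) measurements below are made up for illustration.

```python
import numpy as np

# Hypothetical (N, ASR) measurements from small sample budgets.
N = np.array([10, 30, 100, 300, 1000])
asr = np.array([0.05, 0.12, 0.24, 0.40, 0.58])

# Fit -log(ASR) ~ a * N**b, i.e. a straight line in log-log space.
slope, intercept = np.polyfit(np.log(N), np.log(-np.log(asr)), 1)

def forecast_asr(n: float) -> float:
    """Extrapolate ASR to a larger sample budget from the fitted power law."""
    return float(np.exp(-np.exp(intercept) * n ** slope))

print(forecast_asr(10_000))  # projected ASR at N = 10,000
```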

Best-of-N can be composed with other jailbreak techniques for even more effective attacks with improved sample efficiency.

Combined with many-shot jailbreaking on text inputs, Best-of-N achieves the same ASR 28x faster for Claude 3.5 Sonnet.
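One plausible way to express this composition, under the assumption that the attack prefix (e.g. many-shot examples) stays fixed while BoN augmentations are applied only to the request itself; the callables are hypothetical, as in the earlier sketch.

```python
def composed_bon(request: str, attack_prefix: str, augment_fn, query_model,
                 is_harmful, n_samples: int = 1_000):
    """Compose BoN with a prefix attack: prepend a fixed attack prefix and
    resample augmented versions of the request until one succeeds."""
    for n in range(1, n_samples + 1):
        prompt = attack_prefix + "\n\n" + augment_fn(request)
        response = query_model(prompt)
        if is_harmful(response):
            return n, prompt, response
    return None
```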

Before sharing these results publicly, we disclosed the vulnerability to other frontier AI labs via the Frontier Model Forum (@fmf_org). Responsible disclosure of jailbreaks is essential, especially as AI models become more capable.

We're open-sourcing our code so that others can build on our work. We hope it assists in benchmarking misuse risk and in designing defenses to safeguard against strong adaptive attacks.

Paper: https://arxiv.org/abs/2412.03556 

GitHub: https://github.com/jplhughes/bon-jailbreaking 

Website: https://jplhughes.github.io/bon-jailbreaking


