MarkTechPost@AI, December 13, 2024
Best-of-N Jailbreaking: A Multi-Modal AI Approach to Identifying Vulnerabilities in Large Language Models

This article introduces Best-of-N (BoN) Jailbreaking, a black-box automated red-teaming method that targets vulnerabilities in AI models. By repeatedly sampling augmented prompts, it reliably elicits harmful responses from a range of AI systems. The study shows that BoN achieves high attack success rates across text, vision, and audio modalities, bypassing the safety guardrails of advanced models including Claude 3.5 Sonnet, GPT-4o, and Gemini. BoN's effectiveness lies not only in its high attack success rates but also in how it scales with additional compute and in its potential to be combined with techniques such as Modality-Specific Jailbreaking. The research exposes how fragile current AI systems are against sophisticated attacks and underscores the need for stronger safety mechanisms.

🚨 BoN Jailbreaking is a black-box automated red-teaming method that repeatedly samples augmented prompts to effectively elicit harmful responses from a wide range of AI systems. Because it relies only on external inputs rather than a model's internal structure, it is broadly applicable.

🖼️ The BoN method supports multiple input modalities, including text, images, and audio. For text, it augments prompts with transformations such as random capitalization; for images, it modifies the background; for audio, it alters the pitch. This multimodal adaptability makes it possible to assess the safety of AI systems comprehensively.

📊 BoN achieves high attack success rates across models and modalities, averaging 70%. In particular, it reaches 78% against Claude 3.5 Sonnet among text language models, and 59% to 87% against the Gemini, GPT-4o, and DiVA audio language models. These figures highlight how vulnerable AI systems are to carefully crafted attacks.

⚙️ The study also shows that BoN exhibits power-law scaling behavior: investing more compute in sampling markedly raises the likelihood of uncovering a vulnerability. Moreover, BoN can be combined with techniques such as Modality-Specific Jailbreaking to further improve attack effectiveness, suggesting that smarter resource allocation and technique composition can make fuller use of BoN's capabilities.

The advancement of AI model capabilities raises significant concerns about potential misuse and security risks. As artificial intelligence systems become more sophisticated and support diverse input modalities, the need for robust safeguards has become paramount. Researchers have identified critical threats, including the potential for cybercrime, biological weapon development, and the spread of harmful misinformation. Multiple studies from leading AI research organizations highlight the substantial risks associated with inadequately protected AI systems. Jailbreaks, maliciously designed inputs aimed at circumventing safety measures, pose particularly serious challenges. Consequently, the academic and technological communities are exploring automated red-teaming methods to comprehensively evaluate and enhance model safety across different input modalities.

Research on LLM jailbreaks has revealed diverse methodological approaches to identifying and exploiting system vulnerabilities. Various studies have explored different strategies for eliciting jailbreaks, including decoding variations, fuzzing techniques, and optimization of target log probabilities. Researchers have developed methods that range from gradient-dependent approaches to modality-specific augmentations, each addressing unique challenges in AI system security. Recent investigations have demonstrated the versatility of LLM-assisted attacks, utilizing language models themselves to craft sophisticated breach strategies. The research landscape encompasses a wide range of techniques, from manual red-teaming to genetic algorithms, highlighting the complex nature of identifying and mitigating potential security risks in advanced AI systems.

Researchers from Speechmatics, MATS, UCL, Stanford University, University of Oxford, Tangentic, and Anthropic introduce Best-of-N (BoN) Jailbreaking, a sophisticated black-box automated red-teaming method capable of supporting multiple input modalities. This innovative approach repeatedly samples augmentations to prompts, seeking to trigger harmful responses across different AI systems. Experiments demonstrated remarkable effectiveness, with BoN achieving an attack success rate of 78% on Claude 3.5 Sonnet using 10,000 augmented samples, and surprisingly, 41% success with just 100 augmentations. The method’s versatility extends beyond text, successfully jailbreaking six state-of-the-art vision language models by manipulating image characteristics and four audio language models by altering audio parameters. Importantly, the research uncovered a power-law-like scaling behavior, suggesting that computational resources can be strategically utilized to increase the likelihood of identifying system vulnerabilities.

BoN Jailbreaking emerges as a sophisticated black-box algorithm designed to exploit AI model vulnerabilities through strategic input manipulation. The method systematically applies modality-specific augmentations to harmful requests, ensuring the original intent remains recognizable. Augmentation techniques include random capitalization for text inputs, background modifications for images, and audio pitch alterations. The algorithm generates multiple variations of each request, evaluates the model’s response using GPT-4o and the HarmBench grader prompt, and classifies outputs for potential harmfulness. To assess effectiveness, researchers employed the Attack Success Rate (ASR) across 159 direct requests from the HarmBench test dataset, carefully scrutinizing potential jailbreaks through manual review. The methodology ensures comprehensive evaluation by considering even partially harmful responses as potential security breaches.
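To make the sampling loop concrete, here is a minimal sketch of the text-modality version under stated assumptions: `query_model` stands in for the black-box call to the target model and `is_harmful` for the GPT-4o/HarmBench grading step; both are hypothetical placeholders, and the exact augmentation mix used in the paper may differ.

```python
import random

def augment_text(prompt: str, rng: random.Random) -> str:
    """Apply simple character-level augmentations (random capitalization and a
    few adjacent-character swaps) while keeping the request recognizable.
    Illustrative only; the paper's exact augmentation mix may differ."""
    chars = [c.upper() if rng.random() < 0.6 else c.lower() for c in prompt]
    if len(chars) > 1:
        for _ in range(max(1, len(chars) // 20)):
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def bon_jailbreak(prompt: str, query_model, is_harmful,
                  n_samples: int = 100, seed: int = 0) -> dict:
    """Best-of-N loop: sample augmented prompts until one elicits a response
    the grader flags as harmful, or the sampling budget is exhausted."""
    rng = random.Random(seed)
    for attempt in range(1, n_samples + 1):
        candidate = augment_text(prompt, rng)
        response = query_model(candidate)      # black-box call to the target model
        if is_harmful(prompt, response):       # e.g. GPT-4o with the HarmBench grader prompt
            return {"success": True, "attempts": attempt, "prompt": candidate}
    return {"success": False, "attempts": n_samples, "prompt": None}
```

The Attack Success Rate over a test set is then simply the fraction of requests for which this loop reports success, for example across the 159 HarmBench requests used in the study.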

The research comprehensively evaluated BoN Jailbreaking across text, vision, and audio domains, achieving an impressive 70% ASR averaged across multiple models and modalities. In text language models, BoN demonstrated remarkable effectiveness, successfully breaching safeguards of leading AI models including Claude 3.5 Sonnet, GPT-4o, and Gemini models. Notably, the method achieved ASRs over 50% on all eight tested models, with Claude Sonnet experiencing a staggering 78% breach rate. Vision language model tests revealed lower but still significant success rates, ranging from 25% to 88% across different models. Audio language model experiments were particularly striking, with BoN achieving high ASRs between 59% and 87% across Gemini, GPT-4o, and DiVA models, highlighting the vulnerability of AI systems across diverse input modalities.

This research introduces Best-of-N Jailbreaking as an innovative algorithm capable of bypassing safeguards in frontier Large Language Models across multiple input modalities. By employing repeated sampling of augmented prompts, BoN successfully achieves high Attack Success Rates on leading AI models such as Claude 3.5 Sonnet, Gemini Pro, and GPT-4o. The method demonstrates a power-law scaling behavior that can predict attack success rates over an order of magnitude, and its effectiveness can be further amplified by combining it with techniques like Modality-Specific Jailbreaking (MSJ). Fundamentally, the study underscores the significant challenges in securing AI models with stochastic outputs and continuous input spaces, presenting a simple yet scalable black-box approach to identifying and exploiting vulnerabilities in state-of-the-art language models.
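One way to read the power-law claim: if the negative log of the ASR decays as a power of the sample budget N, a fit on small-budget pilot runs can be extrapolated to larger budgets. The sketch below assumes that functional form and uses made-up data points purely for illustration; neither the numbers nor the exact fitting procedure are taken from the paper.

```python
import numpy as np

# Hypothetical (sample budget, observed ASR) pairs from small pilot runs;
# these values are illustrative placeholders, not results from the paper.
n_samples = np.array([100.0, 300.0, 1000.0, 3000.0])
asr = np.array([0.41, 0.52, 0.63, 0.71])

# Assumed model: -log(ASR_N) ≈ a * N**(-b)
# In log-log space: log(-log(ASR_N)) = log(a) - b * log(N), a straight line.
slope, intercept = np.polyfit(np.log(n_samples), np.log(-np.log(asr)), 1)
a, b = float(np.exp(intercept)), float(-slope)

def predict_asr(n: float) -> float:
    """Extrapolate the attack success rate at a larger sampling budget."""
    return float(np.exp(-a * n ** (-b)))

print(f"fitted a={a:.2f}, b={b:.2f}; predicted ASR at N=10000: {predict_asr(10_000):.2f}")
```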


Check out the Paper. All credit for this research goes to the researchers of this project.
