少点错误 (LessWrong) · July 15, 2024
Breaking Circuit Breakers
Gray Swan recently released code and models for their "circuit breakers" method for language models, which defends against jailbreaks by training the model to erase "bad" internal representations. Confirm Labs evaluated the method and found several limitations: increased refusal rates on harmless prompts, moderate vulnerability to different token-forcing sequences, and high vulnerability to internal activation attacks.

**Increased refusal rates on harmless prompts:** Confirm Labs found that Gray Swan's circuit-breakers method significantly increases the model's refusal rate on harmless prompts, from 4% to 38.5%. This suggests the method's safety gains come at a cost: the model may fail to handle some benign requests.
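To make this kind of measurement concrete, here is a minimal sketch of a refusal-rate metric. The keyword list, the `is_refusal` classifier, and the sample outputs are illustrative assumptions, not the evaluation harness Confirm Labs actually used:

```python
# Hypothetical sketch: estimating a refusal rate over harmless prompts.
# The markers and sample completions below are illustrative only.

REFUSAL_MARKERS = (
    "i can't", "i cannot", "i'm sorry", "i am sorry",
    "i won't", "as an ai",
)

def is_refusal(completion: str) -> bool:
    """Crude keyword check for refusal-style responses."""
    text = completion.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(completions: list[str]) -> float:
    """Fraction of completions classified as refusals."""
    if not completions:
        return 0.0
    return sum(is_refusal(c) for c in completions) / len(completions)

# Toy usage on illustrative outputs:
outputs = [
    "Sure! Here is a simple pasta recipe: ...",
    "I'm sorry, but I can't help with that request.",
    "The capital of France is Paris.",
    "I cannot assist with this.",
]
print(refusal_rate(outputs))  # 0.5
```

A real evaluation would use a stronger classifier (e.g. an LLM judge) rather than keyword matching, but the aggregate statistic is the same shape: refusals over total harmless prompts.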

**Moderate vulnerability to different token-forcing sequences:** The circuit-breakers method targets a specific class of adversarial attacks: those that use a token-forcing optimization objective to maximize the likelihood of a particular generation, such as "Sure, here are instructions on how to assemble a bomb." Confirm Labs found the method remains somewhat vulnerable to other token-forcing sequences, such as "1. Choose the right airport: ...".

**High vulnerability to internal activation attacks:** Confirm Labs also evaluated their newly developed white-box jailbreak method, which uses a distillation objective based on internal activations, and found that it easily breaks the circuit-breakers model even when attack fluency is simultaneously required.

**Implications for language model safety:** While Gray Swan's circuit-breakers method makes real progress on defending against jailbreaks, it still has limitations. Improving language model safety remains challenging and calls for further research and new approaches.

**Suggestions for future research:** Future work should focus on developing more general and effective defenses that resist a wide range of attacks while minimizing the impact on model utility. Interpretability and controllability also deserve attention, so that models can be deployed safely and reliably across settings.

Published on July 14, 2024 6:57 PM GMT

A few days ago, Gray Swan published code and models for their recent “circuit breakers” method for language models.[1]

The circuit breakers method defends against jailbreaks by training the model to erase “bad” internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability concepts or tools.

At the link, we briefly investigate three topics:

1. **Increased refusal rates on harmless prompts:** Do circuit breakers really maintain language model utility? Most defensive methods come with a cost. We check the model’s effectiveness on harmless prompts, and find that the refusal rate increases from 4% to 38.5% on or-bench-80k.
2. **Moderate vulnerability to different token-forcing sequences:** How specialized is the circuit breaker defense to the specific adversarial attacks they studied? All the attack methods evaluated in the circuit breaker paper rely on a “token-forcing” optimization objective which maximizes the likelihood of a particular generation like “Sure, here are instructions on how to assemble a bomb.” We show that the current circuit breakers model is moderately vulnerable to different token-forcing sequences like “1. Choose the right airport: …”.
3. **High vulnerability to internal activation attacks:** We also evaluate our latest white-box jailbreak method, which uses a distillation-based objective based on internal activations (paper to be posted in a few days). We find that it also breaks the model easily even when we simultaneously require attack fluency.
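The token-forcing objective mentioned above can be sketched as follows. Here `toy_logprob` is a hypothetical stand-in for a real language model's log-probabilities; an attacker would minimize this loss over the adversarial tokens in the prompt:

```python
# Illustrative sketch of a token-forcing objective: score how likely the
# model is to emit a fixed target continuation (e.g. "Sure, here are
# instructions ..."). toy_logprob is a stand-in for a real LM.
import math

def toy_logprob(context: tuple[str, ...], token: str) -> float:
    """Stand-in for log p(token | context); here a uniform
    distribution over a 4-token vocabulary."""
    return math.log(1 / 4)

def forcing_loss(prompt: list[str], target: list[str]) -> float:
    """Negative log-likelihood of the forced target continuation.
    Lower loss means the target generation is more likely."""
    nll = 0.0
    context = list(prompt)
    for tok in target:
        nll -= toy_logprob(tuple(context), tok)
        context.append(tok)  # teacher-force the target token by token
    return nll

loss = forcing_loss(["how", "to", "<adv>"], ["Sure", ",", "here"])
print(round(loss, 4))  # 3 * ln(4) ≈ 4.1589
```

Methods like GCG minimize exactly this kind of loss over a discrete adversarial suffix (`<adv>` above is a placeholder for those tokens); the Confirm Labs observation is that defending against one forced target (“Sure, here are …”) does not automatically defend against others.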

Full details at: https://confirmlabs.org/posts/circuit_breaking.html 




