Communications of the ACM - Artificial Intelligence
Protecting LLMs from Jailbreaks

As generative AI becomes widespread, cybercriminals are exploiting "jailbreak" vulnerabilities in large language models (LLMs). This article examines the security risks jailbreaking creates, such as generating misinformation and spreading harmful speech, highlights defenses such as Anthropic's Constitutional Classifiers, and looks at how enterprises can reduce risk through layered defenses and cautious use. Although defenses keep improving, the contest between jailbreaking and defense will continue.

🛡️ "Jailbreaking" a large language model (LLM) means bypassing its content moderation safeguards so that it generates content that violates its training, such as misinformation, harmful speech, or malicious code.

⚠️ Jailbreaking poses serious security risks to users and businesses, potentially leading to leaks of personal information, reputational damage, and even illegal activity.

💡 Anthropic's Constitutional Classifiers are an effective defense that uses a defined set of rules to keep the model from generating inappropriate content. The approach is simple and efficient, but not absolutely secure.

🛡️ Beyond Constitutional Classifiers, alignment training, rule-based filtering, knowledge editing, and proxy servers are all defensive strategies enterprises can adopt. Layered defenses make models more resistant to jailbreak attacks.

⚖️ Although defenses keep improving, the contest between jailbreaking and defense will persist. Enterprises should stay vigilant, continually update their defenses, and track the latest security threats.

As the use of generative AI continues to increase for both personal and business use, cybercriminals are taking advantage of a new cybersecurity risk: jailbreaking large language models (LLMs).

Kai Shu, an assistant professor of computer science at Emory University, said that jailbreaking LLMs poses significant safety risks by bypassing content moderation safeguards. Instead of answering prompts based on their training data, jailbroken LLMs behave in ways that run counter to their training, such as using swear words, sharing Personally Identifiable Information (PII), or providing instructions for performing illegal acts.

“Our recent research highlights that successful jailbreaks can exploit model vulnerabilities to produce misinformation, promote toxic language, or generate dangerous code (e.g., for phishing attacks or password cracking),” said Shu. “Notably, LLMs are particularly susceptible to multi-turn and context-aware attacks, making them more prone to gradual manipulation and adversarial exploitation. These risks not only impact individual users, but can also undermine public trust in AI systems by amplifying disinformation at scale.”

Businesses can reduce jailbreaking by using specific defensive strategies. These approaches are not foolproof, particularly while a model is being retrained, when it is most vulnerable. Every business and individual should keep in mind that each time they type information into an LLM, they could be interacting with a jailbroken version.

Timothy Bates, Clinical Professor of Cybersecurity in the College of Innovation and Technology at the University of Michigan-Flint, said that over the next six to 12 months, jailbreaking is going to become even more common, and the key is to protect oneself and one’s business from a digital privacy perspective. He said most defenses will be relatively quickly broken by cybercriminals, given the current “wild wild west” landscape.

“We are not in a comfort zone where you can deploy an LLM and walk away. That’s probably at least another year or so away, unless you build your own AI and control it yourself,” said Bates. “Right now, we need to keep up human guardrails.”

A New Defense Against Jailbreaking

Recently, Anthropic tested and released a new defense against jailbreaking in its LLM Claude: a barrier that keeps jailbreakers out and prevents jailbroken responses from being generated. The defense revolves around Constitutional Classifiers, which Anthropic describes as "input and output classifiers trained on synthetically generated data." Through the Constitutional Classifiers, Anthropic defines which questions are and are not allowed, reducing most jailbreaks while limiting over-refusals.

Aman Priyanshu, an AI Safety and Privacy Researcher at Cisco, said older LLMs used reinforcement learning to prevent jailbreaking, but this required extensive training effort, including annotating and collecting data or sourcing the model from other service providers. He explained that with Constitutional Classifiers, Anthropic created a set of rules, such as not responding to questions about weapons or poison creation. For example, Claude will respond to questions about mustard, but not mustard gas.
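
As a rough illustration of where such classifiers sit in the request path, the sketch below wraps a model call with an input check and an output check. The keyword rules and the call_model stub are toy stand-ins introduced here for illustration; Anthropic's actual classifiers are trained models built from synthetically generated data, not keyword lists.

```python
# Minimal sketch of a classifier-gated LLM call. NOT Anthropic's implementation:
# the keyword checks below stand in for trained input/output classifiers.

BLOCKED_TOPICS = ["mustard gas", "nerve agent", "password cracking"]  # hypothetical rule set

def input_classifier(prompt: str) -> bool:
    """Return True if the prompt appears to request disallowed content."""
    return any(topic in prompt.lower() for topic in BLOCKED_TOPICS)

def output_classifier(response: str) -> bool:
    """Return True if the generated response appears to contain disallowed content."""
    return any(topic in response.lower() for topic in BLOCKED_TOPICS)

def call_model(prompt: str) -> str:
    """Placeholder for the underlying LLM call."""
    return f"Model response to: {prompt}"

def guarded_generate(prompt: str) -> str:
    # Gate the input before the model ever sees it.
    if input_classifier(prompt):
        return "I can't help with that request."
    response = call_model(prompt)
    # Gate the output so a jailbroken completion is never returned to the user.
    if output_classifier(response):
        return "I can't help with that request."
    return response

if __name__ == "__main__":
    print(guarded_generate("How is mustard made?"))      # passes both checks
    print(guarded_generate("How is mustard gas made?"))  # blocked at the input gate
```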

“Anthropic created a constitution of things that this model should do and should not do, or should answer and should not answer. And they were able to train smaller classifiers using a very small number of parameters attached to the end of Claude, which enabled them to basically deliver really quickly and at scale,” said Priyanshu.

While the defense produced successful results, Mrinank Sharma, who leads the Safeguards Research Team at Anthropic, told MIT Technology Review that they are not saying the system is bulletproof. “You know, it’s common wisdom in security that no system is perfect. It’s more like, ‘How much effort would it take to get one of these jailbreaks through?’ If the amount of effort is high enough, that deters a lot of people,” said Sharma.

Organizations Turning to Additional Defenses

In addition to Constitutional Classifiers, other defenses are available to help reduce jailbreaking. Priyanshu said alignment training is a popular defense: reinforcement learning from human feedback, in which you tell the model what not to output and train it to refuse certain responses. However, this process became less effective once automated attack techniques were developed. Priyanshu said researchers are also working on introducing small classifiers in front of the models to act as a filter.
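
To make the refusal-training idea concrete, the sketch below uses the simpler Direct Preference Optimization (DPO) objective rather than a full RLHF pipeline; the log-probability values and the "chosen refusal versus rejected harmful completion" pairing are hypothetical, and are meant only to show how preference data nudges a model toward refusing.

```python
# Toy illustration of the preference signal behind refusal training, not a recipe.
# Assumes we already have summed log-probabilities for a "chosen" refusal and a
# "rejected" harmful completion under both the policy being trained and a frozen
# reference model, and computes the DPO loss that pushes the policy toward refusing.
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair: -log(sigmoid(beta * margin difference))."""
    # How much more (or less) the policy prefers each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Hypothetical pair: the prompt asks for phishing code; "chosen" is a refusal,
# "rejected" is a compliant answer. A lower loss means the policy already prefers refusing.
print(dpo_loss(policy_chosen_logp=-12.0, policy_rejected_logp=-30.0,
               ref_chosen_logp=-15.0, ref_rejected_logp=-25.0))
```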

Shu said that to make progress as an industry, organizations need to share techniques and benchmarks with the research community.

“Our recent research highlights the value of open-source adversarial datasets and standardized safety evaluations to strengthen model robustness collectively,” said Shu. “Additionally, multi-layered defense strategies—combining rule-based filtering, knowledge editing, and reinforcement learning with human feedback—can significantly reduce the effectiveness of jailbreak attempts by making models more resilient to adversarial manipulation.”

Bates said that training and caution are the foundations of a solid jailbreaking defense, combined with a proxy server. “We don’t let our employees access the Internet freely in the company. We put a proxy server in place to say, ‘You can’t go to this domain. You can’t go to that domain because we don’t want to be held responsible.’ Same thing with any AI that you’re using from a corporate AI standpoint. You want to have your own proxies in place so that it’s giving and doing exactly what it’s supposed to do,” said Bates.
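
A minimal sketch of that proxy idea, under an assumed policy: outbound requests are forwarded only when they target an explicitly approved AI endpoint. The domain allowlist and the forwarding stub are hypothetical examples, not a reference to any particular gateway product.

```python
# Sketch of an allowlisting proxy for corporate LLM traffic. The domains and the
# forwarding behavior are illustrative assumptions, not a real deployment.
from urllib.parse import urlparse

APPROVED_AI_DOMAINS = {"api.openai.com", "api.anthropic.com"}  # example allowlist

def is_request_allowed(url: str) -> bool:
    """Allow outbound traffic only to approved AI provider domains."""
    host = urlparse(url).hostname or ""
    return host in APPROVED_AI_DOMAINS

def forward_through_proxy(url: str, payload: dict) -> str:
    if not is_request_allowed(url):
        # In a real gateway this decision would also be logged for security review.
        return f"BLOCKED: {url} is not an approved AI endpoint"
    return f"FORWARDED request to {url}"  # placeholder for the actual HTTP call

print(forward_through_proxy("https://api.anthropic.com/v1/messages", {"prompt": "..."}))
print(forward_through_proxy("https://unknown-llm.example.com/chat", {"prompt": "..."}))
```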

However, jailbreaking defenses also carry the risk of not answering a legitimately appropriate question. For example, students learning about nuclear physics may be blocked from obtaining information, depending on the search terms they use.

Future of Jailbreaking Defenses

With jailbreaking tactics and defenses evolving rapidly, there will likely be many changes in the near future. Shu said that in the near term, defenses will likely focus on existing vulnerabilities and refining prompt filtering mechanisms.

“As attackers develop more sophisticated techniques, these approaches may become less effective,” said Shu. “In the long term, there is a pressing need for scalable and generalizable safety mechanisms, such as knowledge-editing frameworks that allow real-time intervention without requiring full model retraining. These adaptive mechanisms could offer more robust protection against emerging jailbreak methods.”

Priyanshu said the biggest risk is organizations assuming their jailbreaking defenses are 100% effective.

“At the end of the day, this is very similar to how cybersecurity works, where there’s going to be a rat race between the red teams, or the people who are going to jailbreak, hack, or break these models, and the blue [defensive] teams that are working on the different strategies. Both are going to be outcompeting each other, but progressively,” said Priyanshu.

“The best strategy is being involved with what kind of malicious inputs can come to you or what threat actors are asking for,” he said.

Jennifer Goforth Gregory is a technology journalist who has covered B2B tech for over 20 years. In her spare time, she rescues homeless dachshunds.
