Protecting LLMs from Jailbreaks

As the use of generative AI continues to increase for both personal and business use, cybercriminals are taking advantage of a new cybersecurity risk: jailbreaking large language models (LLMs).

Kai Shu, an assistant professor of computer science at Emory University, said that jailbreaking LLMs poses significant safety risks through bypassing content moderation safeguards. Instead of answering prompts based on training data, jailbroken LLMs behave in ways against their training, such as using swear words, sharing Personally Identifiable Information (PII), or providing instructions on performing illegal acts.

“Our recent research highlights that successful jailbreaks can exploit model vulnerabilities to produce misinformation, promote toxic language, or generate dangerous code (e.g., for phishing attacks or password cracking),” said Shu. “Notably, LLMs are particularly susceptible to multi-turn and context-aware attacks, making them more prone to gradual manipulation and adversarial exploitation. These risks not only impact individual users, but can also undermine public trust in AI systems by amplifying disinformation at scale.”

Businesses can reduce jailbreaking by using specific strategies for defense. These approaches are not foolproof, especially when a model is being retrained, when it is most vulnerable. It would be useful for every business and person to know that every time they type information into an LLM, they could be interacting with a jailbroken version.

Timothy Bates, Clinical Professor of Cybersecurity in the College of Innovation and Technology at the University of Michigan-Flint, said that over the next six to 12 months, jailbreaking is going to become even more common, and the key is to protect oneself and one’s business from a digital privacy perspective. He said most defenses will be relatively quickly broken by cybercriminals, given the current “wild wild west” landscape.

“We are not in a comfort zone where you can deploy an LLM and walk away. That’s probably at least another year or so away, unless you build your own AI and control it yourself,” said Bates. “Right now, we need to keep up human guardrails.”

A New Defense Against Jailbreaking

Recently, Anthropic tested and released a new defense against jailbreaking in its LLM Claude, which includes a barrier that works to keep jailbreakers out and prevents jailbroken responses from being generated. The defense revolves around Constitutional Classifiers, which Anthropic describes as “input and output classifiers trained on synthetically generated data.” Anthropic defines what is allowed and not allowed in terms of questions through the Constitutional Classifiers, reducing most jailbreaks while limiting over-refusals.

Aman Priyanshu, an AI Safety and Privacy Researcher at Cisco, said older LLM models used reinforced learning to prevent jailbreaking issues, but this required extensive training efforts, including annotating and collecting or sourcing the model from other service providers. He explained that with Constitutional Classifiers, Anthropic created a set of rules, such as not responding to questions about weapons or poison creation. For example, Claude will respond to questions about mustard, but not mustard gas.

“Anthropic created a constitution of things that this model should do and should not do, or should answer and should not answer. And they were able to train smaller classifiers using a very small number of parameters attached to the end of the cloud, which enabled them to basically deliver really quickly and at scale,” said Priyanshu.

While the defense produced successful results, Mrinank Sharma, who leads the Safeguards Research Team at Anthropic, told MIT Technology Review that they are not saying the system is bulletproof. “You know, it’s common wisdom in security that no system is perfect. It’s more like, ‘How much effort would it take to get one of these jailbreaks through?’ If the amount of effort is high enough, that deters a lot of people,” said Sharma.

Organizations Turning to Additional Defenses

In addition to Constitution Classifiers, other defenses are available to help reduce jailbreaking. Priyanshu said alignment training is a popular defense. Alignment training is reinforcement learning using human feedback, where you tell the models what to not output and then train the model to refuse to make certain responses. However, this process became less effective once automated techniques were developed. Priyanshu said researchers are also working on introducing small classifiers in front of the models to act like a filter.

Shu said that to make progress as an industry, organizations need to share techniques and benchmarks with the research community.

“Our recent research highlights the value of open-source adversarial datasets and standardized safety evaluations to strengthen model robustness collectively,” said Shu. “Additionally, multi-layered defense strategies—combining rule-based filtering, knowledge editing, and reinforcement learning with human feedback—can significantly reduce the effectiveness of jailbreak attempts by making models more resilient to adversarial manipulation.”

Bates said that training and caution are the foundations of a solid jailbreaking defense, combined with a proxy server. “We don’t let our employees access the Internet freely in the company. We put a proxy server in place to say, ‘You can’t go to this domain. You can’t go to that domain because we don’t want to be held responsible.’ Same thing with any AI that you’re using from a corporate AI standpoint. You want to have your own proxies in place so that it’s giving and doing exactly what it’s supposed to do,” said Bates.

However, jailbreaking defenses also carry the risk of not answering a legitimately appropriate question. For example, students learning about nuclear physics may be blocked from obtaining information, depending on the search terms they use.

Future of Jailbreaking Defenses

With jailbreaking tactics and defenses evolving rapidly, there will likely be many changes in the near future. Shu said that in the near term, defenses will likely focus on existing vulnerabilities and refining prompt filtering mechanisms.

“As attackers develop more sophisticated techniques, these approaches may become less effective,” said Shu. “In the long term, there is a pressing need for scalable and generalizable safety mechanisms, such as knowledge-editing frameworks that allow real-time intervention without requiring full model retraining. These adaptive mechanisms could offer more robust protection against emerging jailbreak methods.”

Priyanshu said the biggest risk is organizations assuming their jailbreaking defenses are 100% effective.

“At the end of the day, this is very similar to how cybersecurity works, where there’s going to be a rat race between the red teams, or the people who are going to jailbreak, hack, or break these models, and the blue [defensive] teams that are working on the different strategies. Both are going to be outcompeting each other, but progressively,” said Priyanshu.

“The best strategy is being involved with what kind of malicious inputs can come to you or what threat actors are asking for,” he said.

Jennifer Goforth Gregory is a technology journalist who has covered B2B tech for over 20 years. In her spare time, she rescues homeless dachshunds.

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签