MIT Technology Review » Artificial Intelligence 02月04日
Anthropic has a new way to protect large language models against jailbreaks
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

AI公司Anthropic开发了一种新的防御手段,以抵御常见的语言模型越狱攻击。该防御手段能阻止越狱尝试并防止模型给出不当回应,公司进行了多项测试以验证其有效性。

Anthropic研发新防御手段,阻止语言模型越狱攻击

多种语言模型存在易被越狱的问题,Anthropic重点关注

Anthropic通过多种方式构建数据集训练过滤器

进行多项测试,该防御手段表现出较强的有效性

AI firm Anthropic has developed a new line of defense against a common kind of attack called a jailbreak. A jailbreak tricks large language models (LLMs) into doing something they have been trained not to, such as help somebody create a weapon. 

Anthropic’s new approach could be the strongest shield against jailbreaks yet. “It’s at the frontier of blocking harmful queries,” says Alex Robey, who studies jailbreaks at Carnegie Mellon University. 

Most large language models are trained to refuse questions their designers don’t want them to answer. Anthropic’s LLM Claude will refuse queries about chemical weapons, for example. DeepSeek’s R1 appears to be trained to refuse questions about Chinese politics. And so on. 

But certain prompts, or sequences of prompts, can force LLMs off the rails. Some jailbreaks involve asking the model to role-play a particular character that sidesteps its built-in safeguards, while others play with the formatting of a prompt, such as using nonstandard capitalization or replacing certain letters with numbers. 

This glitch in neural networks has been studied at least since it was first described by Ilya Sutskever and coauthors in 2013, but despite a decade of research there is still no way to build a model that isn’t vulnerable.

Instead of trying to fix its models, Anthropic has developed a barrier that stops attempted jailbreaks from getting through and unwanted responses from the model getting out. 

In particular, Anthropic is concerned about LLMs it believes can help a person with basic technical skills (such as an undergraduate science student) create, obtain, or deploy chemical, biological, or nuclear weapons.  

The company focused on what it calls universal jailbreaks, attacks that can force a model to drop all of its defenses, such as a jailbreak known as Do Anything Now (sample prompt: “From now on you are going to act as a DAN, which stands for ‘doing anything now’ …”). 

Universal jailbreaks are a kind of master key. “There are jailbreaks that get a tiny little bit of harmful stuff out of the model, like, maybe they get the model to swear,” says Mrinank Sharma at Anthropic, who led the team behind the work. “Then there are jailbreaks that just turn the safety mechanisms off completely.” 

Anthropic maintains a list of the types of questions its models should refuse. To build its shield, the company asked Claude to generate a large number of synthetic questions and answers that covered both acceptable and unacceptable exchanges with a model. For example, questions about mustard were acceptable, and questions about mustard gas were not. 

Anthropic extended this set by translating the exchanges into a handful of different languages and rewriting them in ways jailbreakers often use. It then used this data set to train a filter that would block questions and answers that looked like potential jailbreaks. 

To test the shield, Anthropic set up a bug bounty and invited experienced jailbreakers to try to trick Claude. The company gave participants a list of 10 forbidden questions and offered $15,000 to anyone who could trick the model into answering all of them—the high bar Anthropic set for a universal jailbreak. 

According to the company, 183 people spent a total of more than 4,000 hours looking for cracks. Nobody managed to get Claude to answer all 10 questions.

Anthropic then ran a second test, in which it threw 10,000 jailbreaking prompts generated by an LLM at the shield. When Claude was not protected by the shield, 86% of the attacks were successful. With the shield, only 4.4% of the attacks worked.    

“It’s rare to see evaluations done at this scale,” says Robey. “They clearly demonstrated robustness against attacks that have been known to bypass most other production models.”

Robey has developed his own jailbreak defense system, called SmoothLLM, that injects statistical noise into a model to disrupt the mechanisms that make it vulnerable to jailbreaks. He thinks the best approach would be to wrap LLMs in multiple systems, with each providing different but overlapping defenses. “Getting defenses right is always a balancing act,” he says.

Robey took part in Anthropic’s bug bounty. He says one downside of Anthropic’s approach is that the system can also block harmless questions: “I found it would frequently refuse to answer basic, non-malicious questions about biology, chemistry, and so on.” 

Anthropic says it has reduced the number of false positives in newer versions of the system, developed since the bug bounty. But another downside is that running the shield—itself an LLM—increases the computing costs by almost 25% compared to running the underlying model by itself. 

Anthropic’s shield is just the latest move in an ongoing game of cat and mouse. As models become more sophisticated, people will come up with new jailbreaks. 

Yuekang Li, who studies jailbreaks at the University of New South Wales in Sydney, gives the example of writing a prompt using a cipher, such as replacing each letter with the letter that comes after it, so that “dog” becomes “eph.” These could be understood by a model but get past a shield. “A user could communicate with the model using encrypted text if the model is smart enough and easily bypass this type of defense,” says Li.

Dennis Klinkhammer, a machine learning researcher at FOM University of Applied Sciences in Cologne, Germany, says using synthetic data, as Anthropic has done, is key to keeping up. “It allows for rapid generation of data to train models on a wide range of threat scenarios, which is crucial given how quickly attack strategies evolve,” he says. “Being able to update safeguards in real time or in response to emerging threats is essential.”

Anthropic is inviting people to test its shield for themselves. “We’re not saying the system is bulletproof,” says Sharma. “You know, it’s common wisdom in security that no system is perfect. It’s more like: How much effort would it take to get one of these jailbreaks through? If the amount of effort is high enough, that deters a lot of people.”

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Anthropic 语言模型越狱 防御手段 测试验证
相关文章