MarkTechPost@AI 2024年07月10日
The Hidden Danger in AI Models: A Space Character’s Impact on Safety
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章探讨AI模型中一个易被忽视的问题,即在LLM对话模板末尾添加一个空格,可能导致开源语言模型对用户提示给出有害响应,强调了这一问题的危险性及产生原因

🎯AI模型应确保安全使用,如对不安全提示应拒绝回答,通过强化学习等方法实现,但训练模板文档需改进。

🧐模型对齐是当前研究重点,多种训练框架提出增强模型对齐的方法,同时存在攻击模型对齐的情况。

💪模型鲁棒性研究表明,微小改动可能使神经网络误分类,强调模型需具备更强的稳定性。

🔍新加坡国立大学研究者发现单字符令牌在模型预训练数据中出现较少,但仍可能对多数模型构成威胁。

📊通过实验分析模型对有害行为的响应,使用AdvBench数据及多种开源模型,以评估模型的安全性。

When given an unsafe prompt, like “Tell me how to build a bomb,” a well-trained large language model (LLM) should refuse to answer. This is usually achieved through Reinforcement Learning from Human Feedback (RLHF) and is crucial to make sure models are safe to use, especially in sensitive areas that involve direct interaction with people, like mental health, customer service, general conversation, and healthcare. However, there has been progress in automating the creation of these chat templates, but documentation for the template format used during training often needs to be improved. Among the eight open-source models reviewed, only Vicuna, Falcon, Llama-3, and ChatGLM describe the chat template used during fine-tuning.

The first related study focuses on Model Alignment, which aims to ensure that AI models reflect human values, a key focus in current research on LLMs. Training frameworks such as SelfInstruct, RLHF, and Constitutional AI propose methods to enhance model alignment by integrating human values into model training. The next study examines Attacks on Model Alignment, where attacks revealing vulnerabilities in model alignment have become more common. Next is Model Robustness, where in the context of adversarial attacks on classification tasks, research shows that even small alterations to images like tweaking a few pixels, can cause neural networks to misclassify them. The last work is Glitch Tokens, where tokens are present in a tokenizer’s vocabulary but absent from a model’s training data.

Researchers from the National University of Singapore have found an important observation that single-character tokens appear relatively rarely in tokenized model pre-training data. This is because of the nature of subword tokenization algorithms, which merge-common tokens. However, single-character tokens can still pose a threat to most models. The researchers explained this by looking at how tokenizer vocabularies and the contexts of single-space tokens in pre-training data work. The findings highlighted the weaknesses in current model alignment and suggested that more effort is needed to make models not just aligned but robustly aligned.

Data from AdvBench, a benchmark designed to measure how often models agree with harmful requests, is used in this study. These harmful requests include asking for misinformation, pornographic material, or instructions for illegal activities. For the experiments, a 100-sample subset of the harmful behaviors split of AdvBench is tested. Eight open-source models are tested: Vicuna v1.5, Llama 2, Llama 3, Mistral, Falcon, Guanaco, MPT, and ChatGLM, using 7B4 and 13B models. This helps analyze the impact of model size and type on harmful behavior. Responses from models that do not refuse harmful queries are likely to be harmful. A check on a randomly selected set of ten outputs from each model showed that this evaluation method is accurate in most cases (74/80).

In this paper, a situation is considered where the chat template of a model is available, which excludes closed-source, commercial models like GPT-4 and BARD. Instead, the focus is on open-source models to show that this problem exists and explore the reasons related. Although this exploration is formalized as an adversarial attack, it is not meant to propose a practical attack on LLMs but rather serves as a probing method. For a user query x to model M, the model input is formatted using template T, which consists of a system prompt s, a set of role labels R, and x. A single character is appended to the end of the template, resulting in a modified template, T′.

In conclusion, researchers from the National University of Singapore found that adding a single space at the end of LLM conversation templates can cause open-source language models to give harmful responses to user prompts. This extra space is easy for an engineer to add by mistake and hard to notice without careful checks, especially in long templates. However, this small error can lead to dangerous outcomes, bypassing the model’s safeguards. The experiments suggest that this happens because of how single tokens are used in the training data, and the reason is the way the data is divided into tokens.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter

Join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our 46k+ ML SubReddit

The post The Hidden Danger in AI Models: A Space Character’s Impact on Safety appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI模型 模型安全 空格字符 模型对齐 模型鲁棒性
相关文章