MarkTechPost@AI July 15, 2024
Samsung Researchers Introduce LoRA-Guard: A Parameter-Efficient Guardrail Adaptation Method that Relies on Knowledge Sharing between LLMs and Guardrail Models

Samsung Research has introduced LoRA-Guard, a system that efficiently combines a chat model with a safety guard model to address safety issues in large language models (LLMs). By attaching low-rank adapters to the chat model's transformer backbone, it folds the guardrail function into the chat model itself, reducing parameter overhead and improving efficiency. The system has been evaluated on a range of datasets and shows strong performance.

🤗 LoRA-Guard integrates the guard model with the chat model rather than running them as separate systems: low-rank adapters on the shared transformer backbone provide content moderation at a small fraction of the cost of a standalone guard model.

🤖 LoRA-Guard uses a dual-path design that switches seamlessly between chat and guardrail functions. By activating or deactivating the LoRA adapters and switching output heads, the system can perform either task without degrading performance. The parameter-sharing scheme keeps computational overhead low: the guard path typically adds only a small fraction (often around 1/1000) of the original model's parameters.

🧠 LoRA-Guard is trained by supervised fine-tuning of f′ and h_guard on labeled datasets while the chat model's parameters stay frozen. This approach exploits the chat model's existing knowledge while efficiently learning to detect harmful content.

📊 LoRA-Guard performs strongly across multiple datasets. On ToxicChat it outperforms the baselines in AUPRC while using far fewer parameters, up to 1,500 times fewer than fully fine-tuned models. On OpenAIModEval it matches alternative methods with roughly 100 times fewer parameters. Cross-domain evaluation reveals an interesting asymmetry: models trained on ToxicChat generalize well to OpenAIModEval, but the reverse direction shows a sharp performance drop. This asymmetry may stem from differences in dataset characteristics or from the jailbreak samples present in ToxicChat. Overall, LoRA-Guard proves to be an effective solution for content moderation in language models.

🚀 LoRA-Guard is a significant step forward for safe conversational systems, cutting the guardrail parameter overhead by 100-1000x while maintaining or improving performance. This efficiency comes from knowledge sharing and parameter-efficient learning. Its dual-path design avoids the catastrophic forgetting during fine-tuning that affects other approaches. By sharply reducing training time, inference time, and memory requirements, LoRA-Guard becomes a key development for robust content moderation in resource-constrained environments. As on-device LLMs become more widespread, LoRA-Guard paves the way for broader deployment and safer AI interactions on devices.

Large Language Models (LLMs) have demonstrated remarkable proficiency in language generation tasks. However, their training process, which involves unsupervised learning from extensive datasets followed by supervised fine-tuning, presents significant challenges. The primary concern stems from the nature of pre-training datasets, such as Common Crawl, which often contain undesirable content. Consequently, LLMs inadvertently acquire the ability to generate offensive language and potentially harmful advice. This unintended capability poses a serious safety risk, as these models can produce coherent responses to user inputs without proper content filtering. The challenge for researchers lies in developing methods to maintain the LLMs’ language generation capabilities while effectively mitigating the production of unsafe or unethical content.

Existing attempts to overcome the safety concerns in LLMs have primarily focused on two approaches: safety tuning and the implementation of guardrails. Safety tuning aims to optimize models to respond in a manner aligned with human values and safety considerations. However, these chat models remain vulnerable to jailbreak attacks, which employ various strategies to circumvent safety measures. These strategies include using low-resource languages, refusal suppression, privilege escalation, and distractions.

To counter these vulnerabilities, researchers have developed guardrails to monitor exchanges between chat models and users. One notable approach involves the use of model-based guardrails, which are separate from the chat models themselves. These guard models are designed to flag harmful content and serve as a critical component of AI safety stacks in deployed systems.

However, current methods face significant challenges. Running a separate guard model introduces substantial computational overhead, making the approach impractical in low-resource settings. The learning process is also inefficient: chat models and guard models must each build largely the same language-understanding capabilities to perform their respective tasks of response generation and content moderation, so much of the guard model's capacity duplicates what the chat model already knows.

Samsung R&D Institute researchers present LoRA-Guard, an innovative system that integrates chat and guard models, addressing efficiency issues in LLM safety. It uses a low-rank adapter on a chat model’s transformer backbone to detect harmful content. The system operates in dual modes: activating LoRA parameters for guardrailing with a classification head, and deactivating them for normal chat functions. This approach significantly reduces parameter overhead by 100-1000x compared to previous methods, making deployment feasible in resource-constrained settings. LoRA-Guard has been evaluated on various datasets, including zero-shot scenarios, and its model weights have been published to support further research.
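
To make the adapter mechanism concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer that can be switched on and off. The class name LoRALinear and the hyperparameters r and alpha are illustrative choices for this sketch, not the authors' implementation.

```python
# A minimal sketch of the low-rank adapter idea: a frozen base projection W
# plus a trainable low-rank update (alpha/r) * B A that can be toggled.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # chat weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r
        self.enabled = False                            # off => original chat path f

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.base(x)                              # W x  (original feature map f)
        if self.enabled:                                # f' = f + low-rank correction
            out = out + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
        return out
```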

LoRA-Guard’s architecture is designed to efficiently integrate guarding capabilities into a chat model. It uses the same embedding and tokenizer for both the chat model C and the guard model G. The key innovation lies in the feature map: while C uses the original feature map f, G employs f′, obtained by attaching LoRA adapters to f. G also utilizes a separate output head h_guard for classification into harmfulness categories.
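
The layout described above can be sketched as a single module with a shared backbone and two output heads. GuardedChatModel, num_harm_categories, and the last-token pooling below are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative dual-head layout: one shared transformer backbone, a language-
# modelling head for chat, and a separate classification head h_guard.
import torch
import torch.nn as nn


class GuardedChatModel(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_size: int,
                 vocab_size: int, num_harm_categories: int):
        super().__init__()
        self.backbone = backbone                        # transformer whose linear layers are LoRALinear
        self.lm_head = nn.Linear(hidden_size, vocab_size)               # chat output head
        self.guard_head = nn.Linear(hidden_size, num_harm_categories)   # h_guard

    def forward(self, input_ids: torch.Tensor, guard_mode: bool = False):
        hidden = self.backbone(input_ids)               # assumed shape: (batch, seq, hidden)
        if guard_mode:
            # guard path G: pooled last-token features f'(x) -> harmfulness logits
            return self.guard_head(hidden[:, -1, :])
        # chat path C: per-token features f(x) -> next-token logits
        return self.lm_head(hidden)
```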

This dual-path design allows for seamless switching between chat and guard functions. By activating or deactivating LoRA adapters and switching between output heads, the system can perform either task without performance degradation. The parameter sharing between paths significantly reduces the computational overhead, with the guard model typically adding only a fraction (often 1/1000th) of the original model’s parameters.
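
A sketch of how this switching and the parameter bookkeeping might look in practice, reusing the hypothetical LoRALinear and GuardedChatModel sketches above:

```python
# Toggle the two paths and estimate the guard-specific parameter overhead.
def set_guard_mode(model, enabled: bool):
    """Turn all LoRA adapters on (guard path f') or off (chat path f)."""
    for module in model.modules():
        if isinstance(module, LoRALinear):
            module.enabled = enabled


def overhead_ratio(model) -> float:
    """Guard-specific parameters (LoRA matrices + h_guard) relative to the full model."""
    guard_params = sum(p.numel() for n, p in model.named_parameters()
                       if "lora_" in n or "guard_head" in n)
    total = sum(p.numel() for p in model.parameters())
    return guard_params / total


# Usage: set_guard_mode(model, True) before model(input_ids, guard_mode=True),
# and set_guard_mode(model, False) before normal chat generation.
```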

LoRA-Guard is trained through supervised fine-tuning of f′ and h_guard on labeled datasets, keeping the chat model’s parameters frozen. This approach utilizes the chat model’s existing knowledge while learning to detect harmful content efficiently.
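
A hedged sketch of this training step, reusing the hypothetical helpers from the previous sketches; a train_loader yielding (input_ids, harm_labels) batches is assumed.

```python
# Supervised fine-tuning sketch: only the LoRA matrices (f') and the guard
# head h_guard receive gradients; all original chat parameters stay frozen.
import torch
import torch.nn as nn


def train_guard(model, train_loader, epochs: int = 1, lr: float = 1e-4):
    # Freeze everything, then re-enable only the guard-specific parameters.
    for name, param in model.named_parameters():
        param.requires_grad = ("lora_" in name) or ("guard_head" in name)

    set_guard_mode(model, True)                         # use the adapted feature map f'
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for input_ids, harm_labels in train_loader:
            logits = model(input_ids, guard_mode=True)  # h_guard(f'(x))
            loss = loss_fn(logits, harm_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```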

LoRA-Guard demonstrates exceptional performance on multiple datasets. On ToxicChat, it outperforms baselines in AUPRC while using significantly fewer parameters, up to 1,500 times fewer than fully fine-tuned models. On OpenAIModEval, it matches alternative methods with 100 times fewer parameters. Cross-domain evaluations reveal an interesting asymmetry: models trained on ToxicChat generalize well to OpenAIModEval, but the reverse direction shows considerable performance drops. This asymmetry might be due to differences in dataset characteristics or the presence of jailbreak samples in ToxicChat. Overall, LoRA-Guard proves to be an efficient and effective solution for content moderation in language models.
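
For reference, AUPRC (area under the precision-recall curve) for a binary guard classifier can be computed with scikit-learn's average_precision_score. The sketch below is illustrative, reusing the hypothetical model and loader names from above; it is not the paper's evaluation code.

```python
# Evaluate a guard classifier with AUPRC on held-out (input_ids, harm_labels) batches.
import torch
from sklearn.metrics import average_precision_score


@torch.no_grad()
def auprc(model, eval_loader) -> float:
    scores, labels = [], []
    for input_ids, harm_labels in eval_loader:
        logits = model(input_ids, guard_mode=True)
        # probability of the "harmful" class (assumed to be index 1)
        scores.extend(torch.softmax(logits, dim=-1)[:, 1].tolist())
        labels.extend(harm_labels.tolist())
    return average_precision_score(labels, scores)
```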

LoRA-Guard represents a significant leap in moderated conversational systems, reducing guardrailing parameter overhead by 100-1000 times while maintaining or improving performance. This efficiency is achieved through knowledge sharing and parameter-efficient learning mechanisms. Its dual-path design prevents catastrophic forgetting during fine-tuning, a common issue in other approaches. By dramatically reducing training time, inference time, and memory requirements, LoRA-Guard emerges as a crucial development for implementing robust content moderation in resource-constrained environments. As on-device LLMs become more prevalent, LoRA-Guard paves the way for safer AI interactions across a broader range of applications and devices.


Check out the paper. All credit for this research goes to the researchers of this project.
