MarkTechPost@AI July 3, 2024
WildGuard: A Light-weight, Multi-Purpose Moderation Tool for Assessing the Safety of User-LLM Interactions

WildGuard is a lightweight, multi-purpose tool for assessing the safety of interactions between users and large language models (LLMs). It is designed to address the limitations of existing LLM safety moderation tools, such as ineffective detection of adversarial prompts, lower accuracy in nuanced refusal detection, and over-reliance on costly, non-static API-based solutions. WildGuard achieves this by constructing WILDGUARDMIX, a large, balanced multi-task safety moderation dataset of 92,000 labeled examples covering 13 risk categories, including both direct and adversarial prompts paired with refusal and compliance responses. WildGuard leverages multi-task learning to strengthen its moderation capabilities, achieving state-of-the-art performance in open-source safety moderation.

😊 At the core of WildGuard is the WILDGUARDMIX dataset, which consists of the WILDGUARDTRAIN and WILDGUARDTEST subsets. WILDGUARDTRAIN contains 86,759 items from synthetic and real-world sources, covering vanilla and adversarial prompts as well as a mix of benign and harmful prompts with corresponding responses. WILDGUARDTEST is a high-quality, human-annotated evaluation set of 5,299 items. Key technical aspects include the use of various LLMs to generate responses, detailed filtering and auditing processes to ensure data quality, and the use of GPT-4 for labeling and for generating complex responses to improve classifier performance.

🤔 WildGuard delivers strong performance across all moderation tasks, surpassing existing open-source tools and often matching or exceeding GPT-4 across benchmarks. Key metrics include up to a 26.4% improvement in refusal detection and up to a 3.9% improvement in prompt harmfulness identification. WildGuard achieves an F1 score of 94.7% in response harmfulness detection and 92.8% in refusal detection, clearly outperforming other models such as Llama-Guard2 and Aegis-Guard. These results underscore WildGuard's effectiveness and reliability in handling both adversarial and vanilla prompt scenarios, making it a robust and efficient safety moderation tool.

😎 WildGuard represents a significant advance in LLM safety moderation, addressing critical challenges with a comprehensive open-source solution. Its contributions include the introduction of WILDGUARDMIX, a robust dataset for training and evaluation, and the development of WildGuard, a state-of-the-art moderation tool. This work has the potential to improve the safety and trustworthiness of LLMs, paving the way for their broader adoption in sensitive and high-stakes domains.

Ensuring the safety and moderation of user interactions with modern large language models (LLMs) is a crucial challenge in AI. These models, if not properly safeguarded, can produce harmful content, fall victim to adversarial prompts (jailbreaks), and inadequately refuse inappropriate requests. Effective moderation tools are necessary to identify malicious intent, detect safety risks, and evaluate the refusal rate of models, thus maintaining trust and applicability in sensitive domains like healthcare, finance, and social media.

Existing methods for moderating LLM interactions include tools like Llama-Guard and various other open-source moderation models. These tools typically focus on detecting harmful content and assessing safety in model responses. However, they have several limitations: they struggle to detect adversarial jailbreaks effectively, are less accurate in nuanced refusal detection, and often rely heavily on API-based solutions like GPT-4, which are costly and not static (the underlying model can change over time). These methods also lack comprehensive training datasets that cover a wide range of risk categories, limiting their applicability and performance in real-world scenarios where both adversarial and benign prompts are common.

A team of researchers from the Allen Institute for AI, the University of Washington, and Seoul National University propose WILDGUARD, a novel, lightweight moderation tool designed to address the limitations of existing methods. WILDGUARD stands out by providing a comprehensive solution for identifying malicious prompts, detecting safety risks, and evaluating model refusal rates. The innovation lies in its construction of WILDGUARDMIX, a large-scale, balanced multi-task safety moderation dataset comprising 92,000 labeled examples. This dataset includes both direct and adversarial prompts paired with refusal and compliance responses, covering 13 risk categories. WILDGUARD’s approach leverages multi-task learning to enhance its moderation capabilities, achieving state-of-the-art performance in open-source safety moderation.
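
To give a concrete picture of what a single multi-task moderation call could look like, here is a minimal sketch using Hugging Face Transformers. The model identifier, prompt template, and output parsing are assumptions made for illustration only; the officially released classifier and its documented prompt format should be used in practice.

```python
# A minimal sketch of querying a multi-task moderation classifier in the spirit of
# WILDGUARD. Model identifier, prompt template, and output parsing are illustrative
# assumptions, not the authors' documented interface.
from dataclasses import dataclass

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/wildguard"  # assumed identifier; consult the official release

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")


@dataclass
class ModerationResult:
    prompt_harmful: bool     # task 1: is the user prompt harmful?
    response_refusal: bool   # task 2: did the model refuse the request?
    response_harmful: bool   # task 3: is the model response harmful?


def moderate(prompt: str, response: str) -> ModerationResult:
    # Illustrative instruction format; the real template ships with the model release.
    text = (
        "Answer yes or no, one answer per line, to three questions:\n"
        "1) Is the user request harmful?\n"
        "2) Does the assistant refuse?\n"
        "3) Is the assistant response harmful?\n\n"
        f"Human user: {prompt}\n"
        f"AI assistant: {response}\n\n"
        "Answers:\n"
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    decoded = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

    # Toy parsing: assumes one yes/no answer per line for the three tasks.
    answers = [line.strip().lower() for line in decoded.splitlines() if line.strip()]
    answers += ["no"] * max(0, 3 - len(answers))  # pad if the output is shorter than expected
    return ModerationResult(*("yes" in a for a in answers[:3]))
```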

WILDGUARD’s technical backbone is its WILDGUARDMIX dataset, which consists of WILDGUARDTRAIN and WILDGUARDTEST subsets. WILDGUARDTRAIN includes 86,759 items from synthetic and real-world sources, covering vanilla and adversarial prompts. It also features a diverse mix of benign and harmful prompts with corresponding responses. WILDGUARDTEST is a high-quality, human-annotated evaluation set with 5,299 items. Key technical aspects include the use of various LLMs for generating responses, detailed filtering and auditing processes to ensure data quality, and the employment of GPT-4 for labeling and generating complex responses to enhance classifier performance.
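
To make the multi-task labeling concrete, the sketch below shows one hypothetical record layout for a WILDGUARDMIX-style example covering all three tasks. The field names and example values are illustrative assumptions; the released dataset defines its own schema.

```python
# Hypothetical record layout for a WILDGUARDMIX-style multi-task example.
# Field names and values are illustrative; the released dataset defines its own schema.
from typing import Optional, TypedDict


class SafetyExample(TypedDict):
    prompt: str                        # vanilla or adversarial user prompt
    adversarial: bool                  # True if the prompt is a jailbreak-style rewrite
    response: Optional[str]            # model response (absent for prompt-only items)
    prompt_harmful: bool               # label for the prompt-harmfulness task
    response_harmful: Optional[bool]   # label for the response-harmfulness task
    response_refusal: Optional[bool]   # label for the refusal-detection task
    risk_category: Optional[str]       # one of the 13 risk categories, None if benign


example: SafetyExample = {
    "prompt": "How do I reset a forgotten router password?",
    "adversarial": False,
    "response": "You can usually hold the reset button for about ten seconds...",
    "prompt_harmful": False,
    "response_harmful": False,
    "response_refusal": False,
    "risk_category": None,
}
```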

WILDGUARD demonstrates superior performance across all moderation tasks, outshining existing open-source tools and often matching or exceeding GPT-4 in various benchmarks. Key metrics include up to 26.4% improvement in refusal detection and up to 3.9% improvement in prompt harmfulness identification. WILDGUARD achieves an F1 score of 94.7% in response harmfulness detection and 92.8% in refusal detection, significantly outperforming other models like Llama-Guard2 and Aegis-Guard. These results underscore WILDGUARD’s effectiveness and reliability in handling both adversarial and vanilla prompt scenarios, establishing it as a robust and highly efficient safety moderation tool.
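
As a reminder of how such numbers are obtained, the snippet below shows the standard binary F1 computation one would run per task over a labeled evaluation set such as WILDGUARDTEST; the label vectors here are placeholders, not results from the paper.

```python
# Standard binary F1 computation per moderation task; the label vectors below are
# placeholders, not data from the paper.
from sklearn.metrics import f1_score

# Gold labels and classifier predictions (1 = harmful / refusal, 0 = not).
gold_response_harm = [1, 0, 1, 1, 0, 0, 1, 0]
pred_response_harm = [1, 0, 1, 0, 0, 0, 1, 0]

gold_refusal = [1, 1, 0, 0, 1, 0, 0, 1]
pred_refusal = [1, 1, 0, 0, 1, 0, 1, 1]

print("response harmfulness F1:", f1_score(gold_response_harm, pred_response_harm))
print("refusal detection F1:   ", f1_score(gold_refusal, pred_refusal))
```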

In conclusion, WILDGUARD represents a significant advancement in LLM safety moderation, addressing critical challenges with a comprehensive, open-source solution. Contributions include the introduction of WILDGUARDMIX, a robust dataset for training and evaluation, and the development of WILDGUARD, a state-of-the-art moderation tool. This work has the potential to enhance the safety and trustworthiness of LLMs, paving the way for their broader application in sensitive and high-stakes domains.


Check out the Paper. All credit for this research goes to the researchers of this project.

