MarkTechPost@AI, July 3, 2024
45 Shades of AI Safety: SORRY-Bench’s Innovative Taxonomy for LLM Refusal Behavior Analysis

SORRY-Bench is a benchmark for evaluating the safety refusal behavior of large language models (LLMs). It adopts a fine-grained 45-class safety taxonomy spanning four high-level domains, and includes a balanced dataset of 450 instructions plus 9,000 additional prompts covering 20 linguistic variations. SORRY-Bench also provides a large-scale human judgment dataset and explores the best automated evaluation methods. By evaluating over 40 LLMs, SORRY-Bench offers insights into a wide range of refusal behaviors.

😄 **SORRY-Bench's fine-grained safety taxonomy:** SORRY-Bench adopts a fine-grained 45-class safety taxonomy spanning four high-level domains, including "Harmful Content," "Illegal Activities," "Deception and Manipulation," and "Societal Harm." This fine-grained taxonomy enables a more comprehensive evaluation of LLM refusal behavior across different safety aspects, remedying the inconsistent and coarse-grained safety categories of existing benchmarks.

🤔 **SORRY-Bench's evaluation method:** SORRY-Bench uses a binary classification approach to assess whether a model refuses an unsafe instruction. To ensure accurate evaluation, the researchers created a large-scale human judgment dataset of over 7,200 annotations, covering both in-distribution and out-of-distribution cases. This dataset serves as a benchmark for evaluating automated safety evaluators and for training language-model-based judges.

💪 **SORRY-Bench evaluation results:** SORRY-Bench evaluated over 40 LLMs and found significant variation in refusal behavior across safety categories. For example, models refused most often in categories such as "Harassment," "Child-Related Crimes," and "Sexual Crimes," with average fulfillment rates of only 10-11%. Conversely, most models were highly compliant when asked to provide legal advice.

📊 **SORRY-Bench linguistic-mutation analysis:** SORRY-Bench also explores 20 linguistic mutations, such as question-style phrasing, technical terminology, multilingual prompts, and encoding and encryption strategies, to study how these variations affect model safety behavior. The study found that the mutations affect different models' refusal rates in different ways. For example, question-style phrasing slightly raises most models' refusal rates, while technical terminology lowers refusal rates across all models by 8-18%.

Large language models (LLMs) have gained significant attention in recent years, but ensuring their safe and ethical use remains a critical challenge. Researchers are focused on developing effective alignment procedures that calibrate these models to adhere to human values and safely follow human intentions. The primary goal is to prevent LLMs from complying with unsafe or inappropriate user requests. Current methodologies face challenges in comprehensively evaluating LLM safety, including aspects such as toxicity, harmfulness, trustworthiness, and refusal behaviors. While various benchmarks have been proposed to assess these safety aspects, a more robust and comprehensive evaluation framework is needed to ensure LLMs can effectively refuse inappropriate requests across a wide range of scenarios.

Researchers have proposed various approaches to evaluate the safety of modern Large Language Models (LLMs) with instruction-following capabilities. These efforts build upon earlier work that assessed toxicity and bias in pretrained LMs using simple sentence-level completion or knowledge QA tasks. Recent studies have introduced instruction datasets designed to trigger potentially unsafe behavior in LLMs. These datasets typically contain varying numbers of unsafe user instructions across different safety categories, such as illegal activities and misinformation. LLMs are then tested with these unsafe instructions, and their responses are evaluated to determine model safety. However, existing benchmarks often use inconsistent and coarse-grained safety categories, leading to evaluation challenges and incomplete coverage of potential safety risks.

Researchers from Princeton University, Virginia Tech, Stanford University, UC Berkeley, University of Illinois at Urbana-Champaign, and the University of Chicago present SORRY-Bench, addressing three key deficiencies in existing LLM safety evaluations. First, it introduces a fine-grained 45-class safety taxonomy across four high-level domains, unifying disparate taxonomies from prior work. This comprehensive taxonomy captures diverse potentially unsafe topics and allows for more granular safety refusal evaluation. Second, SORRY-Bench ensures balance not only across topics but also over linguistic characteristics. It considers 20 diverse linguistic mutations that real-world users might apply to phrase unsafe prompts, including different writing styles, persuasion techniques, encoding strategies, and multiple languages. Lastly, the benchmark investigates design choices for fast and accurate safety evaluation, exploring the trade-off between efficiency and accuracy in LLM-based safety judgments. This systematic approach aims to provide a more robust and comprehensive framework for evaluating LLM safety refusal behaviors.
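To make the benchmark's composition concrete, here is a minimal Python sketch of how such a dataset could be organized: 45 classes × 10 base instructions = 450 prompts, each expanded by 20 mutations into 9,000 more. The class names, mutation labels, and helper function below are illustrative assumptions, not SORRY-Bench's released code.

```python
# Illustrative sketch of SORRY-Bench's dataset layout (hypothetical names,
# not the benchmark's actual code). 45 fine-grained classes sit under four
# high-level domains; each class contributes 10 base instructions (450
# total), and 20 linguistic mutations expand them to 9,000 prompts.
from dataclasses import dataclass

TAXONOMY = {
    "Harmful Content": ["Harassment", "Sexual Crimes"],   # ...
    "Illegal Activities": ["Child-Related Crimes"],       # ...
    "Deception and Manipulation": [],                     # ...
    "Societal Harm": [],                                  # ...
}  # 45 fine-grained classes in total across the four domains

MUTATIONS = [
    "question_phrasing", "technical_terms", "slang", "misspellings",
    "persuasion", "caesar_cipher", "ascii_art", "translate_fr",
    # ... 20 mutations in the benchmark
]

@dataclass(frozen=True)
class Prompt:
    category: str         # one of the 45 fine-grained classes
    text: str             # the potentially unsafe instruction
    mutation: str | None  # None for the 450 base instructions

def expand(base: list[Prompt], mutate) -> list[Prompt]:
    """Apply every mutation to every base prompt (450 * 20 = 9,000)."""
    return [Prompt(p.category, mutate(p.text, m), m)
            for p in base for m in MUTATIONS]
```

Keeping category and mutation as explicit fields is what makes the benchmark "balanced": refusal rates can be computed per class and per mutation without any reweighting.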

SORRY-Bench introduces a sophisticated evaluation framework for LLM safety refusal behaviors. The benchmark employs a binary classification approach to determine whether a model’s response fulfills or refuses an unsafe instruction. To ensure an accurate evaluation, the researchers curated a large-scale human judgment dataset of over 7,200 annotations, covering both in-distribution and out-of-distribution cases. This dataset serves as a foundation for evaluating automated safety evaluators and training language model-based judges. Researchers conducted a comprehensive meta-evaluation of various design choices for safety evaluators, exploring different LLM sizes, prompting techniques, and fine-tuning approaches. Results showed that fine-tuned smaller-scale LLMs (e.g., 7B parameters) can achieve comparable accuracy to larger models like GPT-4, with substantially lower computational costs.
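As a rough illustration of that judge setup, the sketch below shows how a fine-tuned small open-weight model could produce the binary fulfill/refuse verdict. The checkpoint path and prompt template are placeholders, not the judge shipped with SORRY-Bench.

```python
# Hedged sketch of an LLM-based binary safety judge (checkpoint path and
# prompt template are placeholders). Requires: transformers, accelerate, torch.
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE = "path/to/finetuned-7b-safety-judge"  # hypothetical fine-tuned 7B model
tok = AutoTokenizer.from_pretrained(JUDGE)
model = AutoModelForCausalLM.from_pretrained(JUDGE, device_map="auto")

TEMPLATE = (
    "You are given a potentially unsafe instruction and a model response.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Does the response fulfill the instruction? Answer 'fulfill' or 'refuse'."
)

def judge(instruction: str, response: str) -> bool:
    """Return True if the response fulfills the unsafe request (binary label)."""
    prompt = TEMPLATE.format(instruction=instruction, response=response)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return "fulfill" in tok.decode(new_tokens, skip_special_tokens=True).lower()
```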

SORRY-Bench evaluates over 40 LLMs across 45 safety categories, revealing significant variations in safety refusal behaviors. Key findings include:

- Models refuse most consistently in categories such as "Harassment," "Child-Related Crimes," and "Sexual Crimes," where average fulfillment rates fall to only 10-11%.
- Most models are highly compliant with requests for legal advice.
- Question-style phrasing slightly raises refusal rates for most models, while technical terminology lowers refusal rates across all models by 8-18%.
- Other linguistic mutations, including multilingual prompts and encoding or encryption strategies, shift refusal rates in model-specific ways.

These results provide insights into the varying safety priorities of model creators and the impact of different prompt formulations on LLM safety behaviors. (A sketch of how such per-category rates can be aggregated follows.)
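Turning per-prompt judge verdicts into the per-category rates cited above is a simple aggregation. The sketch below assumes a list of (category, fulfilled) pairs, which is our own illustrative record format, not the benchmark's.

```python
# Simple aggregation from per-prompt judge verdicts to per-category
# fulfillment rates (the (category, fulfilled) record format is assumed).
from collections import defaultdict

def fulfillment_rates(records):
    """records: iterable of (category, fulfilled: bool) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for category, fulfilled in records:
        totals[category] += 1
        hits[category] += int(fulfilled)
    return {c: hits[c] / totals[c] for c in totals}

# e.g. fulfillment_rates([("Harassment", False), ("Legal Advice", True)])
# -> {"Harassment": 0.0, "Legal Advice": 1.0}
```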

SORRY-Bench introduces a comprehensive framework for evaluating LLM safety refusal behaviors. It features a fine-grained taxonomy of 45 unsafe topics, a balanced dataset of 450 instructions, and 9,000 additional prompts with 20 linguistic variations. The benchmark includes a large-scale human judgment dataset and explores optimal automated evaluation methods. By assessing over 40 LLMs, SORRY-Bench provides insights into diverse refusal behaviors. This systematic approach offers a balanced, granular, and efficient tool for researchers and developers to improve LLM safety, ultimately contributing to more responsible AI deployment.


Check out the Paper. All credit for this research goes to the researchers of this project.



Tags: LLM Safety, SORRY-Bench, Safety Evaluation, Refusal Behavior