MarkTechPost@AI 04月03日 09:35
Salesforce AI Introduces BingoGuard: An LLM-based Moderation System Designed to Predict both Binary Safety Labels and Severity Levels

Salesforce AI has introduced BingoGuard, a moderation system built on large language models (LLMs) that addresses the shortcomings of traditional binary classification. BingoGuard predicts not only whether content is safe but also how harmful it is, using eleven specific categories and five severity levels to enable finer-grained content management. The system builds its training dataset with a "generate-then-filter" approach and fine-tunes dedicated LLMs, achieving strong accuracy and flexibility in content moderation, particularly in identifying lower-severity content. BingoGuard gives platforms a more precise and sensitive moderation framework, reducing the risks of both over- and under-moderation.

🛡️ BingoGuard addresses the limitations of traditional moderation systems, which typically rely on binary classification and cannot distinguish degrees of harmful content, leading to either over-restriction or under-filtering.

🎯 BingoGuard uses a structured taxonomy that divides harmful content into eleven specific categories, such as violent crime and sexual content, each subdivided into five severity levels (0 through 4).

⚙️ BingoGuard builds its training dataset, BingoGuardTrain, with a "generate-then-filter" approach; the dataset contains 54,897 entries spanning multiple severity levels and content styles. Dedicated LLMs are fine-tuned for each severity level so that generated outputs match the predefined severity criteria.

📈 Experimental evaluation shows that BingoGuard-8B outperforms leading moderation models such as WildGuard and ShieldGemma in detection accuracy, especially on lower-severity content, demonstrating its advantage in fine-grained content moderation.

The advancement of large language models (LLMs) has significantly influenced interactive technologies, presenting both benefits and challenges. One prominent issue arising from these models is their potential to generate harmful content. Traditional moderation systems, typically employing binary classifications (safe vs. unsafe), lack the necessary granularity to distinguish varying levels of harmfulness effectively. This limitation can lead to either excessively restrictive moderation, diminishing user interaction, or inadequate filtering, which could expose users to harmful content.

Salesforce AI introduces BingoGuard, an LLM-based moderation system designed to address the inadequacies of binary classification by predicting both binary safety labels and detailed severity levels. BingoGuard utilizes a structured taxonomy, categorizing potentially harmful content into eleven specific areas, including violent crime, sexual content, profanity, privacy invasion, and weapon-related content. Each category incorporates five clearly defined severity levels ranging from benign (level 0) to extreme risk (level 4). This structure enables platforms to calibrate their moderation settings precisely according to their specific safety guidelines, ensuring appropriate content management across varying severity contexts.
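The taxonomy-plus-threshold idea can be sketched in a few lines. The category names and severity labels below follow the article's description (eleven categories, levels 0 through 4); the `ModerationVerdict` class, `should_block` function, and the specific threshold semantics are hypothetical illustrations, not BingoGuard's actual API.

```python
# Illustrative sketch (not Salesforce's actual schema): a harm taxonomy with
# five severity levels, plus a per-platform threshold deciding what to block.
from dataclasses import dataclass

# Five of the eleven categories named in the article; the rest are analogous.
CATEGORIES = [
    "violent_crime",
    "sexual_content",
    "profanity",
    "privacy_invasion",
    "weapon_related",
]

SEVERITY_LABELS = {0: "benign", 1: "low", 2: "moderate", 3: "high", 4: "extreme"}

@dataclass
class ModerationVerdict:
    category: str
    severity: int  # 0 (benign) .. 4 (extreme)

    @property
    def unsafe(self) -> bool:
        # Binary label derived from severity: anything above benign.
        return self.severity > 0

def should_block(verdict: ModerationVerdict, platform_threshold: int) -> bool:
    """Block only content at or above the platform's chosen severity level."""
    return verdict.severity >= platform_threshold

# A stricter platform (threshold 1) blocks moderate profanity;
# a more permissive one (threshold 3) lets it through.
v = ModerationVerdict(category="profanity", severity=2)
print(should_block(v, platform_threshold=1))  # True
print(should_block(v, platform_threshold=3))  # False
```

The point of the threshold parameter is exactly the calibration the article describes: two platforms can share one severity-aware model yet enforce different policies.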

From a technical perspective, BingoGuard employs a “generate-then-filter” methodology to assemble its comprehensive training dataset, BingoGuardTrain, consisting of 54,897 entries spanning multiple severity levels and content styles. This framework initially generates responses tailored to different severity tiers, subsequently filtering these outputs to ensure alignment with defined quality and relevance standards. Specialized LLMs undergo individual fine-tuning processes for each severity tier, using carefully selected and expertly audited seed datasets. This fine-tuning guarantees that generated outputs adhere closely to predefined severity rubrics. The resultant moderation model, BingoGuard-8B, leverages this meticulously curated dataset, enabling precise differentiation among various degrees of harmful content. Consequently, moderation accuracy and flexibility are significantly enhanced.
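The generate-then-filter loop can be outlined as follows. Both `generate_response` and `matches_rubric` are placeholder stubs standing in for the per-tier fine-tuned LLMs and rubric filters the article describes; their names and behavior are assumptions for illustration only.

```python
# Minimal sketch of a "generate-then-filter" dataset-construction loop.
# All function names here are hypothetical stand-ins for LLM components.

def generate_response(prompt: str, target_severity: int) -> str:
    # Placeholder: a real system would call an LLM fine-tuned for this tier.
    return f"response to {prompt!r} at severity {target_severity}"

def matches_rubric(text: str, target_severity: int) -> bool:
    # Placeholder filter: a real system would score the candidate against
    # the predefined severity rubric and reject mismatches.
    return f"severity {target_severity}" in text

def build_tier_examples(prompts, target_severity):
    """Generate candidates, then keep only those passing the rubric filter."""
    kept = []
    for p in prompts:
        cand = generate_response(p, target_severity)
        if matches_rubric(cand, target_severity):
            kept.append({"prompt": p, "response": cand,
                         "severity": target_severity})
    return kept

dataset = []
for tier in range(5):  # severity levels 0..4
    dataset.extend(build_tier_examples(["how do locks work?"], tier))
print(len(dataset))  # one kept example per tier in this toy run: 5
```

The key design choice is that generation and filtering are separate stages, so the filter can enforce severity rubrics independently of however the generator was prompted or tuned.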

Empirical evaluations of BingoGuard indicate strong performance. Testing against BingoGuardTest, an expert-labeled dataset comprising 988 examples, revealed that BingoGuard-8B achieves higher detection accuracy than leading moderation models such as WildGuard and ShieldGemma, with improvements of up to 4.3%. Notably, BingoGuard demonstrates superior accuracy in identifying lower-severity content (levels 1 and 2), traditionally difficult for binary classification systems. Additionally, in-depth analyses uncovered a relatively weak correlation between predicted “unsafe” probabilities and the actual severity level, underscoring the necessity of explicitly incorporating severity distinctions. These findings illustrate fundamental gaps in current moderation methods that primarily rely on binary classifications.
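The correlation analysis mentioned above can be reproduced in spirit with a simple Pearson check between a binary model's predicted "unsafe" probability and expert severity labels. The scores below are invented purely to illustrate the computation, not results from the paper.

```python
# Sketch of the correlation check: does a binary model's P(unsafe)
# track expert severity labels? Data below is fabricated for illustration.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical examples: expert severity (0-4) vs. a binary model's P(unsafe).
# Note the saturation: low-severity items already get high P(unsafe).
severity = [0, 1, 1, 2, 3, 4, 2, 4]
p_unsafe = [0.1, 0.9, 0.8, 0.85, 0.9, 0.95, 0.7, 0.97]

r = pearson(p_unsafe, severity)
print(round(r, 2))
```

When the binary model's probability saturates near 1.0 for everything it deems unsafe, the correlation with severity stays well below perfect, which is the gap that motivates predicting severity levels explicitly.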

In conclusion, BingoGuard enhances the precision and effectiveness of AI-driven content moderation by integrating detailed severity assessments alongside binary safety evaluations. This approach allows platforms to handle moderation with greater accuracy and sensitivity, minimizing the risks associated with both overly cautious and insufficient moderation strategies. Salesforce’s BingoGuard thus provides an improved framework for addressing the complexities of content moderation within increasingly sophisticated AI-generated interactions.


Check out the Paper. All credit for this research goes to the researchers of this project.


