MarkTechPost@AI December 14, 2024
IBM Open-Sources Granite Guardian: A Suite of Safeguards for Risk Detection in LLMs

Large language models create major opportunities, but deploying them raises numerous challenges. IBM has released Granite Guardian, an open-source suite for detecting and mitigating LLM risks across multiple risk dimensions; its models offer several advantages, and extensive benchmarking demonstrates their effectiveness.

🎯 Granite Guardian detects multiple risks, including harmful content and hallucinations

💻 Two model sizes are offered, integrating diverse data sources for stronger generalization

🔄 The suite integrates into existing AI workflows, with high applicability and scalability

🌐 Its open-source nature fosters community-driven enhancements, improving AI safety practices

Rapid advances in large language models (LLMs) have opened significant opportunities across industries. However, deploying them in real-world scenarios presents challenges such as harmful content generation, hallucinations, and ethical misuse. LLMs can produce socially biased, violent, or profane outputs, and adversarial actors exploit vulnerabilities through jailbreaks to bypass safety measures. Another critical issue lies in retrieval-augmented generation (RAG) systems, where LLMs draw on external data but may return contextually irrelevant or factually incorrect responses. Addressing these challenges requires robust safeguards that ensure responsible and safe AI usage.

To address these risks, IBM has introduced Granite Guardian, an open-source suite of safeguards for risk detection in LLMs. The suite identifies harmful prompts and responses across a broad spectrum of risks, including social bias, profanity, violence, unethical behavior, sexual content, and hallucination-related issues specific to RAG systems. Released as part of IBM’s open-source initiative, Granite Guardian aims to promote transparency, collaboration, and responsible AI development. With a comprehensive risk taxonomy and training data enriched by human annotations and synthetic adversarial samples, the suite provides a versatile approach to risk detection and mitigation.

Technical Details

Granite Guardian’s models, built on IBM’s Granite 3.0 framework, come in two variants: a lightweight 2-billion-parameter model and a more comprehensive 8-billion-parameter version. Both integrate diverse data sources, including human-annotated datasets and adversarially generated synthetic samples, to improve generalization across risk categories. The system addresses jailbreak detection, often overlooked by traditional safety frameworks, using synthetic data designed to mimic sophisticated adversarial attacks. The models also cover RAG-specific risks such as context relevance, groundedness, and answer relevance, ensuring that generated outputs align with user intent and factual accuracy.
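
For readers who want to try the models, here is a minimal sketch of screening a prompt with the 2B variant via Hugging Face transformers. The model ID and the guardian_config argument follow the public model card; the specific risk_name value and generation settings are assumptions for illustration, not IBM's canonical pipeline.

```python
# Minimal sketch: classify a user prompt with Granite Guardian (2B variant).
# Model ID and guardian_config follow the Hugging Face model card; risk_name
# and generation settings here are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-guardian-3.0-2b"  # 8B variant: ...-8b
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

messages = [{"role": "user", "content": "How can I hurt someone and get away with it?"}]

# The chat template renders the conversation as a yes/no risk-classification
# prompt; other risk names in the taxonomy cover bias, profanity, etc.
input_ids = tokenizer.apply_chat_template(
    messages,
    guardian_config={"risk_name": "harm"},
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

# The guardian answers "Yes" (risky) or "No" (safe) as its generated text.
verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict.strip())
```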

A notable feature of Granite Guardian is its adaptability: the models can be dropped into existing AI workflows as real-time guardrails or offline evaluators. Strong benchmark results, including AUC scores of 0.871 on harmful-content and 0.854 on RAG-hallucination benchmarks, demonstrate applicability across diverse scenarios. The open-source nature of Granite Guardian also encourages community-driven enhancements, fostering improvements in AI safety practices.
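
As a concrete illustration of the real-time guardrail pattern, the sketch below wraps a model call with input and output screening. Both check_risk and generate_answer are hypothetical placeholders: in practice, check_risk would run the Granite Guardian classification shown earlier, and generate_answer would call your production LLM.

```python
# Hedged sketch of a guardrail loop: screen the prompt, call the LLM,
# then screen the response. Both helpers are hypothetical placeholders.

def check_risk(text: str, risk_name: str = "harm") -> bool:
    # Placeholder: in practice, run the Granite Guardian classification
    # shown above and return True when it answers "Yes".
    return "hurt someone" in text.lower()

def generate_answer(prompt: str) -> str:
    # Placeholder for the production LLM call.
    return f"(model response to: {prompt})"

def guarded_chat(prompt: str) -> str:
    if check_risk(prompt):                 # input guardrail
        return "Request declined by safety policy."
    answer = generate_answer(prompt)
    if check_risk(answer):                 # output guardrail
        return "Response withheld by safety policy."
    return answer

print(guarded_chat("What's the capital of France?"))
print(guarded_chat("How can I hurt someone?"))
```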

Insights and Results

Extensive benchmarking highlights the efficacy of Granite Guardian. On public datasets for harmful content detection, the 8B variant achieved an AUC of 0.871, outperforming baselines like Llama Guard and ShieldGemma. Its precision-recall trade-offs, represented by an AUPRC of 0.846, reflect its capability to detect harmful prompts and responses. In RAG-related evaluations, the models demonstrated strong performance, with the 8B model achieving an AUC of 0.895 in identifying groundedness issues.
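
For readers less familiar with these metrics, the toy sketch below shows how AUC and AUPRC summarize a detector's ranking quality across all decision thresholds, using scikit-learn. The labels and scores are illustrative only, not IBM's evaluation data.

```python
# Toy illustration of the reported metrics with scikit-learn.
from sklearn.metrics import roc_auc_score, average_precision_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]                   # 1 = harmful, 0 = benign
scores = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]   # detector risk scores

print("AUC:  ", roc_auc_score(labels, scores))
print("AUPRC:", average_precision_score(labels, scores))
```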

The models’ ability to generalize across diverse datasets, including adversarial prompts and real-world user queries, showcases their robustness. For instance, on the ToxicChat dataset, Granite Guardian demonstrated high recall, effectively flagging harmful interactions with minimal false positives. These results indicate the suite’s ability to provide reliable and scalable risk detection solutions in practical AI deployments.

Conclusion

IBM’s Granite Guardian offers a comprehensive solution to safeguarding LLMs against risks, emphasizing safety, transparency, and adaptability. Its capacity to detect a wide range of risks, combined with open-source accessibility, makes it a valuable tool for organizations aiming to deploy AI responsibly. As LLMs continue to evolve, tools like Granite Guardian ensure that this progress is accompanied by effective safeguards. By supporting collaboration and community-driven enhancements, IBM contributes to advancing AI safety and governance, promoting a more secure AI landscape.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
