MarkTechPost@AI 07月29日 14:17
Safeguarding Agentic AI Systems: NVIDIA’s Open-Source Safety Recipe
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

随着大型语言模型(LLMs)向自主决策的智能体系统演进,其能力与风险并存。企业在快速采纳智能体AI以实现自动化时,面临着目标不一致、提示注入、意外行为、数据泄露和人类监督减少等新挑战。为应对这些问题,英伟达推出了开源软件套件和训练后安全策略,旨在贯穿智能体AI系统的整个生命周期提供保障。该方案包括部署前的评估、训练后的对齐以及持续的保护措施,并利用了Nemotron Content Safety Dataset、WildGuardMix等开放数据集和基准测试,以提升内容安全和产品安全性,帮助企业在拥抱AI创新的同时,确保安全与合规。

🛡️ **智能体AI的风险与挑战**:智能体LLMs凭借先进的推理和工具使用能力,实现了高度自主性,但也带来了内容审核失败(产生有害、有毒或有偏见输出)、安全漏洞(如提示注入、越狱尝试)以及合规与信任风险(不符合企业政策或监管标准)。传统防护措施已难以应对模型和攻击技术的快速演变,企业需要系统性的全生命周期策略来确保模型与内部政策及外部法规保持一致。

🔧 **英伟达安全策略的端到端框架**:英伟达的智能体AI安全策略提供了一个全面的端到端框架,涵盖评估、对齐和防护。在部署前,通过开放数据集和基准测试评估模型在企业政策、安全要求和信任阈值方面的表现;在训练后,利用强化学习(RL)、监督微调(SFT)和多策略数据集融合,进一步对模型进行安全标准的对齐;部署后,通过NeMo Guardrails和实时监控微服务提供持续的、可编程的防护,主动拦截不安全输出,抵御提示注入和越狱尝试。

📊 **关键组件与数据集**:该安全策略的核心组件包括:部署前评估(使用Nemotron Content Safety Dataset、WildGuardMix等数据集和garak扫描器),训练后对齐(采用RL、SFT和开放数据集),以及部署与推理阶段的持续保护(通过NeMo Guardrails和NIM微服务)。这些组件共同作用,确保模型在内容安全、产品安全和抵御新攻击方面得到持续加固。

📈 **量化安全效益**:英伟达的训练后安全策略在实际应用中取得了显著成效。内容安全方面,模型性能从88%提升至94%,提高了6%;产品安全性方面,模型对对抗性提示(如越狱)的抵御能力从56%提升至63%,提高了7%,且未对模型准确性造成可衡量的损失。

🤝 **生态系统协作与开放获取**:英伟达的方案强调开放性和协作,其安全评估和训练后策略工具、数据集及指南均可公开获取。此外,通过与Cisco AI Defense、CrowdStrike等领先的网络安全提供商合作,将持续的安全信号和事件驱动的改进集成到AI生命周期中。企业可以根据自身定制的业务政策、风险阈值和监管要求,灵活地调整和应用该策略,实现模型的迭代加固和持续可信。

As large language models (LLMs) evolve from simple text generators to agentic systems —able to plan, reason, and autonomously act—there is a significant increase in both their capabilities and associated risks. Enterprises are rapidly adopting agentic AI for automation, but this trend exposes organizations to new challenges: goal misalignment, prompt injection, unintended behaviors, data leakage, and reduced human oversight. Addressing these concerns, NVIDIA has released an open-source software suite and a post-training safety recipe designed to safeguard agentic AI systems throughout their lifecycle.

The Need for Safety in Agentic AI

Agentic LLMs leverage advanced reasoning and tool use, enabling them to operate with a high degree of autonomy. However, this autonomy can result in:

Traditional guardrails and content filters often fall short as models and attacker techniques rapidly evolve. Enterprises require systematic, lifecycle-wide strategies for aligning open models with internal policies and external regulations.

NVIDIA’s Safety Recipe: Overview and Architecture

NVIDIA’s agentic AI safety recipe provides a comprehensive end-to-end framework to evaluate, align, and safeguard LLMs before, during, and after deployment:

Core Components

StageTechnology/ToolsPurpose
Pre-Deployment EvaluationNemotron Content Safety Dataset, WildGuardMix, garak scannerTest safety/security
Post-Training AlignmentRL, SFT, open-licensed dataFine-tune safety/alignment
Deployment & InferenceNeMo Guardrails, NIM microservices (content safety, topic control, jailbreak detect)Block unsafe behaviors
Monitoring & Feedbackgarak, real-time analyticsDetect/resist new attacks

Open Datasets and Benchmarks

Post-Training Process

NVIDIA’s post-training recipe for safety is distributed as an open-source Jupyter notebook or as a launchable cloud module, ensuring transparency and broad accessibility. The workflow typically includes:

    Initial Model Evaluation: Baseline testing on safety/security with open benchmarks.On-policy Safety Training: Response generation by the target/aligned model, supervised fine-tuning, and reinforcement learning with open datasets.Re-evaluation: Re-running safety/security benchmarks post-training to confirm improvements.Deployment: Trusted models are deployed with live monitoring and guardrail microservices (content moderation, topic/domain control, jailbreak detection).

Quantitative Impact

Collaborative and Ecosystem Integration

NVIDIA’s approach goes beyond internal tools—partnerships with leading cybersecurity providers (Cisco AI Defense, CrowdStrike, Trend Micro, Active Fence) enable integration of continuous safety signals and incident-driven improvements across the AI lifecycle.

How To Get Started

    Open Source Access: The full safety evaluation and post-training recipe (tools, datasets, guides) is publicly available for download and as a cloud-deployable solution.Custom Policy Alignment: Enterprises can define custom business policies, risk thresholds, and regulatory requirements—using the recipe to align models accordingly.Iterative Hardening: Evaluate, post-train, re-evaluate, and deploy as new risks emerge, ensuring ongoing model trustworthiness.

Conclusion

NVIDIA’s safety recipe for agentic LLMs represents an industry-first, openly available, systematic approach to hardening LLMs against modern AI risks. By operationalizing robust, transparent, and extensible safety protocols, enterprises can confidently adopt agentic AI, balancing innovation with security and compliance.


Check out the NVIDIA AI safety recipe and Technical details. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

FAQ: Can Marktechpost help me to promote my AI Product and position it in front of AI Devs and Data Engineers?

Ans: Yes, Marktechpost can help promote your AI product by publishing sponsored articles, case studies, or product features, targeting a global audience of AI developers and data engineers. The MTP platform is widely read by technical professionals, increasing your product’s visibility and positioning within the AI community. [SET UP A CALL]

    The post Safeguarding Agentic AI Systems: NVIDIA’s Open-Source Safety Recipe appeared first on MarkTechPost.

    Fish AI Reader

    Fish AI Reader

    AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

    FishAI

    FishAI

    鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

    联系邮箱 441953276@qq.com

    相关标签

    智能体AI AI安全 大型语言模型 英伟达 开源
    相关文章