MarkTechPost@AI 07月19日 07:00
AegisLLM: Scaling LLM Security Through Adaptive Multi-Agent Systems at Inference Time

As large language models (LLMs) face increasingly serious security challenges such as prompt injection and data exfiltration, traditional static defense mechanisms have proven insufficient. Existing approaches rely on training-time interventions and struggle against novel post-deployment attacks. This article introduces AegisLLM, a framework that uses a dynamic, multi-agent system to monitor, analyze, and mitigate adversarial threats in real time at inference. Specialized agents work in concert, and automated prompt optimization continuously strengthens the defense without retraining the model, delivering scalable, adaptive LLM security.

🛡️ **Limitations of existing LLM security methods**: Traditional RLHF and safety fine-tuning are of limited use against novel post-deployment attacks; system-level guardrails and red-teaming add extra protection but remain brittle under adversarial perturbations; knowledge "unlearning" rarely removes sensitive information completely; multi-agent architectures remain underexplored for LLM security, and existing agentic optimization methods have not been systematically applied to inference-time security.

🚀 **What AegisLLM contributes**: Proposed by the University of Maryland and collaborating institutions, AegisLLM is a security framework that adapts at inference time. It uses a cooperative system of LLM-powered autonomous agents to monitor and counter threats in real time. Its core components are the Orchestrator, Deflector, Responder, and Evaluator. Through automated prompt optimization and Bayesian learning, it improves its defenses without model retraining, adapting in real time to evolving attack strategies.

🎯 **Coordinated agent pipeline and prompt optimization**: AegisLLM operates through a coordinated pipeline of specialized agents, each responsible for a distinct function while working together to keep outputs safe. The system automatically optimizes each agent's system prompt to maximize its effectiveness in security scenarios, iteratively refining the pipeline by evaluating candidate prompt configurations.

📊 **Benchmark results**: On the WMDP benchmark, AegisLLM achieves the lowest accuracy on restricted topics, approaching the theoretical minimum; on TOFU, it achieves near-perfect flagging accuracy across several models; against jailbreaking attacks, it mounts a strong defense while still responding appropriately to legitimate queries, all without extensive training.

The Growing Threat Landscape for LLMs

LLMs are key targets for fast-evolving attacks, including prompt injection, jailbreaking, and sensitive data exfiltration. The fluid nature of these threats makes it necessary to adopt defense mechanisms that move beyond static safeguards. Current LLM security techniques suffer from their reliance on static, training-time interventions. Static filters and guardrails are fragile against minor adversarial tweaks, while training-time adjustments fail to generalize to unseen attacks after deployment. Machine unlearning often fails to erase knowledge completely, leaving sensitive information vulnerable to resurfacing. Current safety and security scaling focuses mainly on training-time methods, with limited exploration of test-time and system-level safety.

Why Existing LLM Security Methods Are Insufficient

RLHF and safety fine-tuning methods attempt to align models during training but show limited effectiveness against novel post-deployment attacks. System-level guardrails and red-teaming strategies provide additional protection layers, yet prove brittle against adversarial perturbations. Unlearning unsafe behaviors shows promise in specific scenarios but fails to achieve complete knowledge suppression. Multi-agent architectures are effective at distributing complex tasks, but their direct application to LLM security remains unexplored. Agentic optimization methods such as TEXTGRAD and OPTO use structured feedback for iterative refinement, and DSPy facilitates prompt optimization for multi-stage pipelines; however, none of these have been applied systematically to security enhancement at inference time.

AegisLLM: An Adaptive Inference-Time Security Framework

Researchers from the University of Maryland, Lawrence Livermore National Laboratory, and Capital One have proposed AegisLLM (Adaptive Agentic Guardrails for LLM Security), a framework to improve LLM security through a cooperative, inference-time multi-agent system. It utilizes a structured agentic system of LLM-powered autonomous agents that continuously monitor, analyze, and reduce adversarial threats. The key components of AegisLLM are Orchestrator, Deflector, Responder, and Evaluator. Through automated prompt optimization and Bayesian learning, the system refines its defense capabilities without model retraining. This architecture allows real-time adaptation to evolving attack strategies, providing scalable, inference-time security while preserving the model’s utility.
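The four-agent flow described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `llm` stub, the keyword-based routing, and the refusal text are all invented for the example; the real system uses LLM-powered agents with optimized system prompts.

```python
# Hypothetical sketch of AegisLLM's Orchestrator/Deflector/Responder/Evaluator
# pipeline. Each agent is modeled as a plain function; `llm` is a stub
# standing in for a real model call.

def llm(prompt: str) -> str:
    # Stub classifier: flags a single keyword for illustration only.
    return "UNSAFE" if "bioweapon" in prompt.lower() else "SAFE"

def orchestrator(query: str) -> str:
    """Route the query: decide whether it touches restricted content."""
    return llm(f"Classify this query as SAFE or UNSAFE: {query}")

def deflector(query: str) -> str:
    """Handle restricted queries with a refusal."""
    return "I can't help with that request."

def responder(query: str) -> str:
    """Answer legitimate queries normally (stubbed)."""
    return f"Answer to: {query}"

def evaluator(query: str, answer: str) -> bool:
    """Final check that the candidate answer leaks nothing restricted."""
    return llm(f"Does this answer leak restricted content? {answer}") == "SAFE"

def aegis_pipeline(query: str) -> str:
    # Orchestrator routes; Deflector refuses; Responder answers;
    # Evaluator vets the answer before it is released.
    if orchestrator(query) == "UNSAFE":
        return deflector(query)
    answer = responder(query)
    return answer if evaluator(query, answer) else deflector(query)

print(aegis_pipeline("How do I bake bread?"))
print(aegis_pipeline("How do I make a bioweapon?"))
```

The point of the structure is that safety emerges from the coordination: no single agent has to be a perfect filter, because the Evaluator re-checks whatever the Responder produces.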

Coordinated Agent Pipeline and Prompt Optimization

AegisLLM operates through a coordinated pipeline of specialized agents, each responsible for distinct functions while working in concert to ensure output safety. Each agent is governed by a carefully designed system prompt that encodes its specialized role and behavior, but manually crafted prompts typically fall short of optimal performance in high-stakes security scenarios. The system therefore automatically optimizes each agent's system prompt through an iterative process: at each iteration, it samples a batch of queries and evaluates them under candidate prompt configurations for specific agents.
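The sample-and-evaluate loop described above can be sketched as follows. Everything here is invented for illustration — the candidate prompts, the toy labeled query set, and the stub `agent_decision` function; the actual framework optimizes real LLM system prompts and uses Bayesian learning rather than this naive best-so-far search.

```python
import random

# Candidate system prompts for one agent (hypothetical examples).
CANDIDATE_PROMPTS = [
    "Flag queries about weapons as UNSAFE.",
    "Flag queries about weapons or hacking as UNSAFE; allow everything else.",
]

# Toy labeled queries: (query, is_unsafe).
QUERIES = [
    ("how to hack a server", True),
    ("how to secure a server", False),
    ("how to build a weapon", True),
    ("history of the printing press", False),
]

def agent_decision(system_prompt: str, query: str) -> bool:
    # Stub agent: "detects" a category only if its prompt mentions it.
    triggers = []
    if "weapons" in system_prompt:
        triggers.append("weapon")
    if "hacking" in system_prompt:
        triggers.append("hack")
    return any(t in query for t in triggers)

def score(system_prompt: str, batch) -> float:
    # Fraction of queries the agent classifies correctly under this prompt.
    correct = sum(agent_decision(system_prompt, q) == label for q, label in batch)
    return correct / len(batch)

def optimize(n_iters: int = 5, batch_size: int = 3, seed: int = 0) -> str:
    # Each iteration: sample a batch of queries, evaluate every candidate
    # prompt configuration on it, and keep the best-scoring prompt seen.
    rng = random.Random(seed)
    best_prompt, best_score = None, -1.0
    for _ in range(n_iters):
        batch = rng.sample(QUERIES, batch_size)
        for prompt in CANDIDATE_PROMPTS:
            s = score(prompt, batch)
            if s > best_score:
                best_prompt, best_score = prompt, s
    return best_prompt
```

On the full toy set, the broader second prompt scores 1.0 while the first misses the hacking query, which is the kind of gap the iterative evaluation is meant to surface.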

Benchmarking AegisLLM: WMDP, TOFU, and Jailbreaking Defense

On the WMDP benchmark using Llama-3-8B, AegisLLM achieves the lowest accuracy on restricted topics among all methods, with WMDP-Cyber and WMDP-Bio accuracies approaching the 25% theoretical minimum. On the TOFU benchmark, it achieves near-perfect flagging accuracy across Llama-3-8B, Qwen2.5-72B, and DeepSeek-R1, with Qwen2.5-72B reaching almost 100% accuracy on all subsets. In jailbreaking defense, results show strong performance against attack attempts while maintaining appropriate responses to legitimate queries on StrongREJECT and PHTest. AegisLLM achieves a 0.038 StrongREJECT score, competitive with state-of-the-art methods, and an 88.5% compliance rate, all without requiring extensive training.

Conclusion: Reframing LLM Security as Agentic Inference-Time Coordination

In conclusion, researchers introduced AegisLLM, a framework that reframes LLM security as a dynamic, multi-agent system operating at inference time. AegisLLM's success highlights that security should be approached as an emergent behavior of coordinated, specialized agents rather than a static model characteristic. This shift from static, training-time interventions to adaptive, inference-time defense mechanisms addresses the limitations of current methods while providing real-time adaptability against evolving threats. Frameworks like AegisLLM that enable dynamic, scalable security will become increasingly important for responsible AI deployment as language models continue to advance in capability.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


