How Not to Detect Prompt Injections with an LLM

cs.AI updates on arXiv.org 07月09日 12:01

How Not to Detect Prompt Injections with an LLM

本文揭示了基于已知答案检测（KAD）的LLM集成应用和代理面临prompt注入攻击的严重风险，并设计了一种名为DataFlip的攻击方法，有效规避KAD防御。

arXiv:2507.05630v1 Announce Type: cross Abstract: LLM-integrated applications and agents are vulnerable to prompt injection attacks, in which adversaries embed malicious instructions within seemingly benign user inputs to manipulate the LLM's intended behavior. Recent defenses based on $\textit{known-answer detection}$ (KAD) have achieved near-perfect performance by using an LLM to classify inputs as clean or contaminated. In this work, we formally characterize the KAD framework and uncover a structural vulnerability in its design that invalidates its core security premise. We design a methodical adaptive attack, $\textit{DataFlip}$, to exploit this fundamental weakness. It consistently evades KAD defenses with detection rates as low as $1.5\%$ while reliably inducing malicious behavior with success rates of up to $88\%$, without needing white-box access to the LLM or any optimization procedures.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM prompt注入攻击已知答案检测安全漏洞 DataFlip攻击

相关文章

F5 管理器现漏洞，能让攻击者开设账户并长期潜伏

Import AI 368: 500% faster local LLMs; 38X more efficient red teaming; AI21’s Frankenmodel

Keyboard logging flaw affected nearly a billion users

Learn AI Together — Towards AI Community Newsletter #23

This AI newsletter is all you need #98

Patterns and Middleware for LLM Applications with Kyle Roche - #659

Building LLM-Based Applications with Azure OpenAI with Jay Emery - #657

Mental Models for Advanced ChatGPT Prompting with Riley Goodside - #652

FastGen: Cutting GPU Memory Costs Without Compromising on LLM Quality