少点错误 17小时前
Four places where you can put LLM monitoring
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

为防止潜在的失控LLM代理产生灾难性后果,对LLM行为进行监控至关重要。本文探讨了四种关键的监控部署位置:代理脚手架、网络威胁检测与响应系统、LLM API以及代码审查。每种位置都有其独特的优势和劣势,涉及简单性、灵活性、上下文信息、成熟度、鲁棒性和LLM上下文理解等方面。文章分析了不同部署策略的适用场景,并提出了一个低置信度的优先级排序建议,强调了结合多种监控方式以实现全面AI安全控制的重要性。

🤖 **代理脚手架监控**:这是将LLM API与执行环境连接的自然位置,易于修改且拥有LLM和执行环境的双重上下文。但其弱点在于更新频繁、新颖性不足、可能存在文化差异以及在用户可控环境中易受攻击。

🚨 **网络威胁检测与响应系统监控**:利用组织现有的成熟基础设施,该系统在执行上下文方面具有优势,并能自动应对共谋行为。然而,它缺乏LLM的深层上下文,可能面临信息过载,且对非网络威胁的适用性有限,干预方式较为粗糙。

🌐 **LLM API监控**:此位置的优势在于其鲁棒性和LLM的全部上下文信息,且已有初步的部署实践。但其缺点是缺乏执行上下文,与代理脚手架的协调效率不高,可能需要区分安全与不安全的API。

👨‍💻 **代码审查监控**:借鉴了长久以来的人工代码审查实践,在执行上下文和LLM上下文(即变更理由)方面具有优势,且改动灵活。然而,其局限性在于仅适用于通过代码审查的威胁,不适用于未审查的潜在风险。

Published on August 9, 2025 11:10 PM GMT

To prevent potentially misaligned LLM agents from taking actions with catastrophic consequences, you can try to monitor LLM actions - that is, try to detect dangerous or malicious actions, and do something about it when you do (like blocking the action, starting an investigation, …).[1]

But where in your infrastructure should your LLM monitor be implemented, and what team in your organization should own it? Historically, researchers have mostly focused on implementing monitoring inside the agent scaffold, which would mean that it would be owned by agent product teams. But actually there are also three other important places where you might put it, and I think these are comparable in importance to LLM monitors in agent scaffolds.

This is important for a few reasons:

    This might inspire us to focus on developing better monitoring techniques and control measures that can be implemented strictly with the affordances of one of these four places.It tells us about which teams in the AI companies might be crucial for AI control efforts later; it might be good to talk to these teams about AI control now, or to work with those teams (even on projects that aren’t directly relevant for mitigating risk from misaligned AI) in the hope of helping with implementing effective control techniques later.[2]

TL;DR: You can put your monitors in these blue boxes:

Monitors in agent scaffolds

Agent scaffolds are what links an LLM API and an execution environment, so it’s natural to put your monitors there (diagrams adapted from here):

This puts your monitor logic next to your agent logic - which often handles permissions and decides when to call the API and when to execute actions - at least in scaffolds like Claude Code and Cursor agents.[3]

Strengths:

Weaknesses:

Monitors in cyber threat detection-and-response systems

All big organizations have “detection and response” teams that try to detect malicious activities on the organization’s devices. They collect tons of data on all the commands and processes run, on file modifications, on network traffic, etc. They do this with specialized software like Crowdstrike which collects and helps analyze this data. (As the very wide impact of the Crowdstrike Windows bug illustrates, this is very common.)

When some suspicious sequence of events is recorded, someone on the detection and response team and/or on Crowdstrike’s team is pinged and collects more data related to the detection (either by looking at existing logs or if the situation is severe, by monitoring an employee’s activity in real time).[4]

The suspicion is usually not computed using the sorts of LLM-based monitors that I usually think about (rule-based detection like YARA rules and old-school AI anomaly detection are more common), but this is not a strong reason to discount the effectiveness of such measures - they are effective enough that pen-testing exercises regularly result in the detection and response team catching the attackers.[5]

Therefore, I think cyber threat detection system is an important place where to put your (LLM-based and non-LLM-based) monitors:

Strengths:

Weaknesses:

Monitors in LLM APIs

On the opposite side of the cyber threat detection systems, you can try to monitor LLM generations before they even have the chance of being executed by doing monitoring on the LLM API:

Strengths:

Weaknesses:

The right strategy might be to have both safe and unsafe LLM APIs.

Monitors in code reviews

Finally, you can add monitors where people are already doing a bunch of fuzzy monitoring: in code reviews! It is already commonplace to have humans review code before it is deployed with a higher level of permissions (e.g. when using infrastructure-as-code, the code is often only applied when merged). When AIs write more of the code, this becomes a natural place where you can monitor what AIs do. And if the review load becomes too high for humans, you can add AI assistance and eventually have AIs do most of the review work.

Strengths:

Weaknesses:

So where should you put the monitors?

Since they have overlapping strengths and weaknesses, ideally all of the above:

 FlexibilityExecution contextLLM contextRobustnessGeneralityMaturity
Monitors in agent scaffolds++++++

-/+[6]

+-
Monitors in cyber threat detection systems-++-+-++
Monitors in LLM APIs--+++++++
Monitors in code reviews+++++-++

My (low-confidence) prioritized list would look something like:

    Make sure you are not missing on the baseline cyber threat detection systems (if you are an AI company, you were supposed to do that anyway)Then focus on building monitors where it is easy, which is probably in the agent scaffoldThen add things to the API to make it more robust.Then add automated monitors to the code review if humans start rubber-stamping code reviews because AIs are producing too much code.And continuously maintain / improve all of the above.

But this depends a lot on the organization and the effort that people are willing to put in monitors. For example, I can imagine that in some organizations, causing bad outcomes is only possible if code is reviewed and merged and that adding monitors to APIs and to agent scaffolds is hard, in which case I am sympathetic to focusing on code review monitors.

Appendix: Further subdivisions

  1. ^

     By monitoring I mostly intend to cover detection, which is an important component of most deployment-time AI control measures.

  2. ^

    Working at AI companies has a bunch of other positive and negative consequences. This post is definitely not a generic endorsement of joining an AI company to help deal with AI misalignment risk later.

  3. ^

     Note: I group here both front-end and backend scaffolds. In Claude Code, the agent scaffold runs on the same device as the one on which most actions are executed, but this does not have to be the case.

  4. ^

     When monitoring human attackers, it is common to wait before acting to collect more data. This seems less essential for LLMs if there is the possibility to replay the attack.

  5. ^

     But more effort could make them better! These teams are often hiring and buying services to make themselves more effective. I suspect this could be a great path to impact if you have a relevant background.

  6. ^

     I’d rate the robustness of monitors in agent scaffolds as - for client side agent scaffolds and + for server-side agent scaffolds.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM监控 AI安全 代理脚手架 网络安全 LLM API
相关文章