Published on August 9, 2025 11:10 PM GMT
To prevent potentially misaligned LLM agents from taking actions with catastrophic consequences, you can try to monitor LLM actions - that is, try to detect dangerous or malicious actions, and do something about it when you do (like blocking the action, starting an investigation, …).[1]
But where in your infrastructure should your LLM monitor be implemented, and what team in your organization should own it? Historically, researchers have mostly focused on implementing monitoring inside the agent scaffold, which would mean that it would be owned by agent product teams. But actually there are also three other important places where you might put it, and I think these are comparable in importance to LLM monitors in agent scaffolds.
This is important for a few reasons:
- This might inspire us to focus on developing better monitoring techniques and control measures that can be implemented strictly with the affordances of one of these four places.
- It tells us which teams in AI companies might be crucial for AI control efforts later; it might be good to talk to these teams about AI control now, or to work with them (even on projects that aren't directly relevant for mitigating risk from misaligned AI) in the hope of helping implement effective control techniques later.[2]
TL;DR: You can put your monitors in these blue boxes:
Monitors in agent scaffolds
Agent scaffolds are the link between an LLM API and an execution environment, so it’s natural to put your monitors there (diagrams adapted from here):
This puts your monitor logic next to your agent logic - which often handles permissions and decides when to call the API and when to execute actions - at least in scaffolds like Claude Code and Cursor agents.[3]
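To make this concrete, here is a minimal sketch of what a pre-execution monitor hook in a scaffold could look like; this is not Claude Code's or Cursor's actual implementation, and the `ToolCall` shape, thresholds, and the keyword stand-in for a real LLM monitor are all made up for illustration:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical pre-execution hook inside an agent scaffold: before the scaffold
# executes a tool call proposed by the agent, a monitor scores the action and
# the scaffold blocks it, asks the user, or lets it through.

BLOCK_THRESHOLD = 0.9     # refuse outright above this suspicion score
ESCALATE_THRESHOLD = 0.5  # fall back to the existing user permission prompt above this

@dataclass
class ToolCall:
    name: str             # e.g. "bash", "write_file"
    arguments: dict       # tool-specific arguments
    agent_rationale: str  # whatever reasoning the agent gave for the action

def pre_execution_hook(
    call: ToolCall,
    monitor: Callable[[ToolCall], float],  # returns a suspicion score in [0, 1]
) -> str:
    """Return the scaffold's decision for a proposed action."""
    suspicion = monitor(call)
    if suspicion >= BLOCK_THRESHOLD:
        return "block"      # refuse and log the action for investigation
    if suspicion >= ESCALATE_THRESHOLD:
        return "ask_user"   # reuse the scaffold's existing permission prompt
    return "execute"

# Trivial stand-in monitor; in practice an LLM monitor would judge the full context.
def keyword_monitor(call: ToolCall) -> float:
    risky = ["rm -rf", "curl | sh", "ssh ", "secrets"]
    text = f"{call.name} {call.arguments}"
    return 1.0 if any(k in text for k in risky) else 0.1

print(pre_execution_hook(ToolCall("bash", {"cmd": "rm -rf /"}, "cleanup"), keyword_monitor))
```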
Strengths:
- Simplicity. This is the simplest option and the easiest one to change, as this is already where a lot of the permission system lives, and where end-users can describe their permission-related preferences.
- Flexibility. You can adapt the permission system to deal with LLM monitors, and implement all the fancy strategies described in AI control papers.
- Context. Agent scaffolds have access to both the LLM context and the execution environment.
Weaknesses:
- Regular updates. Agent scaffolds are regularly updated and new ones are regularly created, which is not ideal if monitors become a critical piece of security infrastructure.
- Novelty. Monitors within agent scaffolds would be relatively new, and don’t benefit from the inertia and industry experience behind other kinds of monitors.
- Cultural differences. The team that builds agents has historically been less focused on detecting adversarial threats or building systems to triage noisy signals of risk. One particularly important application of monitoring is to have a human investigate the actions flagged as most dangerous. The agent team is not organizationally well set up to fulfill this function, because it probably isn't already in the business of maintaining a team of experts who can investigate flagged actions and escalate appropriately.
- (Sometimes) Vulnerable environment. The agent scaffold logic sometimes runs on a machine that the agent or the user controls (e.g. this is currently the case for Claude Code, but probably not for the ChatGPT code interpreter), which opens up attacks where the agent or the user disables monitoring.
Monitors in cyber threat detection-and-response systems
All big organizations have “detection and response” teams that try to detect malicious activities on the organization’s devices. They collect tons of data on all the commands and processes run, on file modifications, on network traffic, etc. They do this with specialized software like CrowdStrike, which collects and helps analyze this data. (As the very wide impact of the CrowdStrike Windows bug illustrates, this is very common.)
When some suspicious sequence of events is recorded, someone on the detection and response team and/or on CrowdStrike’s team is pinged and collects more data related to the detection (either by looking at existing logs or, if the situation is severe, by monitoring an employee’s activity in real time).[4]
The suspicion is usually not computed using the sorts of LLM-based monitors that I usually think about (rule-based detection like YARA rules and old-school AI anomaly detection are more common), but this is not a strong reason to discount the effectiveness of such measures - they are effective enough that pen-testing exercises regularly result in the detection and response team catching the attackers.[5]
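As a toy illustration of how an LLM-based monitor could slot into such a pipeline (alongside, not instead of, rule-based detection), here is a sketch where cheap rules filter the firehose of endpoint events and only the flagged ones are handed to a more expensive triage step; the event schema and the stubbed `llm_triage` function are hypothetical, not any vendor's API:

```python
# Toy sketch: layering an LLM triage step on top of rule-based endpoint detection.
# The event fields and llm_triage() are illustrative stand-ins, not CrowdStrike's
# (or anyone else's) actual interface.

SUSPICIOUS_PATTERNS = [
    "curl | sh",    # piping remote scripts into a shell
    "base64 -d",    # decoding obfuscated payloads
    "/etc/shadow",  # reading password hashes
]

def rule_based_flag(event: dict) -> bool:
    """Cheap first-pass filter, analogous to YARA-style rules."""
    cmd = event.get("command", "")
    return any(p in cmd for p in SUSPICIOUS_PATTERNS)

def llm_triage(flagged_events: list[dict]) -> str:
    """Expensive second pass. In practice this would ask an LLM to piece the
    flagged events together; stubbed here with a trivial count-based rule."""
    return "investigate" if len(flagged_events) > 3 else "ignore"

def process_log(events: list[dict]) -> str:
    """Run the cheap rules over everything, triage only what they flag."""
    flagged = [e for e in events if rule_based_flag(e)]
    if not flagged:
        return "ignore"
    return llm_triage(flagged)
```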
Therefore, I think cyber threat detection systems are an important place to put your (LLM-based and non-LLM-based) monitors:
Strengths:
- Maturity. The infrastructure already exists and is relatively mature (at least for non-LLM-based monitors). And the teams that do this work are already set up to handle the associated organizational difficulties: for example, needing access to sensitive information from other teams, and needing specialized professionals to do fairly repetitive work that rarely feels like it directly pays off.
- Execution context. Cyber threat detection systems have rich context on where actions are executed. This relatively direct access to the consequences of actions also makes them less sensitive to justifications and to code obfuscation.
- Automatic robustness to collusion. Models might be able to help a lot with investigating event logs for suspicious activity without having to directly look at the actions that AI agents took that led to an event being recorded. This is good for us because the AI doing incident response might (as a result of the limited information flow) have trouble colluding with the agents being monitored to avoid pointing out occasions on which the monitored agents are actually trying to attack.
Weaknesses:
- Lack of LLM context. They lack context on why an LLM agent took the actions it took.
- Information overload. The firehose of information these systems have to go through can make it difficult to piece together sequences of actions and to use expensive LLM-based monitors.
- Not applicable to non-cyber threats. Much less applicable to threats that don’t look like cyberattacks (e.g. persuasion, data poisoning of successor AIs, sandbagging, …).
- Mostly blunt interventions and latency. If you do catch something, it’s hard to intervene in a subtle way, and it’s sometimes a bit too late. This means this approach is best suited for offline monitoring.
Monitors in LLM APIs
At the opposite end from cyber threat detection systems, you can try to monitor LLM generations before they even have a chance of being executed, by doing the monitoring in the LLM API:
- Just like Anthropic tries to block generations which contain information related to bioweapons, you could try to block generations that contain actions which are too dangerous (a sketch of what this could look like follows after this list).
- Just like OpenAI investigates uses of ChatGPT by cyberattack groups, you could try to have an investigation team that tries to piece together evidence about a model potentially generating misaligned actions.
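Here is a minimal sketch of the API-side version, assuming a hypothetical wrapper around the inference call; the `score_generation` monitor and the thresholds are made up, and this is not a description of any lab's actual API:

```python
# Hypothetical API-side monitoring wrapper: every completion is scored before
# it is returned to the caller. Names and thresholds are illustrative only.

BLOCK_THRESHOLD = 0.95  # refuse the generation outright
LOG_THRESHOLD = 0.6     # return it, but queue it for investigation

def score_generation(prompt: str, completion: str) -> float:
    """Suspicion score in [0, 1] for the generated actions. Stubbed here; in
    practice another LLM (possibly given more context than the policy model)
    would produce this score."""
    return 0.0

def log_for_investigation(prompt: str, completion: str, suspicion: float) -> None:
    """Queue the transcript for the (human or AI) investigation team."""
    print(f"flagged generation with suspicion {suspicion:.2f}")

def monitored_completion(prompt: str, generate) -> str:
    """Wrap the underlying inference call `generate` with monitoring."""
    completion = generate(prompt)
    suspicion = score_generation(prompt, completion)
    if suspicion >= BLOCK_THRESHOLD:
        return "[blocked by monitor]"   # analogous to blocking bioweapons-related content
    if suspicion >= LOG_THRESHOLD:
        log_for_investigation(prompt, completion, suspicion)
    return completion
```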
Strengths:
- Robustness. This makes it harder to accidentally leave monitoring off in one of your agent scaffolds, or to have problems resulting from AIs using LLM APIs directly in the course of the research they're doing.
- LLM context. The API has all the context the LLM has, which in principle is much of the context you need to assess the risks posed by a certain action. (The LLM, of course, has to decide what actions to take based only on the context it's given. But in practice, we might be able to get big advantages from exposing more context to our monitors than we expose to the LLM itself.)
- Some maturity. People have already started deploying LLM monitors in LLM APIs.
Weaknesses:
- Execution context. The LLM API does not know by default where actions will be executed. Maybe they will be executed in a secure sandbox or won’t be executed at all, or maybe they will be executed in a very sensitive environment. (In some cases, we might actively prefer that the AI not know some facts about the security properties of the system where its code will be run.) This also makes it harder to implement fine-grained policies.
- Disconnection from the agent scaffolds. If you want to do things like “if the monitor is too suspicious, tell the agent scaffold to ask the user”, then you need to update both the API and the agent scaffold. This coordination is inefficient and can slow down iteration.
The right strategy might be to have both safe and unsafe LLM APIs.
- The safe LLM API is monitored aggressively enough that you feel okay about it being used by any code running inside your data center, even if that code isn't implementing safety measures itself.
- The unsafe API is much less strictly monitored, and is only allowed to be used by scaffolds that have demonstrated that they have implemented robust safety measures external to the LLM API. (A rough sketch of this routing policy follows below.)
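One way to picture this split is as a routing policy in front of inference: callers hit the heavily monitored tier by default, and only scaffolds on an allowlist (because they have demonstrated their own safety measures) reach the lightly monitored one. The tier names and the allowlist below are made up for illustration:

```python
# Illustrative routing policy for a "safe" vs "unsafe" LLM API split.
# The tier names and the allowlist of vetted scaffolds are hypothetical.

VETTED_SCAFFOLDS = {"internal-research-scaffold-v2"}  # scaffolds with demonstrated external safety measures

def choose_api_tier(caller_id: str) -> str:
    """Default to the aggressively monitored tier; only vetted scaffolds may
    use the lightly monitored one."""
    if caller_id in VETTED_SCAFFOLDS:
        return "unsafe_tier"  # minimal API-side monitoring; the scaffold monitors itself
    return "safe_tier"        # aggressive API-side monitoring, safe to call from anywhere
```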
Monitors in code reviews
Finally, you can add monitors where people are already doing a bunch of fuzzy monitoring: in code reviews! It is already commonplace to have humans review code before it is deployed with a higher level of permissions (e.g. when using infrastructure-as-code, the code is often only applied when merged). When AIs write more of the code, this becomes a natural place where you can monitor what AIs do. And if the review load becomes too high for humans, you can add AI assistance and eventually have AIs do most of the review work.
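As a sketch of what the CI flavor of this could look like (the diff-fetching helper and the monitor call are hypothetical stand-ins for your CI system's API and your monitor model):

```python
# Hypothetical CI check that asks an LLM monitor to review a pull request diff
# before it can be merged. get_pr_diff() and llm_review() are stand-ins.

SUSPICION_THRESHOLD = 0.7

def get_pr_diff(pr_number: int) -> str:
    """Fetch the unified diff for the pull request (stubbed)."""
    return ""

def llm_review(diff: str, pr_description: str) -> float:
    """Ask the monitor model how likely the diff is to be doing something
    dangerous or unrelated to its stated purpose, in [0, 1]. Stubbed."""
    return 0.0

def ci_monitor_check(pr_number: int, pr_description: str) -> bool:
    """Return True if the PR may be merged without extra human security review."""
    diff = get_pr_diff(pr_number)
    suspicion = llm_review(diff, pr_description)
    if suspicion >= SUSPICION_THRESHOLD:
        print("Monitor flagged this PR; requesting human security review.")
        return False
    return True
```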
Strengths:
- Maturity. People have been doing code reviews for ages.
- Execution context. You mostly know where code will be executed when merged (especially for things like infrastructure-as-code); it’s not just code that will be run in a sandbox.
- LLM context. While you might not get the full LLM transcript, you do get the rationale for implementing a change.
- Flexibility. This is a system independent of the core agent scaffold, so you can likely change it a lot without hurting productivity much, and it’s much more natural to fall back to a human than in the agent scaffold case.
Weaknesses:
- Not applicable to non-reviewed threats. This only works for threats that require malicious code to pass code review.
So where should you put the monitors?
Since they have overlapping strengths and weaknesses, ideally all of the above:
|  | Flexibility | Execution context | LLM context | Robustness | Generality | Maturity |
|---|---|---|---|---|---|---|
| Monitors in agent scaffolds | ++ | ++ | ++ | -/+[6] | + | - |
| Monitors in cyber threat detection systems | - | ++ | - | + | - | ++ |
| Monitors in LLM APIs | - | - | ++ | ++ | ++ | + |
| Monitors in code reviews | + | + | + | ++ | - | ++ |
My (low-confidence) prioritized list would look something like:
1. Make sure you are not missing the baseline cyber threat detection systems (if you are an AI company, you were supposed to do that anyway).
2. Then focus on building monitors where it is easy, which is probably in the agent scaffold.
3. Then add things to the API to make it more robust.
4. Then add automated monitors to the code review if humans start rubber-stamping code reviews because AIs are producing too much code.
5. And continuously maintain and improve all of the above.
But this depends a lot on the organization and the effort that people are willing to put into monitors. For example, I can imagine that in some organizations, causing bad outcomes is only possible if code is reviewed and merged, and that adding monitors to APIs and to agent scaffolds is hard, in which case I am sympathetic to focusing on code review monitors.
Appendix: Further subdivisions
- Agent scaffold SDKs vs outer agent scaffolds: maybe you can integrate the monitors into a simple “core” of the agent scaffold (e.g. the Claude Code SDK), or maybe you can add monitors when you build your “outer agent scaffold” (e.g. when you use the Claude Code SDK and add features using Claude Code Hooks).
- Core APIs vs API proxies: maybe you can have the LLM monitor built into the actual inference API, or maybe you can create an API proxy that does the monitoring and then routes to the regular API. The latter blurs the line a lot with what an agent scaffold does! For example, the Cursor API is part agent scaffold (since it does prompt manipulation) and part API proxy (since it routes the request to the appropriate API).
- CI testing vs code review assistance: maybe you can add an LLM monitor to the tests that have to pass in CI to merge a pull request, or maybe you can integrate it into the UI that code reviewers use to review pull requests.
- Detection on actions vs detection on consequences: maybe the cyber threat detection system can detect dangerous commands, or maybe it can detect the effects of dangerous commands.
1. By monitoring I mostly mean detection, which is an important component of most deployment-time AI control measures.
2. Working at AI companies has a bunch of other positive and negative consequences. This post is definitely not a generic endorsement of joining an AI company to help deal with AI misalignment risk later.
3. Note: I group here both front-end and backend scaffolds. In Claude Code, the agent scaffold runs on the same device as the one on which most actions are executed, but this does not have to be the case.
4. When monitoring human attackers, it is common to wait before acting in order to collect more data. This seems less essential for LLMs if it is possible to replay the attack.
5. But more effort could make them better! These teams are often hiring and buying services to make themselves more effective. I suspect this could be a great path to impact if you have a relevant background.
6. I’d rate the robustness of monitors in agent scaffolds as - for client-side agent scaffolds and + for server-side agent scaffolds.