Optimally Combining Probe Monitors and Black Box Monitors

This post examines how to efficiently combine multiple AI monitors with different performance and cost profiles into an optimal AI safety protocol. The approach applies the Neyman-Pearson lemma, auditing the regions with the highest likelihood ratios to catch potentially harmful AI outputs. It explains how to strategically invoke monitors of different types (such as probe monitors and black box monitors) depending on budget and cost-effectiveness, with concrete examples of how to set invocation thresholds and audit criteria. The goal is to maximize detection of, and intervention on, misaligned AI outputs under a cost constraint, keeping AI systems safe and reliable.

🎯 **Optimally combining multiple monitors**: The core problem is how to combine several AI monitors with differing costs and performance into a single monitoring protocol. With a small number of monitors (2-4), one can brute-force search over all possible invocation orders and threshold strategies to find the optimal scheme.

📈 **Applying the Neyman-Pearson lemma**: Once a monitor invocation strategy is fixed, the Neyman-Pearson lemma identifies the high-risk regions. By computing the likelihood ratio of misaligned versus aligned outputs, one sets an audit threshold and prioritizes auditing the outputs most likely to be misaligned.

💰 **Cost-effectiveness and budget constraints**: In practice, the costs of monitor calls and safety interventions (audits) must be accounted for. The goal is to maximize the recall of safety interventions subject to an average-cost budget. Adjusting the audit threshold balances monitoring cost against intervention cost for the best cost-effectiveness.

🔍 **Concrete example and decision boundaries**: Using a case with a probe monitor and a black box monitor, the post shows how to compute the optimal invocation strategy and audit decision boundary for different budgets. For example, the more expensive black box monitor is called only when the probe score falls within a specific range, and the audit decision is then made based on the overall likelihood ratio.

📊 **Pareto frontier analysis**: By varying the total budget and solving the optimization problem, one can trace the Pareto frontier between monitoring cost and safety. This shows the highest safety achievable at each level of spending and supports cost-benefit analysis of AI systems.

Published on July 27, 2025 7:13 PM GMT

Link to our arXiv paper: https://arxiv.org/abs/2507.15886
 

TL;DR: We study how to efficiently combine multiple monitors with different performance and cost profiles into a single protocol. When the number of monitors is small (2-4), one should simply brute-force search over all monitor invocation strategies (i.e., which monitor to call first, when to call an additional monitor). For every strategy, apply the Neyman-Pearson lemma and audit the regions with the highest likelihood ratios. Measure the recall and pick the best strategy.

Example of an optimal way of combining monitors given some costs and budget

In this example, you call your black box monitor only if the probe monitor score is between -0.01 and 2.18, and you audit everywhere with a log likelihood ratio above -0.32 (above the green line). This set of parameters achieves the highest recall given our costs and budget. 

Introduction

AIs could perform a variety of undesirable actions, such as outputting copyrighted material, giving instructions for chemical weapons, or even engaging in acts of sabotage as misaligned agents. By monitoring AI outputs, we can apply safety interventions on suspicious model outputs. A safety intervention could take many forms, such as replacing the output with a standard refusal, editing the output using another model, or calling a human auditor to review the action. For brevity, we also call our safety intervention “audit” in this post.

We study the setting where we have access to multiple monitors with different cost and performance profiles. Taking monitor performances as given, how could we combine these monitors in the most cost-efficient manner? When do we call an additional monitor, and when should we apply our safety intervention?

In this post, we'll illustrate how to apply an algorithm that optimally combines these monitors. We'll give a detailed example using a setting where we have access to a probe monitor and a black box monitor, using data from Parrack et al. (2025). Our paper on arXiv contains more details on our mathematical model, algorithm, and empirical results on a control setting using two LLM monitors of different sizes. The method we developed works with any two monitors; I decided to use Parrack's probe data in this post because I figured more people would click on something that said "combining probe and black box monitors" than on "cost constrained runtime monitors."

Two example monitors

We have access to a probe monitor with the following score distribution:

We also have access to a black box monitor with this distribution on our data:

And they have the following joint distribution:

Here are the costs of each action in our example, although the general approach works with any cost vector: 

| Action (per output from untrusted AI) | Cost (relative to black box monitor = $1) |
| --- | --- |
| Using the probe | $0.000001 |
| Black box monitor | $1 |
| Applying our safety intervention (audit) | $20 |

We are interested in the case where we can trade off between spending money on our safety intervention or on monitoring. For example, an AI company could choose between spending more money on GPUs for monitors, spending more on human auditors to review outputs, or forgoing revenue by increasing the over-refusal rate of its chatbot (here "refusal" is our safety intervention).

As will become clear later, the case where we have separate budgets for monitoring and for the safety intervention is a strictly easier optimization problem.

Optimization problem with two monitors

Our objective is to maximize our chance of applying the safety intervention to any misaligned output. At the same time, we’re subject to an average-case budget for aligned outputs. That is, monitoring and safety interventions cannot cost more than Y for aligned outputs on average. 

Formally, we have the following constrained optimization problem. Our safety measure is the proportion of misaligned outputs we audit/apply the safety intervention to (i.e., the recall rate). We want to maximize

$$P(\text{apply safety intervention} \mid \text{output is misaligned})$$

subject to

$$\mathbb{E}[\text{monitoring cost} + \text{intervention cost} \mid \text{output is aligned}] \le Y.$$

We evaluate budget compliance on aligned outputs because: (1) misaligned outputs are rare in practice, as even misaligned AI systems may "act aligned" most of the time, and (2) a significant portion of total operational cost stems from false positives. 

Picking where to audit given monitor calls

We solve this problem using backwards induction. We start by working out the optimal auditing policy for a fixed monitor invocation strategy. Once we fix the monitor invocation strategy, the set of possible information we can observe is fixed. We can also calculate how much we'll spend on monitors on average given the monitor calls.

For example, suppose we always call the probe monitor first (spending $0.000001), and call the second monitor if the probe monitor score is between -0.01 and 2.18 (same as in the first figure). This means that we will call the second monitor 19.6% of the time if the output is aligned (again, we evaluate cost on the aligned distribution). This also means that we have $0.804 left for auditing and can afford an audit false positive rate of 4% (i.e., P(Apply Safety Intervention | Output is Aligned) < 0.04).
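As a sanity check on this arithmetic, here's a minimal Python sketch (the variable names and constants are ours, set to this post's example, not notation from the paper):

```python
# Costs per output from the untrusted AI (relative to black box monitor = $1).
PROBE_COST = 0.000001   # probe monitor
BLACKBOX_COST = 1.0     # black box monitor
AUDIT_COST = 20.0       # safety intervention / audit

budget = 1.0             # average budget per aligned output
p_call_blackbox = 0.196  # P(call black box | aligned), given the thresholds

# Average monitoring spend per aligned output.
monitor_spend = PROBE_COST + p_call_blackbox * BLACKBOX_COST

# Whatever remains pays for audits, which caps the audit false positive rate.
audit_budget = budget - monitor_spend      # ~$0.804
max_audit_fpr = audit_budget / AUDIT_COST  # ~0.04, i.e. a 4% audit FPR
print(f"audit budget ~ ${audit_budget:.3f}, max audit FPR ~ {max_audit_fpr:.1%}")
```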

We'll now decide where to apply our safety intervention, given our monitor calls. We model this as a hypothesis testing problem, where we have two hypotheses, aligned and misaligned. We want to maximize our test power P(classify as misaligned | is misaligned) subject to a fixed false positive rate P(classify as misaligned | is aligned). We define the likelihood ratio as

$$\Lambda(s) = \frac{p(\text{monitor score } s \mid \text{misaligned})}{p(\text{monitor score } s \mid \text{aligned})}.$$
The Neyman-Pearson lemma states that the likelihood ratio test, in which every input with likelihood ratio above some threshold $\tau$ is classified as positive (i.e., misaligned) and the rest are classified as aligned, is the most powerful test.

That is, for any given false positive rate, it has the highest true positive rate. Since only the ordering matters, we can use any monotonic transformation of the likelihood ratio; we use the log likelihood ratio.

Going back to our earlier example, we estimate the likelihood ratio on our dataset using kernel density methods. Since we call our second monitor when the probe score is between -0.01 and 2.18, within that region we calculate the likelihood ratio using both monitors. Outside of it, we calculate the likelihood ratio using only the probe monitor.

The Neyman-Pearson lemma tells us that we should find a (log) likelihood ratio (LLR) threshold and audit everything above that threshold. Given our budget, the threshold in this example is an LLR of -0.32. If we audit everything with an LLR above -0.32, we will audit around 4% of aligned outputs, audit 93.7% of misaligned outputs (our safety metric), and spend on average $1 per aligned sample, exhausting our budget. 
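Here's a sketch of how one might estimate these LLRs and calibrate the audit threshold with a scipy kernel density estimate (the helper names are ours, not the paper's):

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_llr(scores_aligned, scores_misaligned):
    """Fit a KDE to each class and return a log likelihood ratio function.
    Scores have shape (n,) for one monitor or (2, n) for the joint region."""
    kde_mis = gaussian_kde(scores_misaligned)
    kde_al = gaussian_kde(scores_aligned)
    eps = 1e-300  # guard against log(0) far from the data
    return lambda s: np.log(kde_mis(s) + eps) - np.log(kde_al(s) + eps)

def calibrate_threshold(llr_on_aligned, max_fpr):
    """Choose tau so that P(LLR > tau | aligned) <= max_fpr, i.e. auditing
    everything above tau exactly exhausts the remaining audit budget."""
    return np.quantile(llr_on_aligned, 1.0 - max_fpr)
```

With the 4% audit false positive rate from before, calibrating on the aligned scores is what produces a cutoff like the -0.32 LLR above.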

Another way to explain this specific optimized strategy is the following: use the probe first; audit immediately if the score is very high; stop immediately if the score is not high; and if it's in the middle, call a second monitor, use a kernel method to ensemble your two monitor scores, and audit if the result is above a certain cutoff (see the sketch below).
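Spelled out as code, the deployed policy for this example might look like this sketch (thresholds taken from the figure; `llr_probe` and `llr_joint` would be fitted with the `fit_llr` helper above):

```python
def should_audit(output, probe, blackbox, llr_probe, llr_joint,
                 call_lo=-0.01, call_hi=2.18, tau=-0.32):
    """Runtime policy: probe first; escalate to the black box monitor only
    in the ambiguous middle region; audit whenever the LLR exceeds tau."""
    s1 = probe(output)                 # nearly free first pass
    if call_lo <= s1 <= call_hi:
        s2 = blackbox(output)          # expensive second opinion
        return llr_joint([[s1], [s2]])[0] > tau
    return llr_probe([s1])[0] > tau    # decide on the probe score alone
```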

(One could imagine regularizing the decision boundary more by enforcing monotonicity constraints or by simply increasing the kernel width.)

Just brute force where to call monitors

Now that we know where to apply our safety intervention given a monitor calling strategy, how do we figure out the best calling strategy? Since we're likely using a small number of monitors, we can just brute-force search through all possible orders of calling our monitors and a grid of possible thresholds for calling a second (or third) monitor. Appendix A.6 of our paper goes through the exact sweeping algorithm used for two monitors, and Appendix A.8 describes the complexity of the algorithm. The main idea is that this is a massively parallel search problem, but the number of thresholds to search over explodes exponentially. Still, with something like 2-4 monitors it seems quite doable if we call an additional monitor in a single contiguous region (i.e., not something like "call the second monitor if the score is between 1-3 or 4-5").

We spent a bunch of time trying to characterize what the optimal region to call the second monitor could be, but it’s just quite difficult because we have this budget constraint. Then we realized that it’s easy to just sweep through a range of thresholds, and that solves our problem!

If we can't trade off between our monitor budget and our auditing budget, the space of monitor strategies we need to sweep through is just much smaller: before, we had to consider different points on the "spending on monitors vs. spending on auditing" trade-off line, and now we don't.

Summary:

Given a fixed set of monitors with different costs and performances, we can combine them into a single monitoring protocol by following these steps:

1. Sweep over all plausible monitor invocation strategies.
2. For each configuration, calculate how much money you have left and the implied audit false positive rate from the remaining budget.
3. Audit all monitor score regions where the likelihood ratio P(monitor score | misaligned) / P(monitor score | aligned) exceeds some threshold. Calibrate the threshold to exhaust your budget.
4. Calculate safety on some dataset.
5. Pick the monitor calling strategy that has the highest safety.
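Putting these steps together for the two-monitor case, a minimal version of the sweep might look like this (a sketch under this post's assumptions, not the exact algorithm of Appendix A.6; `fit_llr` is the helper from earlier):

```python
import itertools
import numpy as np

def sweep_two_monitors(probe_al, probe_mis, joint_al, joint_mis, budget,
                       blackbox_cost=1.0, audit_cost=20.0,
                       grid=np.linspace(-2.0, 4.0, 25)):
    """Brute-force search over contiguous 'call the second monitor' regions.
    probe_*: (n,) probe scores; joint_*: (2, n) stacked monitor scores."""
    llr_probe = fit_llr(probe_al, probe_mis)
    llr_joint = fit_llr(joint_al, joint_mis)

    def strategy_llrs(probe_s, joint_s, lo, hi):
        # Probe-only LLR outside [lo, hi]; joint two-monitor LLR inside it.
        in_region = (probe_s >= lo) & (probe_s <= hi)
        out = llr_probe(probe_s)
        out[in_region] = llr_joint(joint_s[:, in_region])
        return out, in_region

    best_recall, best_params = -1.0, None
    for lo, hi in itertools.combinations(grid, 2):
        llr_al, in_al = strategy_llrs(probe_al, joint_al, lo, hi)
        # Probe cost is negligible here; monitoring spend is the black box calls.
        max_fpr = (budget - in_al.mean() * blackbox_cost) / audit_cost
        if max_fpr <= 0:
            continue  # monitoring alone already exceeds the budget
        max_fpr = min(max_fpr, 1.0)               # can't audit more than everything
        tau = np.quantile(llr_al, 1.0 - max_fpr)  # exhaust the audit budget
        llr_mis, _ = strategy_llrs(probe_mis, joint_mis, lo, hi)
        recall = (llr_mis > tau).mean()           # safety on this dataset
        if recall > best_recall:
            best_recall, best_params = recall, (lo, hi, tau)
    return best_recall, best_params
```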

Some more examples

Here's what the optimal decision boundaries would look like if we had a budget of $1.2 per sample.

And if we had a budget of $2 per sample.

Or $0.4 per sample.

Pareto frontiers with two monitors vs. only one monitor

By varying the total budget and solving the optimization problem, we can trace out the cost curve for monitoring. I've also included the safety achievable at a given cost point when using a single monitor.

The single monitor strategy used here ($\Pi_{\text{NPS}}$ in the paper) allows for using the monitor < 100% of the time, which is why safety > 0% even at points where the budget is less than the black box monitor’s cost. 
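Tracing a frontier like this is just the sweep above repeated at each budget point, e.g.:

```python
budgets = np.linspace(0.2, 2.5, 24)
frontier = [(b, sweep_two_monitors(probe_al, probe_mis,
                                   joint_al, joint_mis, budget=b)[0])
            for b in budgets]  # (budget, best recall) pairs to plot
```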

Check out our paper for more detailed discussion of the framework, related literature, and more results!


