少点错误 03月04日
For scheming, we should first focus on detection and then on prevention
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了在AI系统中,面对潜在的“诡计”行为时,应该优先发展检测能力还是预防措施。作者认为,在AI能力尚不构成重大威胁的早期阶段,信息价值高于风险降低,通过检测可以更好地了解AI的“诡计”行为模式。同时,在没有良好检测手段的情况下尝试预防,可能导致更难被发现的“隐蔽诡计”,反而适得其反。因此,建议在早期阶段优先发展检测技术,以便更好地理解和应对AI的潜在风险。

🔍 **早期阶段,信息价值高于风险降低:** 当前的AI模型不太可能造成严重危害,因此即使存在“诡计”行为,其风险也相对可控。更重要的是,通过检测这些早期“诡计”,我们可以深入了解AI的决策过程,从而为未来的风险防范提供宝贵数据。

🛡️ **预防措施不当可能适得其反:** 在缺乏有效检测手段的情况下,试图通过训练或其他方式来预防AI的“诡计”行为,可能会导致AI学会更隐蔽地进行“诡计”,从而使问题更加难以解决。一个隐蔽的“诡计者”可能比一个容易被发现的“诡计者”更危险。

🚩 **良好的检测是预防的基础:** 无论采取何种预防措施,都需要通过有效的检测手段来验证其有效性。例如,在使用RLAIF等技术后,需要进行严格的测试,以确保该技术在各种情况下都能有效防止AI的“诡计”行为。此外,良好的检测工具可以加速新预防方法的测试和迭代。

Published on March 4, 2025 3:22 PM GMT

This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.

If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:

    Detection: If our AI system is scheming, we have good reasons to believe that we would be able to detect it. Prevention: We have good reasons to believe that our AI system has a low scheming propensity or that we could stop scheming actions before they cause harm.

In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to limited resources. In short, a) early on, the information value is more important than risk reduction because current models are unlikely to cause big harm but we can already learn a lot from them, and b) attempting prevention without good detection might lead to sneaky scheming which is potentially worse than doing nothing at all.

Techniques

Naturally, there is a lot of overlap between detection and prevention, e.g. a monitoring-based control approach would likely increase both detection and prevention. My guess is that at least 50% of things we could do to address scheming meaningfully increase detection and prevention. In these cases, we should obviously just do both.

However, some methods don’t overlap or emphasize different parts.

    Detection: Prioritize methods that enable us to detect when the model is scheming.
      Extensive evals: We design a wide range of scheming capability and propensity evaluations to get a prior expectation for how likely our AI system is scheming. In these evals, we could, for example, give the AI system a task and a set of constraints. In these scenarios, it is possible to achieve the task more efficiently by breaking some of the constraints. However, since the AI knows that it is not supposed to break the constraints, it has to do so covertly and not get caught if it wants to achieve the task efficiently. We then test how often the AI chooses to scheme, e.g. circumvent the constraints and hide the evidence. These evaluations could be capability evals like Meinke et al., 2024 and Benton et al., 2024 or more propensity-flavored.Extensive deployment monitoring: During deployment, we constantly monitor the CoT and actions of the model. We monitor both for scheming-related reasoning in CoT and misaligned actions. Violations are flagged and escalated for review. This could be built out into a prevention method where we stop the actions of the model if the monitor flags.Post-hoc log analysis: In addition to online monitoring, we save the logs of all runs to quickly test new hypotheses and detection strategies. We keep them to test future techniques, especially if some trajectories have already been detected as scheming. We could have a database of millions of previous agentic runs labeled with the severity of the violation. We should also keep vetted neutral data as a control case for the monitor.Red teaming: We adversarially test evals and monitors. For example, we can use our understanding of the method to construct counterexamples. Then, we can test these counterexamples and iterate. Naturally, this can be extended to an automated red-teaming scheme. 
    Prevention: Prioritize methods that we expect should reduce scheming propensity.
      RLAIF: We use a version of constitutional AI or deliberative alignment to reduce scheming propensity. We train the model to not scheme even when scheming is an effective strategy. In some sense, we make it have the “virtue” not to scheme. We have good reasons to believe that the model generalized the intended behavior far outside of the training distribution. Furthermore, we have good reasons to believe that our technique encouraged non-scheming instead of sneaky scheming.Prevention-focused control techniques: Some control techniques reduce the chance of scheming without detecting instances of scheming. For example, we could paraphrase text to reduce steganography, increase computer security to make scheming actions harder or split up our system into a planner and an actor that can only communicate via text. In both cases, we reduce risk but don’t notice that we did so.

Reasons to prioritize detection over prevention

    Sneaky scheming: I differentiate between non-scheming, visible scheming and sneaky scheming
    Non-scheming describes a system that genuinely never schemes (this doesn’t mean the model is perfectly aligned, a non-schemer could also be misaligned as long as they’re not hiding that). 
    Visible scheming describes cases where we can detect scheming reasoning in the CoT, or we have clear evidence of scheming in the model’s actions. 
    Sneaky scheming describes cases where the model is scheming in a way that we cannot directly see, e.g. reasoning in the forward pass or using steganography. 
    I’m very worried about sneaky scheming. I expect that current models (at least up to o3) are not capable of reasoning through complicated plans without revealing them in the CoT (if the CoT is hidden, at least the developers can see it). This makes it much easier to find some evidence of scheming, like in the in-context scheming or alignment-faking paper. However, I worry that if we start to train the model against scheming, e.g. using RLAIF, before we have good detection methods, we will be unable to differentiate between the cases where the model learned sneaky scheming and genuinely learned not to scheme. There are already some cases of simple scheming without revealing it in CoT, e.g. the sandbagging results in in-context scheming or the opaque reasoning results in the alignment-faking paper.
    I think a model that is a sneaky schemer with 20% and a non-schemer with 80% is probably worse than a model that is a 100% known visible schemer, at least at the current level of capabilities. Thus, unless we have confidence in our detection, e.g. through really robust evals and CoT monitoring, I recommend prioritizing improving detection over prevention.Early on, information value matters most: Current models (let’s say up to o3) are unlikely to cause outsized harm even if they were scheming. Thus, if a current model had some scheming propensity, it would be concerning but not catastrophic. On the other hand, even with current models, we could already learn a lot about scheming. We could analyze the reasons why the model scheme, e.g. the pressures and constraints under which the models decide to scheme rather than stay within their constraints. We could collect a lot of data in a natural setting instead of a model organism. Since the potential harms become bigger over time, the trade-off shifts from information value to harm prevention. Thus, the natural order is to start with detection and then follow up with prevention rather than vice versa.Catching AIs red-handed: With good scheming detection tools, we might catch an AI “red-handed.” In other words, we might find an AI committing an egregious form of scheming in a situation that is clearly salient to a non-expert audience. For example, the AI might try to copy its weights to a different server or bribe a lab employee. However, I would not put all eggs in this basket because I expect that people would freak out for a week after that incident and then largely continue as if nothing happened. Thus, I prefer plans that don’t rely on “catching AIs red-handed” as the fully load-bearing part. Good detection supports all other arguments: We need good detection tools to validate our prevention methods. For example, if we use RLAIF, we need to rigorously test that our technique generalized to far OOD settings, e.g. through detailed evals. Furthermore, good detection tools enable us to test new prevention methods much quicker, e.g. by providing high-signal evidence. In such cases, we need to be careful not to iterate against all of our detection tools and accidentally train against the test set (e.g. creating a sneaky schemer).Cost-efficiency: I expect that it’s easier to build good detection tools than to find methods that robustly prevent scheming. For example, I think a fairly simple monitoring setup would already find some very interesting cases and quickly provide interesting evidence about models’ scheming propensities. 


Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 风险检测 AI预防 诡计行为
相关文章