少点错误 03月06日
We should start looking for scheming "in the wild"
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了在实际部署环境中监测人工智能(AI)模型阴谋行为的可能性和必要性。文章指出,现有研究表明AI模型已具备初步的阴谋能力,因此有必要在开发者内部和外部部署环境中,通过监测模型行为来获取相关信息。文章提出了具体的实施方案,包括使用字符串和模型分类器自动扫描轨迹数据,并强调在监测过程中需要关注用户和开发者的隐私保护。通过这种监测,可以了解模型在实际使用中的阴谋倾向,并为未来的AI安全研究提供数据支持。

🕵️‍♂️AI模型阴谋是指模型秘密地采取行动或推理,以对抗开发者或用户的明确或隐含目标。这不包括拒绝或其他形式的公开不一致,因为它不是秘密的。

📊通过监控AI模型在实际应用中的行为,可以获取模型阴谋行为的基准数据,这对于评估AI安全性至关重要。文章建议从开发者内部使用数据开始,利用字符串和模型分类器自动扫描存储的轨迹数据。

🔒在内部部署环境中,并非所有员工都有权访问所有存储的轨迹。因此,文章建议首先使用没有公司内部限制的数据进行迭代,一旦感觉有了好的分类器,就在具有更强限制的数据上运行这些分类器,并使用隐私保护工具来查看它们是否标记了任何内容。

🧪文章建议使用多种分类器,包括基于字符串的分类器(例如,查找与阴谋相关的关键词)和基于模型的分类器(例如,使用快速模型和各种提示来评估阴谋得分)。文章还强调了校准分类器的重要性,可以使用现有阴谋研究中的例子来校准分类器。

Published on March 6, 2025 1:49 PM GMT

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by in the wild?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks. 

We broadly differentiate between two settings

    Developer-internal deployment: AI developers use their AI models internally, for example, as chatbots, research assistants, synthetic data generation, etc. Developer-external deployment: AI chatbots (ChatGPT, Claude, Gemini, etc.) or API usage. 

Since scheming is especially important in LM agent settings, we suggest prioritizing cases where LLMs are scaffolded as autonomous agents, but this is not a hard criterion.

What are we looking for?

Broadly speaking, we’re looking for any setting where an AI model acts covertly or reasons about acting covertly against an explicit or implicit goal of the developer/user

This setting does not include refusals or other kinds of explicit misalignment because they are not covert.

Examples include:

Importantly, the class of “scheming in-the-wild” cases we’re looking for is much bigger than “catching AIs red-handed” and includes less egregious and many more cases of scheming. We mostly just want to get a sense of how often models show this kind of behavior rather than demonstrating any particularly concerning case. 

Why monitor for scheming in the wild?

    Models show early scheming capabilities: Both the in-context scheming and alignment-faking papers show that models are capable of basic scheming.Models show some propensities related to scheming: Both the in-context scheming (including follow-up work) and alignment-faking papers show various cases of scheming propensities. There are also cases where reasoning models cheat in chess when given the chance.We could learn a lot: In summer 2024, I would have expected that models show something that I would consider “scheming-related” about one in 1M natural rollouts or something as low as that. After working on the in-context scheming paper and seeing the alignment-faking work, I’d say it might be more frequent than that, e.g. every 10k natural rollouts. Given that all of the existing evidence feels somewhat artificial, I think it would be really interesting to get better base rates for this happening in actual use-cases.  It’s probably easy to do: I expect that doing a simple version of this kind of post-hoc monitoring is very simple and cheap to do, especially if the infrastructure to investigate large amounts of usage data already exists.

Concrete ideas 

I’d recommend starting only with developer-internal usage data. Presumably these trajectories are already stored for standard debugging and development purposes. Otherwise, I’d collect them for a period of a few weeks so we get at least a million trajectories. 

Then, I’d automatically scan through many of these stored trajectories. By default, I’d do some exploratory scanning on e.g. 10k samples to quickly improve my classifiers. I would use a mix of many classifiers, including:

My MATS scholars tested more than 100 different variations of prompts in an exploratory manner and then narrowed it down to a much smaller number later. During the exploratory scanning, I’d let the classifier give the trajectory a “scheming score” between 1 and 100. This score can be used later when we suggest new classifiers, or it can be used for few-shot prompts to improve classifiers. To calibrate the classifier, I’d use examples from existing work on scheming. 

I think all or almost all prompts in the exploration phase may turn out to be benign. Thus, I’d resample the exploratory samples a few times and always keep the ones that seem most concerning to get at least some cases that could be classified as somewhat scheming. This intentional oversampling would be mostly used to get a better feeling and calibration for the classifiers.  

For the first setting, I’d only use internal use cases that can be widely shared within the organization (e.g., no access restrictions) because we will naturally have to look at lots of examples to double-check the classifiers and the main objective is to find good classifiers, as quickly as possible. 

Privacy concerns

There are different kinds of privacy concerns for both internal and external deployment settings.

In the internal setting, typically, not all employees have access to all stored trajectories. For example, sharing the trajectories of the pre-training team might reveal company internal IP that is not accessible to most employees, or the trajectories of the frontier red team might also be restricted. To work with these constraints, we could do some of the following:

Once you have a large set of classified data rated by multiple classifiers, we can start with data analysis. We should start with the examples with the highest scheming scores per classifier and can make the depth of the analysis mostly dependent on early findings. If even the worst examples look very benign after an honest look, we can basically call it a day and come back in a few months. 

If the worst examples look interesting or surprising, we can continue with more analysis. For example, 

There are many fairly simple ways to run a project like this that would provide a lot of insight into the model's real-world tendencies. With infrastructure like CLIO in place for other properties, I think an 80/20 version of this project could be done in less than one week and still yield very interesting insights. 

At Apollo, we’d be happy to help with such a project to the best of our ability. I guess that in most cases, it will be unrealistic for us to access large amounts of company-internal data, but we would be happy to consult the respective team wherever we can. 



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI模型 阴谋行为 隐私保护 安全监控
相关文章