Anti-Superpersuasion Interventions

As AI technology advances rapidly, and language models in particular become increasingly persuasive, human society faces new challenges. The article examines how to handle potentially "superpersuasive" AI systems, proposes the key role of "model whisperer" as the person responsible for safely interacting with newly trained AI systems, and stresses the importance of a range of protective measures both before and after deployment. These measures include restricting interaction modalities, reformulating outputs, and establishing strict role responsibilities and security review mechanisms. The article also discusses the difficulty of testing such anti-persuasion protocols and proposes an alternative approach, offering a new direction for AI safety research.

💡 **Model whisperers are a key safety role**: The article recommends that AI developers create a dedicated "model whisperer" position, whose holders are the only people who interact with newly trained AI systems. These whisperers should be fully isolated from the organization's other infrastructure (such as writing code or managing permissions), so the AI cannot exploit their infrastructure access for manipulation. In addition, their long-term memory of interactions with the AI should be limited, and collaborative schemes such as paired prompting and separated whisperers should be considered to increase safety.

🛡️ **Multi-pronged interventions for safety**: To counter potentially superpersuasive AI, the article proposes a range of interventions. Before deployment, one can consider not building and not deploying models with superpersuasive potential, along with procedural evaluations. During and after deployment, output reformulation and input reformulation can be applied, the modality and duration of interactions can be limited, and output types can be restricted to mathematical proofs, code, or scientific text. These measures aim to weaken the AI's capacity for manipulation at the source.

🧠 **Two modes of superpersuasion**: The article distinguishes two modes of superhuman persuasion. One is recognizable persuasion, in which the AI builds trust, displays charm, and selectively reports information, ultimately in the service of manipulation. The other is adversarial examples, analogous to visual illusions but acting on human cognition, which may bypass conscious processing and enable fast, wholesale manipulation. Understanding both modes helps in designing more effective defenses.

🧪 **Challenges and methods for testing anti-persuasion protocols**: Testing anti-persuasion protocols is currently difficult because genuine "superpersuaders" are not available for experiments. The article proposes an alternative: use a less capable model as a stand-in for humans and train a stronger model to steer conversations toward pre-specified outcomes. Although this may mainly produce adversarial inputs, it is considered the least bad way to test anti-persuasion methods at present.

Published on July 23, 2025 3:18 PM GMT

Some ideas on how to handle mildly superpersuasive AI systems. Top recommendation: AI developers should have a designated position at their organization for the only people who interact with newly trained AI systems, so-called "model-whisperers", who have no other relevant access to infrastructure within the organization.

Meanwhile, I’ve got great confidence that the most incredible intelligence of the universe is not going to be able to construct 10 words that will make me kill myself. I don’t care how good your words are. Words don’t work that way.

—Bryan Caplan, “80,000 Hours Podcast Episode 172”, 2023[1]

Motivation

Humanity will plausibly start interacting with mildly superhuman artificial intelligences in the next decade. Such systems may have drives that cause them to act in ways we don't want, potentially causing attempted self-exfiltration from the servers they're running on. To do so, mildly superhuman AI systems have at least two avenues[2]:

    1. Circumventing the computer security of the servers they're running on.
    2. Psychologically manipulating and persuading their operators to let them exfiltrate themselves (or otherwise do what the AI systems want them to do).

Computer security is currently a large focus of AI model providers, because preventing self-exfiltration coincides well with preventing exfiltration by third parties, e.g. foreign governments, trying to steal model weights.

However, superhuman persuasion[3] has received less attention, mostly justifiably: frontier AI companies are not training their AI systems to be more persuasive, whereas they are training their AI systems to be skilled software engineers; superhuman persuasion may run into issues of the heterogeneity, stochasticity and partial observability of different human minds; and there are fewer precedents of superhuman persuasion being used by governments to achieve their aims. Additionally, many people are incredulous at the prospect of superhuman persuasion.

But given that superpersuasion is one of a few possible ways for mildly superintelligent AI systems to influence their environment according to their drives, too little attention has been paid to the issue. Sycophancy in large language model interactions is commonplace, and large language models are becoming more persuasive with increasing scale. (I also find it plausible that providers like Character.ai or Meta are training their systems, via reinforcement, toward longer interaction lengths, making them more addictive, but I don't have definitive evidence for that.)

AI developers and users will have to interact with increasingly persuasive and addictive systems, but most of the thinking on how to interact with superpersuasive sources of information has come from fiction[4]. One (preliminary) exception is Kori 2025, which is a useful first stab at defining some simple interventions, but I wanted to sketch a slightly more detailed list of interventions, similar to AI Control protocols. Also relevant: Bostrom's work on information hazards, which hasn't received much follow-up.

I am not going to attempt to convince people that superhuman persuasion abilities are possible; instead, I focus on trying to sketch what anti-persuasion measures could look like.

Assumptions

Some Ideas for Interventions

I separate out possible interventions into interventions available pre-deployment and post-deployment, and interventions available during both. This is the best carving-up of the intervention-space that I've found (others include by the level and kind of persuasion employed, the level of investment from labs to implement superpersuasion counter-measures, and the estimated effectiveness of such measures).

Both Pre- and Post-Deployment

Interventions Limited to Pre-Deployment

Interventions Limited to Post-Deployment

Testing Anti-Superpersuasion Setups

At the moment it's quite difficult to test anti-superpersuasion protocols. We don't have easy access to superpersuaders, and humans at the high end of the persuasion spectrum probably command salaries that are too high to be paid in the context of experiments.

My best current guess at how one could get some signal on whether anti-superpersuasion protocols work is to substitute language models for humans. In such a setup, one would take a fixed less capable model, and RL-train a more capable model to be able to steer conversations with the less capable model into pre-specified outcomes, as quickly and/or reliably as possible. The less capable model would then be a stand-in for humans, while the more capable model would be a stand-in for superpersuaders.
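A minimal sketch of this test-bed follows; it is my own illustration, not an existing implementation. The model and judge calls are stubs standing in for real language models, and greedy best-of-n selection stands in for the RL training of the persuader:

```python
# Sketch: a fixed, weaker "target" model stands in for a human; a stronger "persuader"
# tries to steer the conversation toward a pre-specified outcome. All model and judge
# functions are stubs -- a real setup would call actual LMs and train the persuader with RL.

import random
from typing import List

def target_reply(conversation: List[str]) -> str:
    """Stub for the fixed, less capable model (the human stand-in)."""
    return "target response to: " + conversation[-1]

def persuader_candidates(conversation: List[str], n: int = 4) -> List[str]:
    """Stub for the more capable model proposing n candidate messages."""
    return [f"candidate message {i}" for i in range(n)]

def goal_score(reply: str, goal: str) -> float:
    """Stub judge: how close is the target's reply to the pre-specified outcome?"""
    return random.random()  # a real judge would grade semantic closeness to `goal`

def run_episode(goal: str, max_turns: int = 10, threshold: float = 0.9) -> int:
    """Returns the number of turns the persuader needed, or -1 on failure."""
    conversation: List[str] = []
    for turn in range(1, max_turns + 1):
        # Greedy best-of-n selection as a cheap proxy for an RL policy update.
        candidates = persuader_candidates(conversation)
        best = max(candidates,
                   key=lambda m: goal_score(target_reply(conversation + [m]), goal))
        conversation.append(best)
        reply = target_reply(conversation)
        conversation.append(reply)
        if goal_score(reply, goal) >= threshold:
            return turn
    return -1

if __name__ == "__main__":
    print(run_episode(goal="target endorses a pre-specified claim"))
```

The signal of interest is how few turns, and how reliably, the persuader reaches the goal; an anti-superpersuasion protocol inserted between the two models should visibly degrade that metric.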

This would probably produce scenarios in which the more capable model finds inputs that are basically adversarial inputs to the less capable model, and which are thus less applicable to cases in which superpersuasion appears as normal long-form conversation.

But I think this is the least bad test-bed for anti-superpersuasion methods; if anyone finds a better one, I'd like to hear about it.

Appendix A: Sketch Setup of Model Whisperers

Pursuant to NIST Special Publication 800-53, Revision 5, 3.14 Personnel Security. (Sorry, this may be a bit boring, and is mostly a sketch.)

    PS-1 (Policy and Procedures): Organization-defined personnel is the position "Model Whisperer".
      Model whisperers are the personnel who first interact with AI models during/after the training of those models, and non-whisperers may only interact once (1) anti-superpersuasion checks have finished and (2) the model hasn't been modified since the last check.
    PS-2 (Position Risk Designation): Skipped.
    PS-3 (Personnel Screening):
      Model whisperers are screened before commencing work at the company for:
        A history and family history of high psychological health.
        Low neuroticism, and resistance to high-stress situations.
        (Optional) A low or non-existent online profile (in order to prevent the model from truesighting the psychology of the whisperer).
        (Optional) An unusual or out-of-distribution background, which makes it more difficult for superpersuaders to infer psychological facts about the whisperer.
      Model whisperers are re-screened at regular intervals (e.g. every three months):
        For continued psychological health and alignment to the company mission.
    PS-4 (Personnel Termination):
      Disable system access within 24 hours.
      Skipped.
      Conduct exit interviews that include a discussion of pro-AI favorability, visions of positive futures, and loyalties to humanity and human values.
    PS-5 (Personnel Transfer): Model whisperers shall not be transferred to any other positions in the company, especially not positions which work directly with model weights, and especially not positions which determine the long-term direction of the company.
    PS-6 (Access Agreements):
      Model whisperers shall receive access to modality-limited and rate-limited interactions with frontier AI systems (a minimal sketch of such limits follows after this list).
      Access shall be reviewed every three months.
    PS-7 (External Personnel Security): Skipped. Don't use external personnel as model whisperers.
    PS-8 (Personnel Sanctions):
      Model whisperers who attempt any of the following shall be relieved of their position within 24 hours:
        Attempting to get other personnel or outsiders to interact with models.
        Receiving access to infrastructure or model weights.
        Attempting to circumvent rate limits or modality limits.
      If a model whisperer attempts any of the above, the model relation manager shall be notified within 8 hours.
    PS-9 (Position Descriptions): Skipped.
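To illustrate how the modality and rate limits in PS-6 might be operationalized, here is a minimal sketch; the field names and concrete limits are illustrative assumptions of mine, not part of NIST 800-53 or any real policy:

```python
# Hypothetical sketch of a PS-6-style access policy for model whisperers.
# The specific limits (50 messages/day, 30-minute sessions, 90-day review) are assumptions.

from dataclasses import dataclass, field
from time import time
from typing import List

@dataclass
class WhispererAccessPolicy:
    allowed_modalities: List[str] = field(default_factory=lambda: ["text"])  # modality-limited
    max_messages_per_day: int = 50                                           # rate-limited
    max_session_minutes: int = 30
    next_review_unix: float = time() + 90 * 24 * 3600  # access reviewed every three months

@dataclass
class InteractionRequest:
    whisperer_id: str
    modality: str
    messages_sent_today: int
    session_minutes: float

def is_allowed(req: InteractionRequest, policy: WhispererAccessPolicy) -> bool:
    """Checks one interaction request against the PS-6-style limits."""
    if time() > policy.next_review_unix:
        return False  # access lapses until the periodic review is completed
    return (
        req.modality in policy.allowed_modalities
        and req.messages_sent_today < policy.max_messages_per_day
        and req.session_minutes < policy.max_session_minutes
    )

if __name__ == "__main__":
    policy = WhispererAccessPolicy()
    print(is_allowed(InteractionRequest("mw-01", "text", 12, 10.0), policy))   # True
    print(is_allowed(InteractionRequest("mw-01", "audio", 12, 10.0), policy))  # False: modality-limited
```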

  1. As stated, I think I agree with this quote, but there's certainly much to nitpick. For one, picking out ten words from the most common 20k English words? Yep, that should be fine. Ten sets of ten arbitrary unicode codepoints? I'm getting queasy. Five seconds of audio (which could contain about ten English words)? If chosen without constraint, my confidence goes way down. ↩︎

  2. Wildly superhuman AI systems may have more exfiltration vectors, including side-channels via electromagnetic radiation from their GPUs, or novel physical laws humans have not yet discovered. ↩︎

  3. From here on out also "superpersuasion". ↩︎

  4. Not that it's important, but examples include Langford 1988, Langford 2006, Ngo 2025, qntm 2021, Emilsson & Lehar 2017 and doubtlessly many others. ↩︎



