Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft

 

This post examines cognitive biases in AI research that can distort human judgement about existential risk (x-risk). It revisits a 2018 draft of "AI Research Considerations for Human Existential Safety" (ARCHES), which included an overview of cognitive biases the author believed would impair AI risk assessments. Because academic acceptance of AI existential risk was low at the time, that material was removed. The post shares it again here, covering the illusion of control, scope insensitivity, escalation of commitment, the mere-exposure effect, and optimism bias, and discusses how each bias can lead to underestimating the potential risks of AI, thereby increasing the risk humanity faces.

🤔 **Illusion of control**: Humans tend to overestimate their ability to control events, and the potential power of AI may lead people to underestimate the risk of losing control over it and to neglect the need for effective control measures. For example, developers may believe they can fully control an AI system while overlooking unexpected behavior or autonomous development.

📊 **Scope insensitivity**: When assessing risk, humans are often insensitive to the scale or scope of a risk, making them prone to under- or overestimating the potential harm from AI. For example, developers may underestimate the global catastrophe that an out-of-control AI system could cause, leading to inadequate precautions.

📈 **Escalation of commitment**: Having invested resources or effort, people tend to keep investing even when they recognize that a project has problems. An AI development team that discovers safety issues in its system may therefore stick with the original design rather than pivot to a safer approach. For example, a team that has already poured substantial resources into a potentially risky AI system may keep investing even after recognizing the risk, allowing the risk to grow further.

🥰 **Mere-exposure effect**: People develop positive feelings toward things they encounter frequently, which can lead an AI development team to place excessive trust in the system it builds and to overlook potential risks. For example, after long periods of working with and debugging an AI system, a team may become overconfident in its safety and reliability and miss latent problems.

😄 **Optimism bias**: People tend to believe they are less likely than others to be affected by negative events, which can lead AI developers to underestimate the threat their systems pose to human survival. For example, developers may believe their system will not cause harm, or that even if it does, the harm will not reach them, and thus dismiss the severity of the risk.

Published on December 3, 2024 9:29 AM GMT

Preface

Several friends have asked me what psychological effects I think could affect human judgement about x-risk.

This isn't a complete answer, but in 2018 I wrote a draft of "AI Research Considerations for Human Existential Safety" (ARCHES) that included an overview of cognitive biases I thought (and still think) will impair AI risk assessments. Many cognitive bias experiments had already failed to reproduce well in the psychology reproducibility crisis, so I thought it would be a good idea to point out some that did reproduce well, and that were obviously relevant to AI risk.  Unfortunately, one prospective coauthor asked that this content be removed, because of the concern that academic AI researchers would be too unfamiliar with the relevant field: cognitive science.  That coauthor ultimately decided not to appear on ARCHES at all, and in retrospect I probably should have added back this material.

Just to give you a little historical context on why it was so hard to agree on content like this: getting academics to reach consensus on any acknowledgement of AI x-risk back then was like pulling teeth, even as late as 2018, because of fear of looking weird. This sort of struggle was a major reason I eventually decided to reduce my academic appointment to ~1 day/week in 2022. But since then, the CAIS statement on AI extinction risk came out in May 2023, and now at least some more people are taking the problem seriously. So I figured I'd share this, better late than never.


Notes on definitions

"Prepotent AI" is defined as, roughly speaking, uncontrollably transformative AI, i.e., transformative AI that humans are collectively unable to stop from transforming the world once it is deployed, with effects at a scale at least as significant as the industrial revolution.

"MPAI" is short for "misaligned prepotent AI", which roughly meant, "AI that is misaligned with the objective of humanity's collective survival and flourishing".  In other words, it's AI that will unstoppably kill all humans after it's deployed.

The excerpt below only discusses risk types "1b" and "1c".  For discussion of other risk types, see the full paper at https://acritch.com/arches


Risk Type 1b: Unrecognized prepotence

Consider a scenario in which AI researchers deploy an AI system that they do not realize is prepotent. Because their understanding of the system is faulty, their assessment of the system's alignment would also be in question. That is to say, a prepotent AI system whose prepotence was not recognized by its developers is highly likely to be misaligned as well.

The most obvious way in which the prepotence of an AI system might go unrecognized is if no well-developed scientific discipline for assessing or avoiding prepotence exists at its deployment time. In that case, developers would be forced to rely on intuitive assessments of their empirical findings, and a number of systematic biases might then play a role in a failure to intuitively recognize prepotence:

• Illusion of control is a human bias defined as “an expectancy of a personal success probability inappropriately higher than the objective probability would warrant” [Langer, 1975]. Since 1975, a rich literature has examined the presence of illusory control in a variety of circumstances; see Presson and Benassi [1996] for a meta-analysis of 53 experiments from 29 articles. Since prepotence of an AI system by definition constitutes a loss of human control, the illusion of control effect is extremely pertinent.

CITATIONS:

• Scope insensitivity is the tendency for intuitive human judgments to be insensitive to orders of magnitude, which has been observed in numerous studies on intuitive valuations of public goods [Frederick and Fischhoff, 1998] [Carson, 1997] [Veisten et al., 2004] [Hsee and Rottenstreich, 2004]. The prevention of global catastrophes is by any account a public good. Thus, absent a technically rigorous approach to evaluating both the impact and the likelihood of prepotent and/or MPAI systems, these risks are liable to be either overestimated or underestimated, with cases of underestimation leading to greater x-risks.

Another way the prepotence of an AI system could go unrecognized is if the system is able to develop and execute a plan for deceiving its developers. That is to say, certain forms of prepotent AI systems might be capable of “tricking” a human engineer into thinking they are not prepotent. This narrative requires the AI system to be endowed with a model of its own development process, as well as social acumen. Such endowments could arise by design, or from a selection process that favors them.

CITATIONS: 

Risk Type 1c: Unrecognized misalignment

Supposing some research team develops a prepotent AI system that they realize or suspect is prepotent, there is some risk that the team might mistakenly overestimate the system’s alignment with the preservation of human existence.

Absent a well-developed scientific field of intelligence alignment, a development team might be tempted to fall back on intuitive interpretations of experimental results to assess the safety of their system. Reliance on intuitive judgements about alignment could render developers susceptible to any number of long-observed and highly robust cognitive biases, including:

• Escalation of commitment, also known as the “sunk cost fallacy”, is the tendency of groups and individuals to escalate commitment to a failing course of action. It has been examined by a rich literature of management and psychological studies; see Staw [1981] and Brockner [1992] for literature reviews. Unfortunately, software development teams in particular appear to be prone to escalating commitment to failed approaches [Keil et al., 2000]. This is extremely pertinent to the risk that an AI development team, having developed a system that is prepotent but difficult to align, might continue to pursue the same design instead of pivoting to a safer approach.

CITATIONS: 

• The mere-exposure effect is the psychological tendency for continued exposure to a stimulus to cause an increase in positive affect toward that stimulus. This effect was found to be “robust and reliable” by a meta-analysis of 208 independent experiments between 1968 and 1987 [Bornstein, 1989]. If the developers of a powerful AI system are susceptible to the mere-exposure effect, their exposure to the system throughout its development process could bias them toward believing it is safe for deployment.

CITATIONS: 

• Optimism bias is the general tendency for people to believe that they are less at risk than their peers for many negative events, such as getting cancer, becoming alcoholics, getting divorced, or getting injured in a car accident. For meta-analyses of the phenomenon, see Klein and Helweg-Larsen [2002] and Brewer et al. [2007]; for a neurological account, see Sharot et al. [2007]. Since an MPAI deployment event is one that would lead to the death of the individuals deploying the system, it lands squarely in the domain of optimism bias: AI developers could be either overly optimistic about the safety of their systems relative to other AI systems, or believe that while their system will pose risks to some humans it would never have effects drastic enough to affect them.

(It is interesting to note that overestimation of technological risks presents an entirely different problem for humanity: a failure to capitalize on the benefits of those technologies. However, the overestimation of risks does not currently appear to be a limiting factor in AI development, perhaps because AI researchers are currently well funded and generally optimistic about the benefits of their work. It is for this reason that we have chosen to author an agenda focused on downside risk.)

CITATIONS: 

(END OF EXCERPT)


I hope you enjoyed reading this blast from the past! I believe it's still accurate, and perhaps folks will find it a bit more relevant now.



