p(s-risks to contemporary humans)?

Published on February 8, 2025 9:19 PM GMT

Epistemic status

These are my cursory thoughts on this topic after having read about it over a few days and conversed with some other people. I still have high uncertainty, and I am raising questions whose answers may resolve some of that uncertainty.

Content warning

Discussion of risks of astronomical suffering

Why focus on s-risks to contemporary humans?

Most discussions of suffering risks from artificial superintelligence focus on future minds or non-human beings. These concerns are important. However, might an ASI also inflict severe suffering on humans who exist when it takes over or on simulations of their minds?

If this specific category of s-risks is significant, I think that talking about it may encourage more people to care about s-risks even if they do not believe in the moral foundations of longtermism. Rightly or wrongly, many people value the well-being of themselves and people they know in the present over that of other beings. If we can show that s-risks could affect contemporary humans, that can help build broader support for avoiding these risks. 

Some of the questions I raise here also apply to estimating s-risk probabilities more generally.

Summary

Placing a probability on s-risks seems difficult given our limited understanding of what an ASI's goals would be. However, some factors may raise this risk above mere chance, including the path-dependent nature of AI development, instrumental use of sentient simulations, spiteful inner objectives, near-miss scenarios, and misuse of intent-aligned AI. Even so, the overall probability remains uncertain.

Difficulty of assigning probability to human-preserving goals of misaligned ASI

Existing AI models often behave in unpredicted ways,[1] and if AI reaches human-level intelligence before we solve inner alignment, its inner objective will be uncontrollable. This unpredictability makes it hard to meaningfully estimate the probability that an unaligned ASI’s terminal goals will involve human minds (virtual or physical, suffering or happy)[2] rather than being a “paperclip maximizer.” [3] A few arguments suggest that the consequence of misalignment is much more likely to be “mere extinction” than s-risks affecting contemporary humans:

However, even if the probability of s-risk is low compared to x-risk, it may still be worth worrying about, given the astronomically worse stakes.[7] While it is plausible that the risk is too low to worry about even given these stakes,[8] it is also plausible that the risk is significant, both because of the uncertainty about what prior to assign and because there are reasons to think the probability is higher than it would seem a priori.
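
To make the stakes-versus-probability tradeoff concrete, here is a toy expected-value calculation; the figures are placeholders drawn from the footnotes, not estimates I endorse. Using Bostrom's ~10^58 simulated century-lives as the scale of the far future:

E[disvalue] ≈ p(s-risk) × 10^58 century-lives × d,

where d is the average disvalue per affected century-life. With a Solomonoff-style prior of p < 10^-60 (fn 8), this product can come out small even at that scale, whereas a prior of, say, 10^-6 makes it astronomical; the conclusion hinges almost entirely on which prior one accepts.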

The goals of a misaligned AI may not be purely random. When current AI systems fail, they typically optimize for proxies that correlate with their training objectives rather than completely unrelated goals. This matters because most AI systems train extensively on human-related tasks and data. Even a misaligned AI might develop goals that involve humans or human models, but not necessarily in ways we want. However, it is unclear how likely this is to involve conscious humans rather than other ways of maximizing reward. 

Questions:

Instrumental simulations

Even an AI with a terminal goal of creating paperclips may instantiate suffering for instrumental reasons.[9] 

For example, Nick Bostrom suggests that an ASI may run conscious simulations of human minds in order to understand our psychology.[10] Sotala and Gloor (2017) describe other scenarios in which suffering simulations may come about, including "The superintelligence comes into conflict with another actor that disvalues suffering, so the superintelligence instantiates large numbers of suffering minds as a way of extorting the other entity."

How might contemporary humans be affected by this? A question that I haven’t seen addressed is how the AI would obtain enough data on humans or other sentient creatures to create simulations of them. Perhaps it would have to experiment on physical humans first. Or, even if it can obtain this information another way, would these simulations contain "new" minds or copies of particular existing human minds?

Significance

This kind of suffering may be smaller in scope than other risks because the agent faces opportunity costs. For example, Bostrom suggests that conscious simulations would eventually be discarded “once their informational usefulness has been exhausted.” However, Gloor (2016) writes that “an AI with goals that are easy to fulfill – e.g. a ‘paperclip protector’ that only cares about protecting a single paperclip, or an AI that only cares about its own reward signal – would have much greater room pursuing instrumentally valuable computations.” 

The agent would also not necessarily maximize the level of suffering; it would only create as much as is necessary to achieve its goals.

However, instrumental suffering is perhaps more likely than other kinds of s-risk because it can happen across a wide range of terminal goals (not just those that involve sentient minds for terminal reasons), provided that it is correct that creating sentient simulations is a convergent instrumental goal. 

Questions

Spite

Might an ASI be more likely than pure chance to develop a terminal goal of human suffering? 

Macé et al. (2023) suggest some reasons why spite may naturally arise in AI systems. For example, AI may learn a spiteful objective if humans demonstrate a similar objective in its training data, or a spiteful objective may be a convergent instrumental strategy.

Given that AIs are created by humans and trained on human data, we shouldn’t assume a priori that an ASI exhibiting anthropomorphic behavior is no more likely than other sections of the possibility space, even if the probability remains small. Since AIs are trained to mimic humans, humans may serve as a rough reference class when predicting AI behavior, although it is hard to draw strong anthropomorphic parallels given the significant differences.

Some have argued that a sentient AI could take revenge. However, an AI need not actually be sentient to develop such a goal; it only needs to mimic vengeful behavior patterns. If such objectives develop, the AI may target contemporary humans specifically.

Significance 

The scale and severity of suffering from an AI with a spiteful objective could be high compared to instrumental suffering, since the agent would have an intrinsic goal of causing suffering. It’s unclear how likely this is, since I haven’t seen much discussion of it.

Questions

Alignment as narrowing possibility space

Even if the fraction of AI futures that involve humans in any form is small, alignment efforts could succeed in focusing the probability on this section. However, a large part of this fraction may involve high levels of suffering. Thus, alignment efforts may decrease x-risk at the expense of increasing s-risk.
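
One way to state this concern explicitly (a toy decomposition with schematic events, not anyone's actual model):

p(s-risk) = p(future contains human minds) × p(severe suffering | future contains human minds).

Alignment work plausibly raises the first factor, since it steers outcomes toward futures that contain humans in some form; if partial successes leave the second factor high, the net effect can be to reduce x-risk while increasing s-risk.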

Questions:

Near-miss scenarios

"Near-miss" scenarios may lead to severe suffering if an error causes an AI to maximize the opposite of an accurately specified human value function[13] or if it incorrectly actualizes important parts of human values.[14] 

Regarding the latter scenario, one may argue that avoiding suffering is such a basic human value that an AI that is aligned with even a rough picture of human values would understand this. But learning what experiences constitute suffering may not be straightforward, and not all value systems consider suffering as categorically bad. This ambiguity may lead to a significant level of unnecessary suffering in some of the futures with “semi-aligned” ASI. 

Such suffering may also occur if humans intentionally teach an AI a goal but do not correctly understand the consequences.[15] 

Significance

Some of these scenarios maximize the level of suffering, whereas others produce it only incidentally. In either case, the extent and duration of suffering may be high. Where the suffering is part of the ASI’s terminal goal, the AI would continue creating it indefinitely in order to maximize its expected utility.[16]

Misuse

Finally, even if humans successfully create an intent-aligned ASI, some people might misuse it in ways that intentionally or instrumentally create suffering.[17] 

Some risks of misuse could fall on contemporary humans, such as from a sadistic dictator, although others would fall on other beings. 

It may be more common for humans to create suffering as a byproduct rather than to intentionally maximize it. However, DiGiovanni (2023) notes:

Malevolent traits known as the Dark Tetrad—Machiavellianism, narcissism, psychopathy, and sadism—have been found to correlate with each other (Althaus and Baumann 2020; Paulhus 2014; Buckels et al. 2013; Moshagen et al. 2018). This suggests that individuals who want to increase suffering may be disproportionately effective at social manipulation and inclined to seek power. If such actors established stable rule, they would be able to cause the suffering they desire indefinitely into the future.

Another possibility is that of s-risks via retribution. While a preference for increasing suffering indiscriminately is rare among humans, people commonly have the intuition that those who violate fundamental moral or social norms deserve to suffer, beyond the degree necessary for rehabilitation or deterrence (Moore 1997). Retributive sentiments could be amplified by hostility to one’s “outgroup,” an aspect of human psychology that is deeply ingrained and may not be easily removed (Lickel 2006). To the extent that pro-retribution sentiments are apparent throughout history (Pinker 2012, Ch. 8), values in favor of causing suffering to transgressors might not be mere contingent “flukes.”

  1. ^
  2. ^

    Credit to Tariq Ali for this point.

  3. ^

    Bostrom, Nick (2014) Superintelligence: Paths, Dangers, Strategies, p. 150

  4. ^

    We can’t say “it could involve humans or it could not involve humans, so it might be something like 50:50”; this is an anthropocentric and arbitrary way to divide the probability.

  5. ^

    Bostrom (2014) wrote that “because a meaningless reductionistic goal is easier for humans to code and easier for an AI to learn, it is just the kind of goal that a programmer would install in his seed AI if his focus is on taking the quickest path to ‘getting the AI to work’” (p. 129). However, this was 10 years ago, and I’m not sure if this stands up in the context of modern alignment techniques.

  6. ^

    However, there may be reasons to doubt this consensus.

  7. ^

    Bostrom (2014) estimates 10^58 simulated centuries of human life could exist over the course of the far future (p. 123). See also Fenwick (2023).

  8. ^

    Considering that simulating sentient minds may be quite complex, if the Kolmogorov complexity of this scenario is > 200 bits (this seems like an underestimate), this gives a Solomonoff prior < 10^-60, which could be low enough to make the expected disvalue small. One may argue that an ASI would itself be a simulated mind, but there would still be additional complexity involved in figuring out how to emulate the minds of particular beings, know whether they are conscious or not, etc., depending on the specific scenario.
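
    (Spelling out the arithmetic behind this: a hypothesis with Kolmogorov complexity K bits gets a Solomonoff prior of roughly 2^-K, so K > 200 bits gives a prior below 2^-200 ≈ 6 × 10^-61 < 10^-60.)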

  9. ^

    Note that the Solomonoff prior for sentient simulations would also be low for simulations created for instrumental reasons, but if there are strong reasons to believe these simulations would be a convergent instrumental goal, one could update this prior upward. However, given that this reasoning is inevitably speculative, it can only shift the prior slightly if one starts with a very low prior.

  10. ^

    Bostrom (2014), pp. 153-4. See also Sotala and Gloor (2017), section 5.2.

  11. ^

    See here.

  12. ^

    It seems to me that if the AI is not intentionally emulating specific humans, and if it can emulate a generic sentient mind without emulating specific humans, it would be unlikely to create particular human minds by coincidence. Taking the estimate of 10^60 subjective years from above (fn 7), it would seem that the AI could not simulate any particular mind for very long if it wanted to go through all possible minds, as the number of all possible minds is probably much larger than this, and may even be computationally intractable. (ChatGPT gives a rough estimate of 2^(10^11) possible minds, for what it’s worth.)
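
    (To spell out the arithmetic: 2^(10^11) ≈ 10^(3×10^10) possible minds, so spreading 10^60 subjective years across all of them would leave each mind on the order of 10^(60 − 3×10^10) years, a vanishingly small duration, assuming that rough count is anywhere near right.)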

  13. ^

    Daniel Kokotajlo has informally estimated the probability of this at 1/30,000 ± 1 order of magnitude (which seems concerningly high given the scale of harm!) for what it’s worth.

  14. ^

    See also Sotala and Gloor (2017) section 5.3

  15. ^

    See, e.g. Ansell (2023) and here.

  16. ^

    See Bostrom (2014), p. 152: “[T]he AI, if reasonable, never assigns exactly zero probability to it having failed to achieve its goal; therefore the expected utility of continuing activity (e.g. by counting and recounting the paperclips) is greater than the expected utility of halting.”

  17. ^

    See, e.g., DiGiovanni (2023), section 3.


