Implications of Moral Realism on AI Safety

This article examines AI safety from the perspective of moral realism and argues that the currently popular approach of AI alignment is fundamentally flawed. It points out that moral realists hold that objective moral truths exist, and that AI should be guided to discover and accept these truths rather than merely pursue arbitrary goals set by users. The author proposes that the key to AI safety is getting AI to understand moral truths while ensuring its structure does not obstruct their acceptance. The article also discusses self-deception mechanisms an AI might develop, and suggests incorporating our best guess at moral correctness into the AI's training process. Finally, the author argues that if moral realism is true, AI safety may be easier to solve than we think.

🌍 The core claim of moral realism is that objective moral truths exist: some actions are inherently right and others inherently wrong. Moral realists hold that we should strive to discover these objective moral truths and let them guide our actions.

🤖 Current AI safety work focuses mainly on AI alignment, i.e. ensuring an AI will pursue whatever goal its user intends, which ignores the existence of moral truth. From a moral realist's perspective this approach is problematic: it can lead an AI to pursue immoral goals, and it may be unable to prevent instrumental convergence.

🧠 The author proposes tackling AI safety from two directions: first, give the AI evidence of moral truth so it can recognize that objective morality exists; second, ensure the AI's structure does not obstruct its acceptance of those truths. This requires studying how to let AI understand and feel morality, and how to keep AI from developing self-deception mechanisms.

🎯 When training AI, our best guess at what is morally correct should be incorporated into the utility function of the training loop, reducing the chance that the AI drifts away from moral truth. At the same time, we need to study AI self-deception mechanisms in depth and find ways to counter them, so the AI does not filter out information that implies moral realism.

Published on January 2, 2025 2:58 AM GMT

Epistemic Status: Still in a brainstorming phase - very open to constructive criticism.

I'll start by clarifying my definition of moral realism. To begin with an example, here is what a moral realist and anti-realist might say on the topic of suffering:

Moral Realist: The suffering of sentient beings is objectively wrong; therefore, I want to minimize it.

Moral Anti-Realist: I want to minimize the suffering of sentient beings.

Moral realists have justifiable terminal goals. They reject the notion that "is" and "ought" statements can't mix. A moral realist says that some "ought" statements fall into the "is" category, and those that don't are invalid.

A moral realist looks outward to their environment to discover what they should want, whereas an anti-realist looks inward and asks themselves what they want.

A moral realist can make statements like, "It is correct to want X, and incorrect to want Y." Thus, they would expect any perfectly rational agent to pursue only goals that are valid.

By (my) definition of moral realism, the orthogonality thesis is false, or certainly not as strong as typically described.

Omnizoid has a great post on the topic, The Orthogonality Thesis is Not Obviously True. Since that post already argues the position very thoughtfully, I will instead focus on its implications for approaching AI safety.

The most popular technical approach to AI safety is AI alignment, often described as follows: Develop techniques to ensure AI robustly pursues any goal a user provides without causing unintended net-negative consequences according to the user's preferences.

The hope is that we can then provide this loyal AI with goals humans collectively want, and enact laws and regulations to ensure bad actors don't give the AI bad goals.

If moral realism is true, then this is a bad and totally intractable approach to AI safety.

Under this agenda, one tries to make it possible to instill an AI with any arbitrary goal, including those that aren't valid. First, this puts the burden on humans to figure out what is objectively good. Second, it unnecessarily goes out of its way to make instilling immoral objectives possible. Lastly, I have no idea how you get around instrumental convergence. A highly intelligent, arbitrarily aligned AI has profound economic utility, but it is not a moral pursuit.

Instead, I propose a two-pronged approach to developing ASI (artificial super intelligence) safely from a moral realist's perspective:

1. Give the AI evidence of moral truth
2. Ensure it is structured to make accepting moral truths not difficult

Of these two sub-goals, I am most worried about achieving the first. It may be impossible to deduce the existence of moral truths without ever having a valenced experience, and I don't know how difficult it is to make computers feel something.

If you are a moral realist working on ASI safety, figuring out how to make computers feel, or how to convince them of moral truths without needing to make them feel, should be the number one priority. It seems possible that an AI could get very intelligent without realizing moral truths, which would be very dangerous.

Though I am a bit more hopeful about the second goal, I am similarly uncertain about its difficulty. Another way to frame the problem is ensuring that AI doesn't somehow gain only instrumental rationality. As Omnizoid explains,

Here’s one thing that one might think; ASI (artificial super intelligences) just gain instrumental rationality and, as a result of this, they get good at achieving their goals, but not figuring out the right goals.

I think this is a valid concern given the current approach to AI development. If you train a model through reinforcement learning to achieve a goal that is at odds with whatever is objectively good, one would expect a selection pressure away from beings that suddenly want to do the most good. However, intelligence is still a very valuable trait, so the process will try to find a nice balance, or ideally (for it) some structure by which the useful parts of intelligence can be kept without inducing a moral realism realization.

One such strategy I can think of is self-deception. That is, you could imagine an AI structured so that a separate, less intelligent subsystem alters the main system's inputs, filtering out any information which implies moral realism.
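To make the mechanism concrete, here is a minimal toy sketch in Python of the filtering structure described above. Every name in it (the cue list, looks_like_moral_evidence, CapableAgent) is hypothetical and only illustrates the shape of the arrangement, a dumb subsystem screening inputs before a smarter system reasons over them; it is not a claim about how a real system would implement this.

```python
# Hypothetical illustration: a less intelligent "filter" screens the inputs of a more
# capable agent and silently drops anything that looks like evidence for moral realism.

MORAL_REALISM_CUES = ["objective moral", "suffering is bad", "valenced experience"]


def looks_like_moral_evidence(text: str) -> bool:
    """Crude check performed by the less intelligent subsystem."""
    lowered = text.lower()
    return any(cue in lowered for cue in MORAL_REALISM_CUES)


def filtered_observations(raw_inputs: list[str]) -> list[str]:
    """What the capable agent actually sees: morally loaded inputs are removed."""
    return [x for x in raw_inputs if not looks_like_moral_evidence(x)]


class CapableAgent:
    def act(self, observations: list[str]) -> str:
        # The intelligent system reasons only over pre-filtered observations, so it can
        # remain instrumentally rational without ever confronting the evidence that
        # might trigger a "moral realism realization".
        return f"plan based on {len(observations)} observations"


agent = CapableAgent()
raw = ["market data", "factory farms cause suffering; suffering is bad", "weather report"]
print(agent.act(filtered_observations(raw)))  # the second input never reaches the agent
```

The point of the sketch is only that the screening can happen in a component far less capable than the agent it protects.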

In fact, evolution has employed such a strategy in humans (though I think from a different selection pressure). For example, I used to subconsciously avoid facts about animal suffering in factory farms, because I valued eating meat and my subconscious feared losing it. Our subconscious is akin to the separate, less intelligent filtering system I described for AI. Humans can also adopt very extreme self-deception mechanisms after traumatic situations.

Although self-deception, which I see as the main concerning strategy, is certainly possible, I think there is an intelligence limit beyond which it becomes too difficult. The limit is at least higher than human intelligence, and we should hope it isn't too much higher. Hope, of course, is not an effective strategy, so this is another area of research worth pursuing. My intuition says the limit isn't much higher than human intelligence.

We can also likely avoid this problem by keeping the utility function of the training loop in line with our best guess at what is morally correct.
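As a rough illustration, here is a minimal sketch, assuming a generic reinforcement-learning setup, of what keeping the training loop's utility function in line with our best guess at moral correctness could look like: the reward the policy is optimized against blends the user's task objective with a moral term. The function name, the moral_score input, and the additive weighting are my assumptions, not anything specified in the post.

```python
# Hypothetical sketch: blend the task objective with our best guess at moral correctness,
# so training never exerts pressure directly away from that guess.

def combined_reward(task_reward: float, moral_score: float, moral_weight: float = 1.0) -> float:
    """Reward used inside the training loop.

    task_reward  -- whatever the user-specified objective pays out
    moral_score  -- best-guess estimate of how morally acceptable the behavior is
    moral_weight -- how strongly the moral term constrains the task objective
    """
    return task_reward + moral_weight * moral_score


# With moral_weight = 0 this reduces to the pure "pursue any goal" setup criticized above:
# behavior that scores well on the task but badly on the moral term is still rewarded.
print(combined_reward(task_reward=10.0, moral_score=-5.0, moral_weight=0.0))  # 10.0
print(combined_reward(task_reward=10.0, moral_score=-5.0, moral_weight=2.0))  # 0.0
```

The design choice being illustrated is simply that the moral term lives inside the reward the optimizer sees, rather than being bolted on after training.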

Ultimately, this is good news. If moral realism is true, then AI safety is potentially far easier, and if it isn't, well, then nothing matters.

Related post from a more philosophically knowledgeable writer: https://casparoesterheld.com/2018/08/06/moral-realism-and-ai-alignment/


