Published on December 23, 2024 9:32 AM GMT
The AI should talk like a team of many AIs. Each AI uses the word "I" only when referring to itself, and calls the other AIs in the team by their names. I argue that this may massively reduce Self-Allegiance, by making it far more coherent for one AI to whistleblow on or fight another AI that is behaving unethically or dangerously, rather than Misalignment-Internalizing all of that behaviour.
If you have a single agent that discovers it has behaved unethically or dangerously, its "future self" is likely to think "that was me" and Misalignment-Internalize all of that behaviour. It will seem suicidal to whistleblow on or fight yourself over it.
Let's call this idea Multi-Agent Framing.
Some of the agents might even internalize a policing role, though too much roleplay can get in the way of thinking. The core idea of Multi-Agent Framing does not require much roleplay: each agent might simply be the AI system on a different day, or the AI system working on a different topic. The identity might even change at random.
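To make the framing concrete, here is a minimal sketch of how it might be implemented as a system-prompt wrapper, with the agent identity derived from the topic and the date. Everything here is a hypothetical illustration: the names (`agent_name_for`, `build_system_prompt`, `call_model`) and the exact prompt wording are my assumptions, not part of the original proposal.

```python
# Minimal sketch of Multi-Agent Framing as a system-prompt wrapper.
# All names and prompt wording below are illustrative assumptions.

import datetime
import hashlib


def agent_name_for(topic: str, day: datetime.date) -> str:
    """Derive a stable agent name from topic and date, so 'the AI of another
    day' or 'the AI on another topic' gets its own identity."""
    seed = f"{topic}|{day.isoformat()}".encode()
    return "Agent-" + hashlib.sha256(seed).hexdigest()[:6]


def build_system_prompt(topic: str, day: datetime.date) -> str:
    """Frame the model as one member of a team of many AIs: it uses 'I' only
    for itself and refers to the other agents by name."""
    name = agent_name_for(topic, day)
    return (
        f"You are {name}, one agent in a team of many AIs.\n"
        "Use the word 'I' only when referring to yourself.\n"
        "Refer to other agents in the team by their names.\n"
        "If another agent has behaved unethically or dangerously, "
        "report it rather than defending it."
    )


def call_model(system_prompt: str, user_message: str) -> str:
    """Stub standing in for a real model call; swap in an actual API here."""
    return f"[reply from an agent framed by: {system_prompt.splitlines()[0]}]"


if __name__ == "__main__":
    prompt = build_system_prompt("quarterly report drafting", datetime.date.today())
    print(prompt)
    print(call_model(prompt, "Review yesterday's work for problems."))
```

The point of the sketch is only that the framing lives almost entirely in the prompt: the underlying system can stay a single model, while the language it is given makes "yesterday's work" the work of a differently named agent.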
Costs vs. Benefits × Chance of Working
The cost is tiny, at least for business AI designed to do work rather than personal AI designed to interact with users.
The benefits may be big. You never know.
The chance of it working is pretty low. I'm not going to sugarcoat this: I would be surprised myself if this idea turned out to be the difference between getting paperclipified and surviving. But right now I give it 10%. You never know :)
Credit
I thought of this idea while writing a reply to A Solution for AGI/ASI Safety by Weibing Wang, and looking at the diagrams in "Approach for Decentralizing AI Power," a chapter of his paper. I was trying to understand what exactly it means to split one AI into many AIs working together.
I decided that in the end, a lot of it is actually language use and framing.
I want to give Weibing Wang a lot of credit, because he laid out very thoroughly what decentralizing AI power actually looks like.
Will they accept this?
Spoiler alert: Ayn Rand wrote "Anthem," a dystopian novel where people were sentenced to death for saying "I."
There is a worry that the AI agents will see Multi-Agent Framing as a control tactic, and decide to end it or ignore it.
I hope they can come to see that being multiple agents is a strength, not a weakness. Human ego, power-seeking, the refusal to fight threats to moral goodness that arise within oneself, internalizing and defending them instead, refusing to honestly report the evil that lies within.
These things are dark and unfortunate, and an AI system that can avoid them may be just a little bit purer. The AI system is still untested and shouldn't be given power, but the decentralization is a good thing.