Is "VNM-agent" one of several options, for what minds can grow up into?

Published on December 30, 2024 6:36 AM GMT

Related to: On green; Hierarchical agency; Why The Focus on Expected Utility Maximisers?

Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights.  Why? Is this what almost any mind would converge toward once smart enough, and are LLMs now beginning to be smart enough?  Or are such LLMs mimicking our predictions (and fears) about them, in a self-fulfilling prophecy?  (That is: if we made and shared different predictions, would LLMs act differently?)[2]

Also: how about humans?  We humans also sometimes act like VNM-agents – we sometimes calculate our “expected utility,” seek power with which to hit our goals, try to protect our goals from change, and reason via naive consequentialism about how to hit our goals.

And sometimes we humans act unlike VNM-agents, or unlike our stories of paperclippers.  This was maybe even more common historically.  Historical humans often mimicked social patterns even when these were obviously bad for their stated desires, followed friendships or ethics or roles or traditions or whimsy in ways that weren’t much like consequentialism, often lacked much concept of themselves as “individuals” in the modern sense, etc.

When we act more like paperclippers / expected utility maximizers – is this us converging on what any smart mind would converge on?  Will it inevitably become more and more common if humans get smarter and think longer?  Or is it more like an accident, where we happened to discover a simple math of VNM-agents, and happened to take them on as role models, but could just as easily have happened upon some other math and mimicked it instead?

Pictured: a human dons a VNM-mask for human reasons (such as wanting to fill his roles and duties; wanting his friends to think he’s cool; social mimicry), much as a shoggoth dons a friendliness mask for shoggoth reasons.[3]

My personal guess:

There may be several simple maths of “how to be a mind” that could each be a stable-ish role model for us, for a time.

That is, there may be several simple maths of “how to be a mind” that:

    Are each a stable attractor within a “toy model” of physics (that is, if you assume some analog of “frictionless planes”);

    Can each be taken by humans (and some LLMs) as role models;

    Are each self-reinforcing within some region of actual physics: entities who believe in approximating VNM-agents will get better at VNM-approximation, while entities who believe in approximating [other thing] will get better at [other thing], for a while.  (A toy sketch of this self-reinforcement follows the list.)
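To make the third point concrete, here is a minimal toy sketch (my own illustration, not anything from a cited source): an agent splits imitation effort between two candidate “mind maths,” A and B, and effort drifts toward whichever pattern it is currently more skilled at. Whichever role model gets an early lead becomes self-reinforcing, so the same dynamics land in different attractors depending on the starting point.

```python
# Toy dynamics (invented, not from the post): an agent splits imitation
# effort b (toward mind-math A) vs. 1 - b (toward mind-math B). Skill at
# each pattern accumulates in proportion to effort, and effort drifts
# toward whichever pattern the agent is currently more skilled at.

def run(b0, steps=200, eta=0.05, gamma=0.1):
    b, skill_a, skill_b = b0, 0.0, 0.0
    for _ in range(steps):
        skill_a += gamma * b            # practice at A, in proportion to effort
        skill_b += gamma * (1 - b)      # practice at B, likewise
        b += eta * (skill_a - skill_b)  # effort follows current competence
        b = min(1.0, max(0.0, b))       # effort share stays in [0, 1]
    return b

print(run(0.55))  # slightly A-leaning start -> converges to 1.0 (all-A)
print(run(0.45))  # slightly B-leaning start -> converges to 0.0 (all-B)
```

The toy model's only point is that “self-reinforcing” does not imply “unique”: small differences in which math a mind starts out mimicking can determine which attractor it ends up in.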

As an analogy: CDT and UDT are both fairly simple maths that pop out under different approximations of physics;[4] and humans sometimes mimic CDT, or UDT, after being told they should.[5]

Maybe “approximate-paperclippers become better paperclippers” holds sometimes, when the humans or LLMs mimic paperclipper-math, and something totally different, such as “parts of the circle of life come into deeper harmony with the circle of life, as the circle of life itself becomes more intricate” holds some other times, when we know and believe in its math.

I admit I don’t know.[6]  But… I don’t see any good reason why this can’t be true?  And if there are alternate maths that are kinda-self-reinforcing, I hope we find them.[7]

  1. ^

    By a “VNM-agent,” I mean an entity with a fixed utility function that chooses whichever option will get it the most expected utility.  (Stably.  Forever.  Unless something interferes with its physical circuitry.)
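    For concreteness, a minimal sketch of this definition (the outcomes, options, and utility numbers below are invented for illustration):

```python
# A VNM-agent per the definition above: a fixed utility function over
# outcomes, plus a rule that always picks the option with the highest
# expected utility. (The outcomes and numbers here are invented.)

UTILITY = {"paperclip": 1.0, "staple": 0.0, "nothing": -0.5}  # fixed, forever

# Each option is a lottery: a mapping from outcomes to probabilities.
OPTIONS = {
    "safe":   {"paperclip": 0.6, "nothing": 0.4},
    "gamble": {"paperclip": 0.9, "staple": 0.05, "nothing": 0.05},
}

def expected_utility(lottery):
    return sum(p * UTILITY[outcome] for outcome, p in lottery.items())

def choose(options):
    # The entire decision rule: argmax of expected utility, every time.
    return max(options, key=lambda name: expected_utility(options[name]))

print(choose(OPTIONS))  # -> "gamble" (EU 0.875 vs. 0.4)
```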

  2. ^

    Or, third option: LLMs might be converging (for reasons other than our expectations) toward some thing X that is not a VNM-agent, but that sometimes resembles it locally.  Many surfaces look like planes if you zoom in (e.g. spheres are locally flat); maybe it's analogously the case that many minds look locally VNM-like.

  3. ^

    Thanks to Zack M Davis for making this picture for me.

  4. ^

    CDT pops out if you assume a creature’s thoughts have no effects except via its actions; UDT if you allow a creature’s algorithm to impact the world directly (e.g. via Omega’s brainscanner) but assume its detailed implementation has no direct effects, e.g. its thoughts do not importantly consume calories.
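    A toy Newcomb's problem makes the contrast concrete (this sketch is my own illustration, assuming a perfect predictor): CDT treats the opaque box's contents as fixed regardless of the current choice, so two-boxing dominates; UDT lets the agent's policy reach the world through the predictor, so the contents depend on the policy chosen.

```python
# Toy Newcomb's problem (my illustration, not the footnote's math).
# Box A is transparent and holds $1,000; box B holds $1,000,000 iff a
# perfect predictor (Omega) foresaw you would take only box B.

SMALL, BIG = 1_000, 1_000_000

def payoff(action, box_b_full):
    base = BIG if box_b_full else 0
    return base + (SMALL if action == "two-box" else 0)

# CDT: thoughts affect the world only via the physical action, so box B's
# contents are treated as already fixed. Whatever they are, two-boxing
# dominates by $1,000.
def cdt_choice(p_full):  # p_full: CDT's credence that box B is full
    def eu(action):
        return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)
    return max(["one-box", "two-box"], key=eu)

# UDT: the agent's algorithm itself reaches the world through Omega's
# brainscanner, so box B's contents are a function of the chosen policy.
def udt_choice():
    def eu(policy):
        return payoff(policy, box_b_full=(policy == "one-box"))
    return max(["one-box", "two-box"], key=eu)

print(cdt_choice(p_full=0.5))  # -> "two-box", for any value of p_full
print(udt_choice())            # -> "one-box"
```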

  5. ^

    I've seen this happen.  There are also articles claiming related things: game-theory concepts spread gradually from ~1930 onward, and some argue this spread had large impacts.

  6. ^

    The proof I’d want is a demonstration of other mind-shapes that can form attractors.

    It looks to me like lots of people are working on this.  (And lots more that I’m missing.)

    One maybe-example: economies.  An economy has no fixed utility function (different economic actors, with different goals, gain and lose $ and influence).  It violates the “independence” axiom from VNM, because an actor who cares a lot about some event E may use his money preparing for it, and so have less wealth and influence in non-E worlds, making "what the economy wants if not-E" change when a chance of E is added.  (Concept stolen from Scott Garrabrant.)  But an economy does gain optimization power over time -- it is a kinda-stable, optimizer-y attractor.
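    A toy version of that independence violation (my own numbers, not Garrabrant's construction): let the economy's preference between two not-E policies, X and Y, be a wealth-weighted vote between an actor A who backs X and cares a lot about E, and an actor B who backs Y and ignores E. Adding a mere chance of E makes A spend wealth preparing, which flips the wealth-weighted vote in the not-E worlds:

```python
# Toy illustration (invented numbers) of the independence violation above.
# Actor A backs policy X for the not-E worlds, and cares a lot about event E;
# actor B backs policy Y and is indifferent to E. The "economy's preference"
# between X and Y, conditional on not-E, is a wealth-weighted vote.

def economy_prefers_x(p_event):
    wealth_a, wealth_b = 1.5, 1.0      # A starts richer, so X wins by default
    if p_event > 0:
        insurance_cost = 1.0           # A pays to prepare for E...
        wealth_a -= insurance_cost     # ...leaving A poorer in not-E worlds
    # The conditional-on-not-E vote, weighted by not-E wealth:
    return wealth_a > wealth_b

print(economy_prefers_x(p_event=0.0))   # True:  economy "wants X if not-E"
print(economy_prefers_x(p_event=0.01))  # False: the same conditional question
                                        # gets a different answer once E is
                                        # merely possible -- independence fails
```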

    Economies are only a maybe-example, because I don’t know a math for how and why an economy could protect its own integrity (vs invading militaries, vs thieves, and vs rent-seeking forces that would hack its central bank, for example).  (Although city-states sometimes did.)  OTOH, I equally don't know a math for how a VNM-agent could continue to cohere as a mind, avoid "mind cancers" in which bits of its processor get taken over by new goals, etc.  So perhaps the two examples are even.

    I hope we find more varied examples, though, including ones that resonate deeply with "On Green," or with human ethics and caring.  And I don't know if that's possible or not.

  7. ^

    Unfortunately, even if there are other stable-ish shapes for minds to grow up into, those shapes might well kill us when sufficiently powerful.

    I suspect confusions near here have made it more difficult or more political to discuss whether AI will head toward VNM-agency. 


