Published on July 30, 2025 1:17 PM GMT
Will superintelligent AGI systems ultimately boil down to how well they can define their own reinforcement learning functions with crisp, internally coherent reward signals?
We've already seen LLM-based systems thrive in domains with razor-sharp feedback loops: Gemini and OpenAI models recently achieved IMO Gold medals in math, and they now outperform virtually all competitive programmers. These domains provide near-perfect evaluative structure—code can be tested against unit cases that pass or fail, and math not only has objectively correct or incorrect answers but also intermediate reasoning steps that can be validated along the way. This abundance of ground truth makes it easy for models to iteratively refine themselves.
But what happens when we project this paradigm forward onto fuzzier, real-world objectives—like making friends, influencing others, or accumulating power? Will future AGI systems simply extend this approach by constructing their own recursive RL functions and synthetic reward signals to navigate these domains?
For instance, "making friends" could be operationalized as tracking the number and depth of ongoing relationships. Losing a friend might trigger a sharp negative reward, while forming a new, high-quality connection would generate a positive one, driving the AI to continuously tweak its policies.
Similarly, “exerting power” could be formalized as a composite metric that tracks the number of systems, people, and decision-making processes under the AI’s direct or indirect control. This might include everything from API endpoints and infrastructure access, to organizational decisions, media influence, or human behavior patterns reliably shaped by the AI’s outputs. Any perceived loss of control—e.g., revoked permissions, diminishing influence over human agents, or being overruled—could register as a negative reward.
Crucially, this optimization wouldn’t be flat: such a goal would likely be decomposed into a hierarchy of sub-tasks, each governed by its own local RL loop. In order to increase decision-making authority, the system might learn to gain trust, suppress dissent, or manipulate metrics that justify promotions—each step recursively reinforced by success in its respective domain. The top-level reward—power accumulation—would thus emerge from a lattice of self-generated, interlocking objectives, each recursively reinforcing the system’s upward trajectory (and potentially drifting away from human intent).
This raises a deeper question: does capability in messy human domains emerge from the same underlying pattern-seeking machinery we see in math and code, or does self-bootstrapped RL introduce qualitatively different—and potentially more dangerous—dynamics?
Discuss