To recap:
Tiling agents are logically more powerful than their successor agents. [This is how they can verify their successors even though a first-order theory can't prove its own soundness.]
The 5-and-10 problem:
"I have to decide between $5 and $10. Suppose I decide to choose $5. I know that I'm a money-optimizer, so if I do this, $5 must be more money than $10, so this alternative is better. Therefore, I should choose $5."
[End recap]
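For concreteness, here is a minimal sketch of how the formal version of the 5-and-10 problem in the recap usually gets set up [my notation; exact presentations vary, and real versions use bounded proof search rather than a bare provability check]. The universe pays out according to the action taken, and the agent acts on whatever implication it can prove:

$$U() = \begin{cases} 5 & \text{if } A() = 5 \\ 10 & \text{if } A() = 10 \end{cases} \qquad A() = \begin{cases} 5 & \text{if } \mathrm{PA} \vdash \big(A() = 5 \rightarrow U() = 5\big) \wedge \big(A() = 10 \rightarrow U() = 0\big) \\ 10 & \text{otherwise} \end{cases}$$

By Löb's theorem the "spurious" conjunction in the first case really is provable [if it were provable, the agent would take the $5, which makes the first conjunct true and the second vacuously true], so the agent walks away with the $5.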
Maybe this is dense, but I've never heard a plausible-sounding story about how the 5-and-10 problem [or analogous problems involving reflective reasoning about "logical counterfactuals"] is supposed to become a problem that, on reflection, doesn't look to me like it could be solved by augmenting the reasoning quoted above with a subjective gut values check to ground all incoming logical proofs of what I choose.
As a limiting case of even the formal version of the problem: even if I am reasoning about a model of an exact copy of myself having already chosen the $5, that model will have to be somewhat simplified anyway, since I am not smaller than myself. In the same way that theoretical tiling agents verify that their [smarter] successors' utility functions maintain certain invariants by making the successors logically weaker, I should just as well be able to make my model of myself-as-having-picked-the-$5 [which is of the same intelligence as I am] logically weaker than myself, and verify whether its utility function maintains the same invariants.
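To gesture at the shape of that in code [purely a toy sketch: "logical strength" is proxied by an integer budget, and the names Agent, weaker_self_model, and values_check are mine for illustration, not from the tiling-agents papers]:

```python
# Toy sketch only: "logical strength" is crudely proxied by an integer
# budget, and the decision is the trivial two-action 5-and-10 case.
from dataclasses import dataclass, field, replace

PAYOFFS = {"take_5": 5, "take_10": 10}

@dataclass(frozen=True)
class Agent:
    strength: int                  # stand-in for logical strength
    weights: dict = field(default_factory=lambda: {"take_5": 1.0, "take_10": 1.0})

    def utility(self, action: str) -> float:
        return self.weights[action] * PAYOFFS[action]

    def weaker_self_model(self) -> "Agent":
        # The temporary self-model is strictly weaker than me, mirroring
        # how a tiling agent only verifies logically weaker successors.
        assert self.strength > 0, "nothing left to spend on a self-model"
        return replace(self, strength=self.strength - 1)

    def values_check(self, model: "Agent") -> bool:
        # Base-case verification: the model is genuinely weaker, and its
        # utility function is literally my own (the invariant to maintain).
        return model.strength < self.strength and model.weights == self.weights

    def decide(self) -> str:
        model = self.weaker_self_model()   # built per decision, then discarded
        if not self.values_check(model):
            raise RuntimeError("temporary self-model failed the values check")
        # Have the verified weaker model evaluate the actions directly,
        # instead of me reasoning about "what it would mean if I chose $5".
        return max(PAYOFFS, key=model.utility)

print(Agent(strength=3).decide())          # -> take_10
```

All the sketch is meant to show is the shape of the check: one strictly weaker, disposable model per decision, verified against my actual utility function before its verdict about the actions is trusted.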
Tiling agents as currently imagined are computationally intractable for the obvious reason that if you want a chain of N successors all maintaining the invariants, the zeroth agent in the chain has to have ~N degrees of logical power [depending on how you count].
Verifying a temporary self-model that is used to make a decision and then discarded, on the other hand, is just the "base case" of the tiling-agent idea: you only need one level of logical remove on the "stack".
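To cash out "degrees of logical power" one standard way [again my formulation, using iterated consistency statements as a stand-in for whatever soundness-flavored trust the chain actually needs]: a chain of N trust steps can take

$$T_N = \mathrm{PA}, \qquad T_n = T_{n+1} + \mathrm{Con}(T_{n+1}) \quad \text{for } 0 \le n < N,$$

which leaves the initial agent sitting roughly N consistency statements above PA, whereas the base case of a single verified-and-then-discarded self-model only ever needs the two-theory fragment $T_0 = \mathrm{PA} + \mathrm{Con}(\mathrm{PA})$ and $T_1 = \mathrm{PA}$.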
The foregoing pseudo-formalism is one thing that could bear load in one possible type of values check involving reflection. I have no idea what else you would need to implement a reflective values check, and I certainly don't consider myself to have solved how to actually implement the base case of tiling agents in practical, best-practice terms! But in any case, unless I'm missing something, 5-and-10 doesn't seem any harder than the very base case of Tiling Agents.
Benja Fallenstein seems to have noted in a 2014 paper that Tiling Agents beats 5-and-10:
However, there is an important difference, which makes Slepnev’s system vulnerable to what is called the “5-and-10 problem”, whereas Yudkowsky and Herreshoff’s system is not. Essentially, this is because the latter system only considers what happens if the action is taken that the agent actually ends up taking, whereas Slepnev’s system also considers what would have happened if the agent had taken a different action.
[emphases mine]
I see no reason this could not be generalized, using temporary self-models at every decision-junction where 5-and-10 would otherwise become a problem.
However, I could still be missing something.
Am I missing something?