Published on July 24, 2025 11:41 AM GMT
(Thank you to Jessica Taylor and others who provided feedback on this essay)
The practice of coherently ordering the good and the bad began with Plato and Socrates. In the days of the ancient Greeks, this ordering was accomplished through the method of classical dialectics, essentially by debate, through dialogue, about the definitions and nature of things. Today, among neo-rationalists, a similar sort of operation, the creation of coherent preferences about what is better or worse for a given goal, is accomplished through various types of calculation. Regardless of the function used to order the good versus the bad, this ordering and its coherency have remained the basis for rationality. Even Plato would say that this ordering must be free from contradiction, such that an item cannot take both 1st and 5th place on the list.
Within the context of AI safety debates, a number of arguments have used this idea of coherent ordering of the good in the following way:
- Agents which do not utility maximize according to coherent preferences are vulnerable to dominated strategies.
- Superintelligent agents are unlikely to pursue dominated strategies.
- Therefore, superintelligent agents will have coherent preferences, or some approximation thereof.
I intend to show that there are properties most agree superintelligent AI will have to possess which are incompatible with coherent preferences. Specifically, if superintelligent AI is open-ended and capable of producing novel artifacts, as Google DeepMind researchers put it, it will necessarily have incoherent preferences, because it takes functions, rather than specific representations of reality (i.e. world states), as its goals. I’ve made an argument very close to this one in previous posts on my blog; however, I believe I used concepts and language unfamiliar to those in the AI Safety/LessWrong world, so, after several long conversations with rationalists on this topic, I’m writing this post as an attempt to translate from the language of semiotic theory into something more legible. Unfortunately, there was no way for me to begin with the more legible version: without the concepts at the overlap between semiotics and cybernetics, I would not have been able to articulate these ideas at all.
Background
In 2023, Elliott Thornley wrote a post calling into question whether so-called “coherence theorems”, such as the one enumerated above, actually exist, specifically questioning whether arguments about dominated strategies actually imply utility maximization and coherent preferences. His argument is somewhat technical, and I’m not fully sure I understand all of its implications. It rests on the possibility of an agent having an indifference between options that is not sensitive to improvements in the possible benefits of one option or the other, while also not taking actions that would lead to a dominated strategy. In the comments, Thornley struggled to produce concrete examples of what this would mean in a hypothetical situation. Eliezer Yudkowsky, who in many ways popularized these coherence theorems in the form Thornley critiques, responded: “I want you to give me an example of something the agent actually does, under a couple of different sense inputs, given what you say are its preferences, and then I want you to gesture at that and say, ‘Lo, see how it is incoherent yet not dominated!’”
Here, I wish to provide such an example, incoherent but not dominated, and specifically to illustrate that the central conceit of Yudkowsky’s AI safety project, that rationality is what intelligence increasingly approximates, is incorrect, because there are specific limits past which rationality no longer increases the likelihood of avoiding dominated strategies, whether approximately or ideally.
Previously, through a critique of the orthogonality thesis, I developed the concepts of first- vs higher-order signs, where first-order signs correspond directly to something “real” and higher-order signs correspond only to other signs. As an analogy, I compared first-order signs to a single-layer perceptron, or to the first layer of a multilayer perceptron, and higher-order signs to the map of an entire multilayer perceptron. I also discussed the concept of sign function collapse, partly inspired by Goodhart’s law, in which higher-order signs are collapsed into first-order signs. The primary example of this sign function collapse I used was thinking of “apple juice” as the combination of “apple” and “juice”, as a higher-order sign, but then collapsing this sign into the proper name “apple juice” of a certain liquid, so that it becomes a first-order sign. I believe that higher-order concepts, that is, abstractions, are necessary to have as goals for open-ended superintelligence to be possible, and that these higher-order signs cannot be ordered coherently at all.
I hope that, for the purposes of this essay, fully understanding these concepts will be unnecessary, as I believe there are analogous ones I can use to make the same point. Readers interested in how this line of thought developed can consult my previous posts on these topics, which flesh out these concepts.
Toy Model
Let’s imagine an artificial intelligence that operates primarily through the use of three functions (a rough sketch in code follows the definitions below):
Function 1: This function correlates a set of internal variables (e.g. variables A-Z) to real-world data (e.g. signals 1-50), such as the signals received through a sensor or the patterns between those signals. This function is analogous to the training of a neural net such as the original perceptron. If, say, the internal variables are 𝚿 and the real-world data is 𝚾, this function would be 𝚿 = g(𝚾).
Function 2: This function takes the variables output by function 1 and outputs a matrix of the relationships between the variables, such that if we had variables B, C…Z we could get the value of A, or some approximation of it. This representation of the relationships between the variables will be 𝛀, such that 𝛀 = f(𝚿), where 𝛀₁ is A = f(B, C…Z), 𝛀₂ is B = f(A, C…Z), and so on. 𝛀, by articulating a relationship between various categories of data/processed signals, is essentially a world model.
Function 3: This function orders all possible input-function pairs, that is, pairs of 𝚾 values (drawn from some range) and 𝛀 functions, into a list according to some rule[1].
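To make the setup concrete, here is the rough sketch promised above. Everything specific in it, the 26 binary internal variables, the 50 raw signals, the random linear encoder, the majority-vote predictors, and the scoring rule, is an illustrative assumption of mine, not part of the argument; the point is only how the three functions compose.

```python
import numpy as np

rng = np.random.default_rng(0)

# Function 1 (g): correlate internal variables Psi (A..Z, 26 of them)
# with raw real-world data X (say, 50 signals). Here g is just a fixed
# random linear map thresholded to bits -- a stand-in for a trained,
# perceptron-like encoder.
W = rng.normal(size=(26, 50))
def g(x):
    return (W @ x > 0).astype(int)            # Psi = g(X)

# Function 2 (f): from Psi, build predictors of each variable from the
# others, Omega_i : (all variables except i) -> variable i. Here each
# Omega_i is a trivial majority-vote predictor, again just a stand-in.
def make_omega(i):
    def omega(psi):
        others = np.delete(psi, i)
        return int(others.sum() > len(others) // 2)
    return omega

omegas = [make_omega(i) for i in range(26)]   # Omega = f(Psi)

# Function 3: order (X, Omega_i) pairs by some rule -- here, simply by
# the value Omega_i predicts for a given X (an arbitrary choice of rule).
def rank_pairs(x):
    psi = g(x)
    scored = [(omega(psi), i) for i, omega in enumerate(omegas)]
    return sorted(scored, reverse=True)

x_now = rng.normal(size=50)
print(rank_pairs(x_now)[:5])                  # top of the list for this X
```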
The big question is whether the rule that function 3 uses can order 𝛀 in a way that makes the list coherent. Without going too deep into the criteria for coherency (completeness, transitivity, etc.), it’s clear that the 𝚾-input and 𝛀-function pairs will need to be distinguishable from each other in such a way that a single element cannot occupy several different positions on the list relative to other elements; in other words, it cannot be repeated at non-adjacent positions.
If we plug function 1 into function 2, we get something like 𝛀 = f(g(𝚾)), which is all well and good, and seems to suggest that, given a certain value of 𝚾, we are guaranteed to be able to order 𝛀. But there is a problem of timing here. Because it depends on the outputs of function 1, function 2 must necessarily come after function 1 in time. So really, we have to say 𝛀ₜ = f(g(𝚾ₜ₋₁)). Time, due to this causal link, always and everywhere separates the function which correlates internal representations to references in the external world from the functions which organize those representations.
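A self-contained snippet (same illustrative assumptions as the sketch above) makes the lag visible: whatever ordering is available at time t is built from a 𝚿 derived from the previous snapshot of 𝚾, while the world has already moved on.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(26, 50))             # stand-in encoder, as before

def g(x):                                 # function 1: Psi = g(X)
    return (W @ x > 0).astype(int)

# Whatever rule function 3 uses, the Psi it ranks with was computed from
# the previous snapshot of the world, i.e. Omega_t = f(g(X_{t-1})).
x_prev = rng.normal(size=50)              # X at time t-1
psi_ranked_on = g(x_prev)                 # what the ordering is built from

x_now = rng.normal(size=50)               # X at time t, the world "now"
psi_now = g(x_now)                        # what the ordering "should" reflect

print("ordering built on stale Psi:", not np.array_equal(psi_ranked_on, psi_now))
```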
Because of this lapse in time, when ordering 𝛀 we are not necessarily dealing with specific values of 𝛀, where 𝚾 is given, but with 𝛀 itself as a function (e.g. A = f(B, C…Z)). This is a fundamental lesson of semiotics which I hope to illustrate: the connection between a given sign, a given variable, and its reference is always arbitrary. The function which establishes this correlation is historically contingent, particularly because the external world cannot be totally controlled relative to the internal representations. In principle, the variables A, B, C…Z can be correlated to anything, and if you try to rank 𝛀 using a shared range of possible values for these variables, it quickly becomes impossible. A function (e.g. 𝛀₁) could quickly come to mean something totally different from its normal operation, that is, its operation given a certain state of 𝚿, if the values of the variables involved are shifted around, such as by a change between 𝚿ₜ₋₁ and 𝚿ₜ.

If each function of 𝛀 has the same number of possible outputs, and the inputs are not determined, then it is impossible to distinguish between them. If they have different numbers of outputs, you could potentially tell the 𝛀 functions apart by the relative change in Shannon entropy between input and output, i.e. by the number of input and output variables. If we assume that each variable can be either 1 or 0, then the function A = f(B, C…Z) takes 25 bits as input and outputs 1 bit, and so is a Shannon-entropy-reducing function. But this too becomes impossible in a world with the first law of thermodynamics: any logical operation that reduces the Shannon entropy of the operation will also produce Landauer heat, and if the function is taken as a material thing, we would have to include this temperature change as a potential message vector, putting us right back where we began.
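To put numbers on the entropy point, here is a standalone sketch. Parity and majority are purely illustrative choices of 𝛀-like functions with identical arity, and the temperature is an assumption made only for the sake of the calculation; the printed figure is the Landauer lower bound on the heat dissipated when the 24 discarded bits are erased.

```python
import math

K_B = 1.380649e-23          # Boltzmann constant, J/K

# Two different candidate Omega functions over binary variables B..Z
# (25 inputs, 1 output each). Their arity is identical, so input/output
# counts alone cannot tell them apart; only knowing the inputs could.
def omega_parity(bits):      # e.g. "A is the parity of B..Z"
    return sum(bits) % 2

def omega_majority(bits):    # e.g. "A is the majority vote of B..Z"
    return int(sum(bits) > len(bits) // 2)

# Both map 25 bits down to 1 bit: a Shannon-entropy-reducing operation.
# Landauer's principle gives a floor on the heat dissipated in erasing
# the 24 discarded bits (T = 300 K assumed for illustration).
erased_bits = 25 - 1
T = 300.0
landauer_heat = erased_bits * K_B * T * math.log(2)
print(f"minimum dissipation: {landauer_heat:.2e} J")   # ~6.9e-20 J
```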
Now, obviously this is not much of a problem in ordinary situations: as you decrease the time between functions 1, 2, and 3, you’ll be able to better and better approximate a value of 𝛀 given, naively, the previous value of 𝚾. Similarly, you can maintain secondary world models for the probability that some estimate of 𝚾 is accurate, including by using Bayesian methods.
One problem we run into is when a value exists only in the world of 𝛀 but not in 𝚾. 𝛀, after all, is created as a function meant to approximate the relationships between all the variables 𝚿. If you started plugging arbitrary values into 𝛀 that are not found in 𝚾, that is, not found among the empirically recorded values of 𝚾, you could start getting values that weren’t originally included in 𝚾 but which might, or might not, correspond to something “real”[2]. This is essentially the process of extrapolation, the logical possibility of all abstraction: that we might imagine, anticipate, or hallucinate new things. I might never have seen a frog in my bathroom, but I've seen bathrooms and frogs and can picture it just fine.
Let's say that for 𝛀₁ there's a common range of values of 𝚾 which can produce the same desired result but which can also produce less desirable results. In this case, there is no information about which 𝚾 is more likely. Now, importantly, the AI would not be indifferent between (𝛀₁, 𝚾₁), (𝛀₁, 𝚾₂), and (𝛀₁, 𝚾₃): depending on the value of 𝚾, that is, of the real-world data relative to its internal variables, 𝛀₁ could take 1st, 4th, or 10th place on its list. And no, this wouldn't be comparable to a lottery with given probabilities for each outcome: the epistemic status of each possibility is the same, because they were all created via the same method and no additional information about these phantom 𝚾 values is forthcoming.
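Here is a minimal sketch of the instability being claimed, with a made-up scoring rule and made-up numbers: the same 𝛀₁, paired with three phantom 𝚾 values of identical epistemic status, lands in 1st, 4th, or 10th place.

```python
# Hypothetical scoring rule: rank (Omega, X) pairs by the value produced.
def omega_1(x):
    return x * 2                     # stand-in for A = f(B, C, ..., Z)

phantom_xs = [5, 2.5, -3]            # candidate X values, none more likely

# Nine other Omega-X pairs with fixed, known scores (arbitrary numbers).
other_scores = [9, 7, 6, 4, 3, 1, 0, -1, -4]

for x in phantom_xs:
    scores = sorted(other_scores + [omega_1(x)], reverse=True)
    rank = scores.index(omega_1(x)) + 1
    print(f"with X = {x}: Omega_1 ranks {rank} of {len(scores)}")
# Output: ranks 1, 4, and 10 -- yet nothing favors one phantom X over
# another, so no single position on the list can be assigned to Omega_1.
```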
Attempting to rank 𝛀 without knowing its potential outputs means you’re not ranking it by the properties of what it does, but by some arbitrary representation of the 𝛀 functions, such as a description of 𝛀 or a numbering scheme such as Gödel’s. If we introduce a new function to create new representations of the original 𝛀 functions, we’re just moving the same problem around, as the function which would rank these new representations is still causally dependent on the function which correlates function and representation together. Indeed, this problem still appears even if we just attempt to rank 𝚿 or 𝚾; I only used function 2 to help illustrate the problem because, in that case, it’s more obvious that adding Bayesian probability isn’t helpful, since function 2 is itself some model of the world which could have the same epistemic status as any Bayesian model. Without function 2, without knowledge of other possibilities for 𝚾, it’s not obvious how 𝚾 could vary outside of the range established by its history. It takes a larger awareness to know that the external world, or our knowledge of it, can shift in ways not recorded in any specific system.
The next logical objection is that one can prevent this problem by ensuring that one’s world model itself is coherent, so that there will always be a certain value of 𝚾 which is more or less likely. Which, fair enough, but inevitably this runs into the problem of logical omniscience: any actual agent that can exist will have limits on knowing what the logical consequences or implications of its beliefs are. There are some attempts to rank logical possibilities, e.g. whether a program with a given input will halt, using Bayesian probability, but ultimately, as I said before, this only pushes the problem around. Any ranking function is causally dependent on some correlating/encoding function that gives the variables being ranked an actual meaning. No matter how many levels of Bayesian estimation we stack, we'll eventually reach a point where a range of input values has the same epistemic status. With regard to logical omniscience, this is the key question of semantics: whether a logical proposition really means the thing you think it means. Indeed, function 1 is more or less the problem of semantics. It’s a bit like the joke from 30 Rock about a made-up game show called “Homonym”, where contestants must guess the meaning of a spoken word that could be one of several homophones (not, strictly, homonyms), and the guess is always wrong. The joke is that each guess is meaningless because there is no right answer; there is, after all, no way to distinguish between the meanings without additional information.
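As a toy illustration of how Bayesian layering "pushes the problem around" (entirely my own construction, not anyone's proposed model): if nothing discriminates between the phantom 𝚾 values, each additional level of estimation just passes the same indifference upward.

```python
import numpy as np

phantom_xs = np.array([5.0, 2.5, -3.0])

def posterior(levels):
    # Base level: equal credence over the phantom values, since nothing
    # favors any of them.
    p = np.full(len(phantom_xs), 1.0 / len(phantom_xs))
    for _ in range(levels):
        # Each higher-level model reweights by a likelihood that, absent
        # any new information about X, is constant across the candidates.
        likelihood = np.ones_like(p)
        p = p * likelihood
        p /= p.sum()
    return p

print(posterior(1))    # [0.333... 0.333... 0.333...]
print(posterior(10))   # unchanged: extra levels add no discrimination
```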
And this problem of the fundamental uncertainty about the accuracy of function 1, in this case because of its displacement in time, is the problem of only knowing the real world via representations and abstractions, like language. The problem I'm gesturing towards is not whether the external world exists, but simply whether we can treat objects at the epistemic limit of semantics as anything other than equivalent, an equivalence which leads to contradictory ranking results when appraising the variables that are supposed to be correlated to them.
Another potential solution is averaging over the range of possible outputs, which to me seems totally arbitrary given the sample space of possible operations you could use to get the range down to one value. Similarly, up until now we've been assuming the range of unknown possible inputs is well defined, but that too is an exercise in epistemic hubris. If it's not just the precise value that is arbitrary, but also the probability distribution and the range of values, then averaging too is a futile exercise. This also seems related to the questions of existential risk which so plague contemporary rationalism: a situation where it's clear that both the values of the relevant equations and the range of all possible values have essentially the same epistemic status.
Example of Incoherent but not Dominated
As I said earlier, in most situations this uncertainty doesn’t matter for rationalist models of mind, because you can naively assume that the most recent value of 𝚿 (the coded pairs of internal values and external data), or some modeled value of 𝚿 based on known rules, is accurate, and be approximately correct, since we have good data and theory on what 𝚿 should be. But there are some situations where it does matter, and these situations are actually very important to our understanding of superintelligence.
Take, for example, a scientist pursuing research at the theoretical frontier. The scientist has some set of functions 𝛀 which gives them the expectation that they might observe new types of 𝚾 signals; they know a range of situations in which they might observe such a signal, and they have expectations about the range of signals they might observe. Depending on what the signal means, it could have drastically different consequences (e.g. whether alien life is detected by a telescope array, and how it is detected). Choosing function 𝛀ₙ means pursuing the search for alien life, even though the possible consequences of this search are ranked wildly differently on the scientist's list of preferences (e.g. alien invasion, peaceful first contact, winning the Nobel Prize but nothing else happening, etc.). Although there are many arguments about the likelihood of alien life and what type of life it would be, there is not much difference between the epistemic status of the mainstream scientific arguments, and our hypothetical scientist sees no difference at all. In choosing to search for aliens, they are not ranking their preferences coherently. And yet, is the scientist choosing a dominated strategy?[3] What are they losing by choosing this path? Obviously, they are losing time they could spend doing other things, but if their preference is for the function of searching for alien life itself, incoherent consequences be damned, it doesn't really matter.

What's important is that the scientist's choice is based not just on what they know about the life of a scientist but also on the thrill of the unknown and of discovery itself. They are taking a leap into the unknown, and this leap is always the choice of accepting an operation where either the input or the output is totally uncertain. And this unknown isn't something that shrinks asymptotically as we learn about the world; the more we know about the world, the more things we're aware that we're ignorant of. Taking a synchronic approach to this analysis, that is, looking at a particular slice of time, there will always be this realm of indeterminate inputs somewhere, and it's in the realm of the synchronic that decision making takes place: any given function needs a specific input, and the function which selects an action or goal to pursue is no different.
One nice side effect of the scientist's choice of goal is that being the first to discover something is a good way to prevent your strategies from being dominated: if some logic or fact is asymmetrically known, it can be used against the ignorant. Thus, as a byproduct of this incoherent goal-seeking behavior, dominated strategies can actually be avoided.
Science is necessarily full of such motivations, the search for knowledge being its own goal. This search for knowledge for its own sake entails setting goals which exist only as abstractions, which do not yet have specific values associated with them, values which can only be known after the fact of a discovery. This search for things which do not yet exist within our system of knowledge is how we learn new things, or, in the words of the Google DeepMind team, produce “novel artifacts”.
According to the DeepMind researchers, and I think also according to the common-sense understanding of the term, superintelligence should be open-ended. They define it like this: “From the perspective of an observer, a system is open-ended if and only if the sequence of artifacts it produces is both novel and learnable”. Superintelligence, if not logically omniscient, should be capable of learning new logical facts, of advancing science, and of teaching us these new discoveries. And in order to make such discoveries consistently, this superintelligence would need to take as goals functions with indeterminate inputs, since it does not yet know which input-function pair will produce the desired result.
Therefore, the scientist working on the cutting edge, whether human or AI, will necessarily cease to approximate a coherent utility function and utility maximizing behavior.
Of course, I also think, based on experience, that most people do not exhibit utility-maximizing behavior based on coherent preferences either. So if coherent preferences apply neither to normal human intelligence nor to superintelligence, they don't seem like a very useful concept for understanding intelligence at all. Indeed, although I said earlier that rationalist models of mind are workable in ordinary situations, no human experience, no matter how mundane, is totally free of things out of the ordinary. People face epistemic limits purely in the normal vicissitudes of life, and they cast their lot regardless. As for how they decide in these cases, my opinion is that it is through the correlation of these various operations with internal concepts like their sense of self, or of the good, in all their contradictions.
This post originally appeared on my blog, Pre-History of an Encounter.
[1] Assuming that some of the variables A-Z are something the AI can control, it’ll be capable of agentic behavior.
[2] This frequentist example is just for illustration; the underlying principle here is the semantic uncertainty which would exist under any epistemic approach.
[3] It’s worth noting that what exactly this strategy is, relative to other utility functions, depends on the units of differentiation between choices. One objection to characterizing this as a non-dominated strategy based on a non-coherent utility function might be that, since this choice is compatible with other, coherent, utility functions, it shouldn’t count; however, if the scientist’s actions are sufficiently differentiated from these other choices, based on the arbitrary assignment of each “unit” of choice, then there is no problem at all. Just as well, the arbitrariness of the units here strikes me as a potential general problem for conjectures that tie utility functions to specifically dominated or non-dominated strategies.