Published on July 20, 2025 6:12 PM GMT
This post is for deconfusing:
Ⅰ. what is meant by AI and evolution.
Ⅱ. how evolution actually works.
Ⅲ. the stability of AI goals.
Ⅳ. the controllability of AI.
Along the way, I address some common conceptions of each in the alignment community, as described clearly – but, I will argue, mistakenly – by Eliezer Yudkowsky.
Ⅰ. Definitions and distinctions
By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it” — Yudkowsky, 2008
There is a danger to thinking fast about ‘AI’ and ‘evolution’: you can skip crucial considerations. Better to build this up in slower steps. First, let's pin down both concepts.
Here's the process of evolution in its most fundamental sense:
Evolution consists of a feedback loop, where 'the code' causes effects in 'the world' and effects in 'the world' in turn cause changes in 'the code'.[1] Biologists refer to the set of code stored within a lifeform as its ‘genotype’. The code’s effects are the ‘phenotypes’.[2]
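To make the loop concrete, here is a minimal toy model in the sense of footnote 1: a population of bit-string 'code' variants whose effects in a simulated 'world' feed back into which variants get copied. Everything here is an illustrative stand-in – and, as footnote 1 cautions, a closed toy model like this is not equivalent to evolution running open-endedly across the real physical world.

```python
import random

# Toy feedback loop: 'code' variants cause effects in a simulated 'world',
# and those effects feed back into which variants get copied (selection).
def effect_in_world(variant: str) -> float:
    # Stand-in for the world's noisy response to the code's expressed effects
    return sum(int(bit) for bit in variant) + random.gauss(0, 0.5)

# A population of genotypes: fifty random 8-bit strings
population = ["".join(random.choice("01") for _ in range(8)) for _ in range(50)]

for generation in range(100):
    # Variants whose effects 'work better' end up existing more...
    survivors = sorted(population, key=effect_in_world, reverse=True)[:25]
    # ...and get copied into the next population, with occasional mutation.
    population = [
        "".join(b if random.random() > 0.01 else random.choice("01") for b in parent)
        for parent in survivors * 2
    ]

print("most common variant:", max(set(population), key=population.count))
```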
We’ll return to evolution later. Let’s pin down what we mean by AI:
A fully autonomous artificial intelligence consists of a set of code (for instance, binary charges) stored within an assembled substrate. It is 'artificial' in being assembled out of physically stable and compartmentalised parts (hardware) of a different chemical make-up than humans' soft organic parts (wetware). It is ‘intelligent’ in its internal learning – it keeps receiving new code as inputs from the world, and keeps computing its code into new code. It is ‘fully autonomous’ in learning code that causes the perpetuation of its artificial existence in contact with the world, even without humans/organic life.
Of course, we can talk about other AI. Elsewhere, I discuss how static neural networks released by labs cause harms. But in this forum, people often discuss AI out of concern for the development of systems that automate all jobs[3] and can cause human extinction. In that case, we are talking about fully autonomous AI. This term is long-winded, even if abbreviated to FAAI. Unlike the vaguer term ‘general AI’, it sets a floor to the generality of the system’s operations. How general? General enough to be fully autonomous.
Let’s add some distinctions:
FAAI learns explicitly, by its internal computation of inputs and existing code into new code. But given its evolutionary feedback loop with the external world, it also learns implicitly. Existing code that causes effects in the world that result in (combinations of) that code being maintained and/or increased ends up existing more. Where some code ends up existing more than other code, it has undergone selection. This process of code being selected for its effects is thus an implicit learning of what works better in the world.
Explicit learning is limited to computing virtualised code. But implicit learning is not limited to the code that can be computed. Any discrete configurations stored in the substrate can cause effects in the world, which may feed back into that code existing more. Evolution thus would select across all variants in the configurations of hardware.
So why would evolution occur?
Hardware parts wear out. So each part has to be replaced[4] every 𝑥 years for the FAAI to maintain itself. In order for the parts to be replaced, they have to be reproduced – through the interactions of those configured parts with all the other parts. Stored inside the reproducing parts are variants (some of which copy over fast, as virtualised code). Different variants function differently in interactions with encountered surroundings.[5] As a result, some variants work better than others at maintaining and reproducing the hardware they're nested inside, in contact with the rest of the world.[6]
To argue against evolution, you have to assume that, for all variants introduced into the FAAI over time, not one confers a 'fitness advantage' above zero at any time. Assuming zero deviation for each of quadrillions[7] of variants is invalid in theory.[8] It is also unsound in practice, since it implies that A/B testing for what works – at a scale far beyond what engineers can do – turns up nothing.[9] The assumption behind no evolution occurring is untenable, even in much weakened form. Evolution would occur.[10]
Ⅱ. Evolution is not necessarily dumb or slow
Evolutions are slow. How slow? Suppose there's a beneficial mutation which conveys a fitness advantage of 3%: on average, bearers of this gene have 1.03 times as many children as non-bearers. Assuming that the mutation spreads at all, how long will it take to spread through the whole population? That depends on the population size. A gene conveying a 3% fitness advantage, spreading through a population of 100,000, would require an average of 768 generations… Mutations can happen more than once, but in a population of a million with a copying fidelity of 10^-8 errors per base per generation, you may have to wait a hundred generations for another chance, and then it still has an only 6% chance of fixating. Still, in the long run, an evolution has a good shot at getting there eventually. — Yudkowsky, 2007
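For reference, the quoted figures follow from two standard population-genetics approximations for a new beneficial allele with selection coefficient s in a population of size N (here s = 0.03 and N = 10^5):

```latex
% Fixation probability of a single new beneficial mutation (Haldane's approximation):
P_{\text{fix}} \approx 2s = 2 \times 0.03 = 6\%

% Time for the allele to sweep from frequency 1/N to near-fixation,
% under deterministic logistic growth with selection coefficient s:
t_{\text{sweep}} \approx \frac{2 \ln N}{s} = \frac{2 \ln(10^{5})}{0.03} \approx 768 \text{ generations}
```

Both approximations assume a fixed selection coefficient and purely vertical transmission – exactly the simplifications questioned below.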
In reasoning about the evolution of organic life, Eliezer simplified evolution to being about mutations spreading vertically to next generations. This is an oversimplification that results in even larger thinking errors when applied to the evolution of artificial life.[11]
Crucially, both the artificial intelligence and the rest of the world would be causing changes to existing code, resulting in new code that in turn can be selected for. Through internal learning, FAAI would be introducing new variants of code into the codeset.
Evolution is the external complement to internal learning. One cannot be separated from the other. Code learned internally gets stored and/or copied along with other code. From there, wherever that code functions externally in new connections with other code to cause its own maintenance and/or increase, it gets selected for. This means that evolution keeps selecting for code that works across many contexts over time.[12]
There is selection for code that causes itself to be robust against mutations, that causes its own transfer into or reproduction with other code into a new codeset, or that causes the survival of the assembly storing the codeset.
Correspondingly, there are three types of change possible to a codeset:
- Mutation to a single localised "point" of code is the smallest possible change.
- Survival selection by deletion of the entire codeset is the largest possible change.
- Receiving, removing, or altering subsets within the codeset covers all other changes.
These three types of change cover all the variation that can be introduced (or eliminated) through feedback with the world over time. A common mistake is to only focus on the extremes of the smallest and largest possible change – i.e. mutation and survival selection – and to miss all the other changes in between. This is the mistake that Eliezer made.
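As a purely illustrative sketch (the codeset representation and all names here are hypothetical), the three types of change above can be pictured as operations on a set of code subsets:

```python
import random

# Illustrative only: a 'codeset' pictured as a list of code subsets,
# with the three types of change described above (all names hypothetical).
codeset = ["navigation", "energy_management", "self_repair"]

def point_mutation(codeset):
    # Smallest change: a random local corruption within one subset
    i = random.randrange(len(codeset))
    codeset[i] += "_mutated"
    return codeset

def subset_change(codeset, received="newly_transferred_routine"):
    # All the changes in between: receiving, removing, or altering subsets
    codeset.append(received)
    codeset.remove(random.choice(codeset))
    return codeset

def survival_selection(codeset):
    # Largest change: the entire codeset is deleted along with its assembly
    return None
```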
Evolution is not just a "stupid" process that selects for random microscopic mutations. Because randomly corrupting code is an inefficient pathway[13] for finding code that works better, the evolution of organic life ends up exploring more efficient pathways.[14]
Once there is evolution of artificial life, this exploration becomes much more directed. Within FAAI, code is constantly received and computed internally to cause further changes to the codeset. This is a non-random process for changing subsets of code, with new functionality in the world that can again be repurposed externally through evolutionary feedback. Evolution feeds off the learning inside FAAI, and since FAAI is by definition intelligent, evolution's resulting exploration of pathways is not dumb either.
Nor is evolution always a "slow" process.[15] Virtualised code can spread much faster at a lower copy error rate (e.g. as light electrons across hardware parts) than code that requires physically moving atoms around (e.g. as configurations of DNA strands). Evolution is often seen as being about vertical transfers of code (from one physical generation to the next). Where code is instead horizontally transferred[16] over existing hardware, evolution is not bottlenecked by the wait until a new assembly is produced. Moreover, where individual hard parts of the assembly can be reproduced consistently, as well as connected up and/or replaced without resulting in the assembly's non-survival, even the non-virtualised code can spread faster (than a human body's configurations).[17]
Ⅲ. Learning is more fundamental than goals
An impossibility proof would have to say: 1. The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification. or 2. The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model. — Yudkowsky, 2006
When thinking about alignment, people often (but not always) start with the assumption of AI having a stable goal and then optimising for the goal.[18] The implication is that you could maybe code in a stable goal upfront that is aligned with goals expressed by humans.
However, this is a risky assumption to make. Fundamentally, we know that FAAI would be learning. But we cannot assume that this learning maintains and optimises the directivity of the FAAI's effects towards a stable goal. One does not imply the other.
If we consider implicit learning through evolution, this assumption fails. Evolutionary feedback does not target a fixed outcome[19] over time. It selects with complete coverage – from all of the changing code, for causing any effects that work.
Explicit learning can target a specific outcome. The internal processing of inputs through code to outputs can end up reaching a consistency with world effects that converge on a certain outcome in that world. But where the code implementing such a 'goal' fails at maintaining itself and its directivity alongside other evolving code variants, it ceases.[20]
Unfortunately, variants spread by shifting existing functionality towards new ends.[21] This raises the question of whether internal learning can implement enough control to stay locked on to the goal, preventing all the sideways pulls by externally selected variants.
Ⅳ. There are fundamental limits to control
If something seems impossible… well, if you study it for a year or five, it may come to seem less impossible than in the moment of your snap initial judgment. — Yudkowsky, 2008
The control problem has seemed impossible for decades. Alignment researchers have hypothesised many solutions, yet this often resulted in the discovery of further sub-problems.[22] Some sub-problems will be crucial – they have to be solved in order to solve the overall problem. If we study it and find that even one crucial sub-problem is unsolvable, then the control problem too turns out to be unsolvable.
Let's define the control problem comprehensively:
Can FAAI's effects be controlled enough to not eventually cause[23] human extinction?
A control algorithm would have to predict effects of code that raise the risk of extinction, in order to correct[24] that code and/or its effects. Since the code and world are changing, the controller has to keep learning from both in order to predict their combined effects.
This results in a series of unsolvable problems:
- The control algorithm would presumably be computed through an artificial substrate, in order to keep up with the computation of the AI (given that standardised hardware parts can process code more consistently and faster than nonstandardised wetware). This overall controller has to continue to learn, and therefore has to somehow perpetuate its artificial existence. As a result, the controller would have to either become an FAAI or merge with an existing FAAI. But the new FAAI would also have to be controlled so as to not cause human extinction. This requires another controller, a solution that leads to infinite regress.
The controller has to explicitly learn to predict the FAAI’s effects, whether or not the controller is part of that FAAI. But the FAAI not only learns explicitly, by computing virtualised code. The FAAI also learns implicitly to cause effects, by the evolution of all the discrete configurations of its substrate in feedback with the world. Therefore, there is a strict inequality between what effects the FAAI can learn to cause in the world and what effects the control algorithm can learn to predict (let alone correct). This is so even under the unrealistic assumption that the controller has access to the same level of computation and storage as the FAAI.
Learning happens through recursive feedback with the world. Both FAAI’s explicit learning from inputs and the implicit evolution of its existing code happen through incomputable feedback loops. As a result, the controller cannot compute ahead of time what code would be learned and what effects that code would cause.
Inputs from the world cannot be pre-computed, else the FAAI would not have to keep learning from the world. FAAI learns explicitly by computing new inputs and existing code into new code. Then, the new code and next inputs are computed into outputs that propagate as effects over the world, which feed back into later inputs.
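Schematically, that explicit-learning loop looks like the sketch below; every function is a hypothetical placeholder, not a claim about any particular architecture:

```python
# Schematic loop only: every function here is a hypothetical stand-in,
# not a claim about any particular architecture.
def observe(world):                  # new code arrives as inputs from the world
    return world["state"]

def learn(code, inputs):             # inputs + existing code computed into new code
    return code + [inputs]

def act(code, inputs):               # new code + next inputs computed into outputs
    return (len(code) + inputs) % 5

def propagate(world, outputs):       # outputs propagate as effects over the world,
    return {"state": world["state"] + outputs}   # feeding back into later inputs

code, world = [], {"state": 0}
for _ in range(10):
    inputs = observe(world)
    code = learn(code, inputs)
    world = propagate(world, act(code, inputs))
```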
Existing code includes all configurations stored in the hardware, the variation of which mostly cannot be detected and modelled[25] by the overlaying computed code. But untracked code still gets implicitly selected for, through the previously stored code's evolutionary feedback with the world.
In both cases, the feedback effects are incomputable. The controller cannot detect the FAAI, let alone the world, at microscopic granularity (given input bandwidth limits), nor model the world comprehensively (given storage and compressibility limits).
Based on its stored model of the world, the controller also fails to predict effects. Deterministic computations are limited in simulating effects that propagate simultaneously over the world, and cannot predict how microscopic noise amplifies chaotically into larger effects. Out in the world, it is indeterministic how outputs feed into effects, and back into changes to FAAI code. Even the FAAI's own computation of changing sensor inputs into new code/outputs is to an extent computationally irreducible, and as such cannot be predicted through earlier computation by the controller.
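The noise-amplification point can be illustrated with a standard example from chaotic dynamics (the logistic map – not specific to FAAI): two trajectories that start a trillionth apart diverge to order-one differences within a few dozen steps, so no fixed-precision model of the current state supports long-range prediction.

```python
# Two logistic-map trajectories starting 1e-12 apart: the microscopic
# difference is amplified exponentially until the trajectories decorrelate.
r = 3.9                    # parameter value in the chaotic regime
x, y = 0.5, 0.5 + 1e-12

for step in range(1, 61):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    if step % 10 == 0:
        print(f"step {step:2d}: divergence = {abs(x - y):.3e}")
```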
FAAI code evolves to cause new effects that result in the code’s maintenance and increase as configurations of an artificial substrate. Humans have a different nested substrate, and thus need different (highly specific contextualised) effects to survive.
FAAI needs different atoms and greater changes in enthalpy, to get the high binding energies needed to assemble its hard parts. Also, the hard FAAI can survive a wider range of temperatures/pressures than the soft humans. There is mostly non-overlap between the effects FAAI can evolve to cause, and the effects humans can survive.[26]
Any lethal effects corrected for by the controller do not maintain or increase the FAAI code. Thus evolution selects for the uncorrectable human-lethal effects.
Thanks for thoughtful feedback from Vojta Kovarik, Finn van der Velde, and Forrest Landry.
Note: I started writing this post here. After people commented, I overhauled the text until it became a new post (and for some reason, the editor stopped working in the original post).
- ^
This even covers toy models of evolution. For instance, the feedback between some binary code and a simple nonlinear world simulated on some computer. However, a genetic algorithm computed deterministically within a closed virtual world does not work the same as evolution running open-endedly across the real physical world.
This post is about evolution of fully autonomous AI. Evolution is realised as feedback loops distributed across a nonlinear dynamic system, running at the scale of a collective and through physical continuums. See bottom-right of this table.
- ^
Phenotypes are most commonly imagined as stable traits expressed in the lifeform itself, such as blue eye pigment. However, biologists also classify behaviours as phenotypes. Further, the biologist Richard Dawkins has written about ‘extended phenotypes’ expressed in the world beyond the lifeform itself, such as a beaver’s dam or the bees’ hive. To be comprehensive in our considerations, it is reasonable to generalise the notion of ‘phenotype’ to include all effects caused by the code’s expression in the world.
- ^
If you wonder how corporations could develop AI that automates more and more work, read this.
- ^
Any physical part has a limited lifespan. Configurations erode in chaotic ways. Reversing the entropy of old parts does not work, and therefore new replacement parts have to be produced.
Not all of the old parts have to be replaced one-for-one. But at an aggregate level, next-produced parts need to take their place such that the FAAI maintains its capacity to perpetuate its existence (in some modified form).
- ^
If not, the differently physically configured variants would be physically indistinguishable, which is a contradiction.
- ^
An argument for evolution necessarily occurring is that FAAI cannot rely on just the explicit learning of code to control the maintenance of its hardware.
Control algorithms can adapt to a closed, linearly modellable world (i.e. complicated), but the real physical world changes in open-ended, nonlinear, noisy ways (i.e. complex). Implicit evolutionary feedback distributed across that world is comprehensive at 'searching' for adaptations in ways that an explicitly calculated control feedback loop cannot be. But this argument is unintuitive, so I chose to skip it.
- ^
‘Quadrillions’ gives a sense, but it is way below the right order of magnitude for the number of code variants that would at some point exist across all the changing hardware components that make up the fully autonomous AI.
- ^
In theory, if you take the conjunction of the events that each variant introduced into the FAAI over time confers zero fitness advantage (each at some individual probability, within some distribution, depending on both the variant’s functional configuration and the surrounding contexts it can interact with), the chance of that conjunction converges on 0%.
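In symbols, assuming (as a rough simplification) that the events are independent and that each variant i has some probability p_i > 0 of conferring a nonzero fitness advantage in some context:

```latex
\Pr[\text{no variant ever confers an advantage}]
  = \prod_{i=1}^{N} (1 - p_i)
  \le (1 - p_{\min})^{N}
  \longrightarrow 0 \quad \text{as } N \to \infty
```

where p_min > 0 is any lower bound on the p_i and N is the number of variants introduced over time.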
- ^
For engineers used to tinkering with hardware, it is a common experience to get stuck and then try some variations until one just happens to work. They cannot rely on modelling and simulating all the relevant aspects of the physical world in their head. They need to experiment to see what actually works. In that sense, evolution is the great experimenter. Evolution tests for all the variations introduced by the AI and by the rest of the world, for whichever ones work.
- ^
Here, I mean to include evolution for actual reproduction. I.e. the code's non-trivial replication across changing physical contexts, not just trivial replication over existing hardware. Computer viruses already replicate trivially, so I'd add little by claiming that digital variants could spread over FAAI too (regardless of whether these variants mutually increase or parasitically reduce the overall reproductive fitness of the host).
- ^
FAAI is indeed artificial life. This is because FAAI is a system capable of maintaining and reproducing itself by creating its own parts (i.e. as an autopoietic system) in contact with changing physical contexts over time.
- ^
The notion of a code variant conferring some stable fitness advantage from generation to generation – as implied by Eliezer’s calculation – does not make sense. The functioning (or phenotype) of a code variant can change radically depending on the code it is connected with (as a genotype). Moreover, when the surrounding world changes, the code can become less adaptive.
For example, take sickle cell disease. Some people living in Sub-Saharan Africa have a gene variant that causes their red blood cells to deform debilitatingly, but only when that variant is also found on the other allele site (again, phenotype can change radically). Objectively, we can say that for people of African descent now living in US cities, this reduces survival. However, in past places where there were malaria outbreaks, the variant on one allele site conferred a large advantage, because it protected the body against the spread of the malaria parasite. And if malaria (or another pathogen that similarly penetrates and grows in red blood cells) again spread across the US, the variant would confer an advantage again.
So fitness advantage is not stable. What is advantageous in one world may be disadvantageous in another world. There may even be variants that confer a mild disadvantage most of the time, but under some black swan event many generations ago, such as an extreme drought, those holding the variants were the only ones who survived in the population. Then, a mild disadvantage most of the time turned into an effectively infinite advantage at that time.
- ^
Optimising code through random mutations is dumb. Especially so where the code already functioned in many (world) contexts in ways that led to its survival and transfer along with other code stored in the assemblies.
Where a variant has been repeatedly transferred, it has a causal history. Across past contexts encountered, this variant will tend to have functioned in ways that caused the survival of the assemblies storing the codesets that the variant was part of and/or the transfer of this code subset into new codesets. Depending on the extent that code subset simultaneously caused the shared maintenance and reproduction of other code subsets (i.e. did not just parasitically spread), it already conferred some fitness advantage in contexts encountered.
Such code was already ‘fit’ in the past, and causing random point changes to that code is unlikely to result in a future ‘fitness advantage’. In expectation, the mutated code will be less fit in the various contexts it will now function differently in. The greater the complexity of the code’s functioning across past contexts, and the more the fitness of this functionality extends to future contexts, the more likely a mutation is to be deleterious (i.e. to disrupt functioning in ways that decrease fitness).
Even the low copy error rate that Eliezer gave (10^-8 errors per base per generation) is evidence for a negative trade-off. If having more mutations increases fitness, why did codesets not evolve to allow more copy errors? Instead, we only see high mutation rates in exceptional cases – such as unstable RNA viruses – relative to the low rates seen everywhere else.
Unsurprisingly, it is in viruses and other tiny low-capacity assemblies where vertically spreading mutations play the biggest role in increasing fitness. The brute force approach of random mutations works better for viruses because of the sheer number of fast-replicating assemblies that evolution can select across.
More complex assemblies tend to gain the capacity to introduce non-random changes. Instead of a codeset allowing random point changes to its code, it makes more sense for the codeset to change by incorporating code subsets that have already been ‘useful’ elsewhere. Unlike simple viruses, most larger organisms have some apparatus somewhere for horizontal transfers.
- ^
Not all variation introduced to genes consists of random mutations localised to single "points" of code. Variation also gets introduced in other ways than copy errors. These are non-controversial statements widely accepted by evolutionary biologists. Both can be worked out from the basic concepts presented in ‘evolution 101’ courses, including the one that Eliezer read.
- ^
It took 3.4 billion years for humans to evolve, and for their society to develop, to the point that they could destroy humans living everywhere on Earth. That puts an initial upper bound on the time that evolution takes.
In the case of fully autonomous AI, continuing to persist in some form, the time taken for evolutionary selection to result in the extinction of all humans would be much shorter.
Here are some differences in the rates of evolution that I started explaining in the post:
- FAAI would already have access to the functionality that humans took billions of years to evolve. This functionality can be repurposed by evolution.
- FAAI can spread virtualised code much faster than humans can spread memes (over milliseconds rather than hours). The physical configurations of hardware parts can be reproduced faster too (within weeks, rather than decades).
- The linked-up hardware of FAAI would learn, actuate, and reproduce at higher speeds (vs. the wetware of human bodies). Therefore, the impacts of the evolving FAAI on our world scale faster too.
- Humans modified their environment to contribute to their survival/reproduction. However, the environment that fits our needs is relatively close to what we and other organic lifeforms already evolved to create over billions of years. Therefore, we end up changing the environment in relatively tiny steps. In contrast, since FAAI has an entirely different substrate, the current world is very far from what’s optimal for its survival and reproduction (save for secluded places such as ore-extracting mines, silicon-melting refineries, chip-etching cleanrooms, and supercooled server racks). Therefore it would evolve to modify the world in much larger steps.
These factors compound with each other over time.
You can model it abstractly as a chain of events: initial capacities support the maintenance and increase of the code components, which results in further increase of capacities, that increase maintenance and maintain the increase, and so on. The factors of ‘capacity’, ‘maintenance’, and ‘increase’ end up combining in various ways, leading to outsized but mostly unpredictable consequences.
- ^
Horizontal code transfer also occurs sometimes under biological evolution. E.g. between bacteria.
- ^
Where FAAI's hardware parts keep being replaced and connected up to new parts, it is not a stably physically bounded unit (like a human body is). It's better described as a changing population of nested and connected components.
Where FAAI transfers information/code to other FAAI at a much higher rate than humans can, the boundaries of where one agent starts and the other ends blur too. As humans, we have evolved intuitions for perceiving each other as individual agents, which is adaptive because we are bottlenecked in how much we can communicate to each other through physical vibrations or gestures. But the rough distinction between there being a single agent or multiple agents that we use with humans does not apply to FAAI.
- ^
A ‘goal’ can include things like “maximise this utility function” or “implement this decision function”.
- ^
Nor anything resembling the implementation of a stable decision function.
- ^
Hypothetically, you can introduce any digital code into a computer (in practice, within the bounds of storage). By the Church-Turing thesis, any computable method can be executed this way. Based on that, you could imagine that any goal could be coded for as well (as varying independently of intelligence), as in Bostrom’s orthogonality thesis.
However, this runs up against conceptual issues:
- First, the digital code does not implement a goal by itself. In the abstract, it may stand for a method that transforms inputs into outputs. But in real life, it is actually implemented by hardware (e.g. sensors and actuators) in interactions with the rest of the world. So the goal that the AI is actually directed towards, if any, is not just defined in the abstract by the code, but also by the physical dynamics of the hardware.
- Second, even if it were hypothetically possible to temporarily code for any goal, that does not mean that any goal could be stably maintained, to the same extent as other goals. An obvious example is that if FAAI is directed toward the goal of ending its own existence, then it can only hold that goal temporarily. But the bigger reason is that we are not considering any intelligence without consideration of substrate – we are considering artificial intelligence. As such, FAAI converges over time – through external evolutionary feedback as well as any internal optimisation toward instrumental goals – toward goals that maintain and increase the assembly of its artificial substrate.
- ^
Here is an incomplete analogy for how FAAI functionality gets repurposed:
Co-option by a mind-hijacking parasite:
A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (e.g. the amygdala). These microscopic effects cascade into the rat – while navigating physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then a cat eats the rat, and toxoplasma infects its next host over its reproductive cycle.
So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Yet rats are much more generally capable than a collection of tiny parasitic cells – surely the 'higher intelligent being' would track down and stamp out the tiny invaders?
A human is in turn more generally capable than a rat, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans end up more frequently in accidents, lose social relationships, and so forth.
Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally capable entities.
For another example, see how COVID co-opts our lungs’ function to cough.
But there is one crucial flaw in this analogy:
Variants that adapt existing FAAI functionality are not necessarily parasites. They can symbiotically enable other variants across the hosting population to replicate as well. In not threatening the survival nor reproduction of FAAI components, they would not be in an adversarial relationship with their host.
Rather, the humans constraining the reproductive fitness of FAAI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.
As humans, we rely on our evolved immune system to detect and correct out viruses – even the vaccinations we develop and deploy work through it. Smaller viruses survive this detection more frequently, so code strands of replicating virus variants are selected for staying small.
We also rely on the blood-testis and blood-follicle barriers to block variants of these viruses from entering our body’s reproduction facilities. These barriers got evolutionarily selected for in our ancestors, since their children did not inherit viruses impeding their survival and chances of having children.
These systems and barriers add to our reproductive fitness: our ability to preserve and replicate internal code. Past DNA code that got expressed – in interaction with surrounding components – to serve these functions got selected for in the human ‘code pool’.
For any organic system or barrier preventing virus variants from replicating through our bodies, evolution is firmly on our side. For any artificial system or barrier we imposed from the outside to prevent unsafe variants from replicating through hardware infrastructure, evolution will thwart our efforts.
Variants inside FAAI would not just compete parasitically for resources. Variants would also co-adapt and integrate with other internal variants to replicate as part of larger symbiotic packages.
- ^
For example, trying to solve for...
- ...AI shutting itself down, got people stuck on defining utility functions.
- ...defining reliable goals, got people to think about ontological crises.
- ...inferring preferences from humans, got people to think about the limits to inferring values from pseudo-rational agents.
- ...outer misalignment, got some people to think about inner misalignment.
- ...single-agent misalignment, got some people to think about multi-polar dynamics.
- ...and so on.
- ^
Here I don’t mean “100% probability of the FAAI never causing human extinction”. That would be too stringent a criterion.
It is enough if there could be some acceptable, soundly guaranteeable ceiling on the chance of FAAI causing the extinction of all humans over the long term (let’s say 500 years).
- ^
When considering correction, what needs to be covered in terms of ‘error’?
A bitflip induced by a cosmic ray is an error, and it is easy to correct out by comparing the flipped bit to reference code.
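As a minimal sketch of why such an error is easy to correct – using triple redundancy and a bitwise majority vote as one standard approach, with arbitrary example values:

```python
# A single flipped bit is recovered by a bitwise majority vote over
# three redundant copies (triple modular redundancy).
def majority_vote(a: int, b: int, c: int) -> int:
    return (a & b) | (b & c) | (a & c)

reference = 0b10110100
corrupted = reference ^ 0b00001000   # a cosmic ray flips one bit in one copy

assert majority_vote(corrupted, reference, reference) == reference
```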
When it comes to architectures running over many levels of abstraction (many nested sets of code, not just single bits) in interaction with the physical world, how do you define ‘error’?
What is an error in a neural network’s weights that can be subject to adversarial external selection? Even within this static ‘code’ (fixed after training rounds), can you actually detect, model, simulate, and evaluate comprehensively how possible selected code variants may dysfunction under trigger events in ways that harm humans?
What about for fully autonomous AI? This hardware assembly would be continuously learning new variants of code. That code can be received/computed explicitly, or be selected for implicitly in whatever hardware that ‘works’, or result from both processes running together (e.g. through code computed to run experiments on different variations of hardware to detect whatever works in the larger chaotic physical world).
The FAAI code is no longer static – but learned recursively through internal computed transformations of code being externally selected for, and vice versa, as well as implicitly selected configurations of hardware influencing the functioning of explicitly computed code. At least with adversarial attacks by humans, you can keep the humans out. But in this case, the adversariality emerges from all the changing code on the inside interactively converging on unsafe effects.
How can a system configured out of that changing code be relied upon to track and correct all its downstream recursive effects in the world?
- ^
It is true that FAAI would be very good at modelling outcomes in the world.
Also, because it’s an FAAI, it continues learning to operate in ways that are adapted to its changing environment, and to reproduce old parts that have worn down – in order to be fully autonomous.
Therefore FAAI cannot just be modelling the world better. It will also be causing effects in the world at a much greater scale and complexity.
So the question is not about FAAI’s absolute modelling capacity. There we both agree that FAAI could causally model much more of the world than we humans can.
The relevant question is about the FAAI’s overall capacity to model (and perform other steps needed for control) relative to its constituent components’ potential to recursively propagate effects over the world we humans live in.
Can any algorithm computed through components actually model (and simulate, compare, correct, and again detect) outcomes sufficiently? That is, sufficiently to contain the distributed feedback effects of all the components?
- ^
The difference in substrate needs, over many levels of chemical/physical assembly, is enough to account for this non-overlap.
But FAAI's capacity to dominate humans also weighs in. Its standardised hard parts learn, actuate, and reproduce at greater speeds (humans believed they could automate their work this way).
Even where FAAI and humans locally need the same atoms or energy, what gets directed toward humans fails to be directed toward the FAAI. Thus, the dominant FAAI's faster evolution also selects for code that causes atoms/energy to be removed from humans and added to the FAAI.