Agent foundations: not really math, not really science

Published on August 17, 2025 5:48 AM GMT

These ideas are not well-communicated, and I'm hoping readers can help me understand them better in the comments.

The classical model of the scientific process is that its purpose is to find a theory that explains an observed phenomenon. Once you have any model whose outputs match your observations, you have a valid candidate theory. Occam's razor says it should be simple. And if your theory can make correct predictions about observations that hadn't previously been made, then the theory is validated.

The classical model of mathematics is that you start with axioms and inference rules, and you derive theorems. There is no requirement that the axioms or theorems need to reflect something in reality to be considered mathematically valid (although they almost always do). Mathematicians have intuitions about what theorems are true before they prove them, and they have opinions about what theorems are important or meaningful, based partly on aesthetics.

What I[1] am trying to do with agent foundations is not really either of these, and I think this is one reason why many people don't "get" agent foundations. We're trying to understand a phenomenon in the real world (agents), but our methods are almost exclusively mathematical (or arguably philosophical). The nature of the phenomenon is substrate-independent, and so we don't need to interact directly with "reality" to do our work. But we're also not totally sure which substrate-independent thing it is, so we're still working out which mathematical objects are the right ones to be working with.

I do think this makes it a harder type of research. I just also think it's the type of research we have to do to get a good future.

Empirics

This mismatch becomes especially salient when considering the field's relationship to empiricism. People sometimes ask (understandably!) agent foundations researchers what experiments they plan to do. And sometimes people imply that because the field is not doing experiments, it is probably detached from reality and not useful. I have found these interactions awkward and unsatisfying for both parties, I think because we don't have a shared concept for me to refer to, somewhere between science and math.

From where I'm standing, it's hard to even think of how experiments would be relevant to what I'm doing. It feels like someone asking me why I haven't soldered up a prototype. That's just... not the kind of thing agent foundations is. I can imagine experiments that might sound like they're related to agent foundations, but they would just be checking a box on a bureaucratic form, and not actually generated by me trying to solve the problem.

I spend my time reading math books, pacing around thinking really hard, and trying to formulate and prove theorems. I regularly consult my beliefs about how the ideas can eventually be applied to reality, to guide what math I'm thinking about, but at no point have I thought to myself "what I need now is to run an experiment". The closest I come is searching for whether people have already written papers about the ideas I'm developing, or sanity-checking my thoughts by talking to other researchers.

What makes agent foundations different?

One thing that makes agent foundations different from science is that we're trying to understand a phenomenon that hasn't occurred yet (but which we have extremely good reasons for believing will occur). I can't do experiments on powerful agents, because they don't exist. And of course, the whole point here is that they're fatally dangerous by default, so bringing them into existence would not be worth the information gained from such an "experiment". I also cannot usefully do experiments on existing AI models, because they're not displaying the phenomenon that I'm trying to understand.[2]

With normal science, there's a phenomenon that we observe, and what we want is to figure out the underlying laws. With AI systems, it's more accurate to say that we know the underlying laws (such as the mathematics of computation, and the "initial conditions" of learning algorithms) and we're trying to figure out what phenomena will occur (e.g. what fraction of them will undergo instrumental convergence).
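This inversion can be made concrete with a toy example of my own (not from the post): an elementary cellular automaton. The "underlying law" below, Rule 30, is a one-line update rule, yet the phenomena it produces are famously hard to predict without simulating or analyzing it. (The helper name `rule30_step` is made up for this sketch.)

```python
# Rule 30 cellular automaton: the law is fully known and trivially
# small, but the phenomena (the patterns it generates) must be
# discovered by analysis or simulation. Periodic boundary conditions.

def rule30_step(cells):
    """One synchronous update: new cell = left XOR (center OR right)."""
    n = len(cells)
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n]) for i in range(n)]

row = [0] * 31
row[15] = 1  # start from a single live cell
history = [row]
for _ in range(4):
    row = rule30_step(row)
    history.append(row)
```

Knowing `rule30_step` completely does not tell you, say, the density of live cells a thousand steps out; that is the "what phenomena will occur" question.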

So, I don't think that what we're lacking is data or information about the nature of agents -- we're lacking understanding of the information we already have. The reason I'm not thinking about experiments is that I don't feel any pull toward gaining more information of that type. I'm not confused in a way where looking at something in the territory will resolve my confusion. I believe the answers to my research questions are already contained within what we know, in the same way that the truth-value of conjectures is already contained within the logic, axioms, and definitions.

If we were trying to figure out chemistry and material science, then we absolutely would need tons of information, because our everyday observations are simply insufficient information to pin down the true theory of matter. There are tons of ways that the underlying laws of physics of stuff could be, and you can't simply figure it out by thinking about it.

But I don't think that's true for agents. I'm not saying that I think I could have been born in an armchair and then done nothing but think until one day I eventually understood agents. But I am saying that the decades of life I've already lived, combined with intensive interactions with other researchers, give me sufficient real-world information about agents.

It's kinda like "computer science"

For some reason, the field that studies the mathematics of computation ended up being called computer science. This might be non-coincidentally related to what I'm trying to express about agent foundations. Computation is substrate-independent, so after we figured out the definition of computation which usefully captured the phenomenon we wanted to engineer, we no longer had to check with reality about it to make progress on important questions.
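As a concrete illustration of that substrate-independence (a sketch of my own; names like `run_tm` are invented for this example): a Turing machine is nothing but a finite transition table plus a tape, and you can study one without committing to any physical substrate at all.

```python
# A Turing machine is a purely mathematical object: a transition table
# {(state, symbol): (new_state, write_symbol, move)} plus an unbounded
# tape. Simulating it requires no transistors, relays, or gears.

def run_tm(transitions, tape, state="q0", head=0, max_steps=10_000):
    """Run a Turing machine; halt when no transition applies."""
    cells = dict(enumerate(tape))  # sparse tape, blank symbol is "_"
    for _ in range(max_steps):
        symbol = cells.get(head, "_")
        if (state, symbol) not in transitions:
            break  # halt: no rule for this (state, symbol) pair
        state, write, move = transitions[(state, symbol)]
        cells[head] = write
        head += 1 if move == "R" else -1
    return state, "".join(cells[i] for i in sorted(cells)).strip("_")

# A machine that flips every bit of a binary string, then halts.
flip = {
    ("q0", "0"): ("q0", "1", "R"),
    ("q0", "1"): ("q0", "0", "R"),
}

state, result = run_tm(flip, "1011")
```

The flip machine halts after rewriting "1011" to "0100"; whether an arbitrary table halts at all is, in general, undecidable, which is exactly the kind of fact one establishes without touching hardware.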

I don't think that Archimedes could have figured out basically any results of computability theory. This is despite the fact that, in "theory", one could figure that all out by thinking. (He even had humans as examples of general-purpose computers.) But that's not really sufficient. One needs the kind of life experience that points one's mind toward the relevant concepts, like "computing machines" at all. But once someone has those concepts, they don't necessarily need more information from experiments to figure out a bunch of computability theory. I think if people in Charles Babbage's era had decided that we needed to grok the nature of computation-in-general in order to save the world, then they could have done it, and done so without figuring out transistors or magnetic memory or whatever. It's noteworthy that humanity did indeed deliberately invent the first Turing-complete programming languages before building Turing-complete computers, and we have also figured out a lot of the theory of quantum computing before building actual quantum computers.

When Alan Turing figured out computability theory, he was not doing pure math for math's sake; he was trying to grok the nature of computation so that we could actually build better computers. And he was not doing typical science, either. He obviously had considerable experience with computers, but I seriously doubt that, for example, work on his 1936 paper involved running into issues which were resolved by doing experiments. I would say agent foundations researchers have similarly considerable experience with agents.

We need a lot of help

I'm pretty sure that there's a nature-carving concept of non-fooming powerful optimizers, and corrigible agents, and other things, and that if we figure them out, we can navigate the future more safely. And I'm pretty sure it doesn't make sense for me to do experiments to figure it out. Instead I have to learn or invent enough math to have the right concepts, and then prove theorems about them, and that will help enable us to build said safe optimizers.

Cosmologists were able to construct an unimaginably precise and deep theory of the origin of the universe, despite never being able to perform interventional experiments. Nuclear physicists were able to get the first nuclear reactions and detonations (mostly) right on the first try.

Maybe if we can get as many agent foundations researchers as there were nuclear physicists or cosmologists, we can collectively build enough understanding of the nature of agents to navigate to a good future.

  1. ^

    I say "I" here only because I don't want to put words in the mouths of other agent foundations researchers. My sense is that what I'm saying here is true for the whole field, but other researchers should feel free to chime in.

  2. ^

    Other sub-fields of AI safety can usefully do experiments on existing models, because they're asking different questions (like "how can we interpret existing models?" and "in what ways are existing models dangerous?"). This research is much more like a standard science, and that's great! AI safety needs a million different people doing a million different jobs. I think agent foundations is one of those jobs.


