Modeling versus Implementation

This post examines two different approaches within agent foundations research: modeling and implementation. The author's view is that modeling aims to build an abstract model of superintelligent agents and make safety arguments on that basis, with AIXI as an example. Another camp of researchers, such as MIRI, instead leans toward building glass-box agents that can actually be executed; their goal is to translate theory into code. Each approach has strengths and weaknesses: modeling may struggle to withstand the optimization pressure applied by superintelligent agents, but it can predict why alignment plans will fail, while implementation faces the challenge of overtaking deep learning. The author stresses that when evaluating theories of intelligence, the distinct concerns of modeling and implementation should be kept explicit.

💡Agent foundations research contains two approaches: modeling and implementation. Modeling aims to build abstract models of superintelligent agents and use them to argue about safety; AIXI is one example.

💻Another camp of researchers, such as MIRI, works toward glass-box agents that can actually be executed, hoping to translate the theory into code and eventually implement it.

🤔Modeling may struggle to withstand the optimization pressure of superintelligent agents, but it is well suited to predicting why alignment plans fail. To argue that an alignment plan will succeed, the model must be robust to vast increases in intelligence.

🚀Implementation faces the challenge of surpassing deep learning. The author thinks inventing a new paradigm that takes the lead from deep learning is very hard, but worth working on.

Published on May 18, 2025 1:38 PM GMT

Epistemic status: I feel that naming this axis deconfuses me about agent foundations about as much as writing the rest of this sequence so far - so it is worth a post even though I have less to say about it. 

I think my goal in studying agent foundations is a little atypical. I am usually trying to build an abstract model of superintelligent agents and make safety claims based on that model.

For instance, AIXI models a very intelligent agent pursuing a reward signal, and allows us to conclude that it probably seizes control of the reward mechanism by default. This is nice because it makes our assumptions fairly explicit. AIXI has epistemic uncertainty but no computational bounds, which seems like a roughly appropriate model for agents much smarter than anything they need to interact with. AIXI is explicitly planning to maximize its discounted reward sum, which is different from standard RL (which trains on a reward signal, but later executes learned behaviors). We can see these things from the math.
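
For concreteness, here is (from memory) Hutter's standard finite-horizon statement of the AIXI action rule; the discounted-reward variant mentioned above has the same shape, with the reward sum weighted by discount factors:

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \max_{a_{k+1}} \sum_{o_{k+1} r_{k+1}} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

Here $U$ is a universal monotone Turing machine, $q$ ranges over environment programs consistent with the history, and $\ell(q)$ is program length. The expectimax planning and the $2^{-\ell(q)}$ Solomonoff-style prior sit right in the expression, which is what makes the assumptions explicit in the way described above.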

Reflective oracles are compelling to me because they seem like an appropriate model for agents at a similar level of intelligence mutually reasoning about each other, possibly including a single agent over time (in the absence of radical intelligence upgrades?). 
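
For readers who haven't seen them: roughly (following Fallenstein, Taylor, and Christiano), a reflective oracle is a possibly randomized oracle $O$ that probabilistic oracle machines can query about each other, including about machines that themselves call $O$, satisfying

$$O(M, p) = 1 \ \text{ if } \ \Pr\!\big[M^O() = 1\big] > p, \qquad O(M, p) = 0 \ \text{ if } \ \Pr\!\big[M^O() = 0\big] > 1 - p,$$

with $O$ free to randomize in the remaining boundary cases. Because every agent's reasoning routes through the same oracle, agents of comparable intelligence can consistently model one another without self-reference paradoxes.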

I'm willing to use these models where I expect them to bear weight, even if they are not "the true theory of agency." In fact (as is probably becoming clear over the course of this sequence) I am not sure that a true theory of agency applicable to all contexts exists. The problem is that agents have a nasty habit of figuring stuff out, and anything they figure out is (at least potentially) pulled into agent theory. Agent theory does not want to stay inside a little bubble in conceptual space; it wants to devour conceptual space. 

I notice a different attitude among many agent foundations researchers. As I understand it, MIRI intended to build principled glass-box agents based on Bayesian decision theory. Probably as a result, it seems that MIRI-adjacent researchers tend to explicitly plan on actually implementing their theory; they want it to be executable. Someday. After a lot of math has been done. This isn't to say that they currently write a lot of code - I am only discussing their theory of impact as I understand it. To be clear, this is not a criticism; it is fine for some people to focus on theory building with an eye towards implementation and others to focus on performing implementation.

For example, I believe @abramdemski really wants to implement a version of UDT and @Vanessa Kosoy really wants to implement an IBP agent. They are both working on a normative theory which they recognize is currently slightly idealized or incomplete, but I believe that their plan routes through developing that theory to the point that it can be translated into code. Another example is the program synthesis community in computational cognitive science (e.g. Josh Tenenbaum, Zenna Tavares). They are writing functional programs to compete with deep learning right now.  

For a criticism of this mindset, see my (previous in this sequence) discussion of why glass-box learners are not necessarily safer. Also, (relatedly) I suspect it will be rather hard to invent a nice paradigm that takes the lead from deep learning. However, I am glad people are working on it and I hope they succeed; and I don't mean that in an empty way. I dabble in this quest myself - I even have a computational cognitive science paper. 

I think that my post on what makes a theory of intelligence useful suffers from a failure to make explicit this dichotomy between modeling and implementation. I mostly had the modeling perspective in mind, but sometimes made claims about implementation. These are inherently different concerns.

The modeling perspective has its own problems. It is possible that agent theory is particularly unfriendly to abstract models - superintelligences apply a lot of optimization pressure, and pointing that optimization pressure in almost the right direction is not good enough. However, I am at least pretty comfortable using abstract models to predict why alignment plans won't work. To conclude that an alignment plan will work, you need to know that your abstract model is robust to vast increases in intelligence. That is why I like models similar to AIXI, which have already "taken the limit" of increasing intelligence - even if they (explicitly) leave out the initial conditions of intelligence-escalation trajectories. 


