Is theory good or bad for AI safety?

 


Published on January 19, 2025 10:32 AM GMT

We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard. (Kennedy’s famous “We choose to go to the moon” speech)

 

The ‘real’ mathematics of ‘real’ mathematicians, …, is almost wholly ‘useless’ (Hardy’s “A Mathematician’s Apology”)

 

If the "irrational" agent is outcompeting you on a systematic and predictable basis, then it is time to reconsider what you think is "rational". (Yudkowsky’s “Rationality is Systematized Winning”)

 

Shut up and calculate (Mermin, apparently)

I have been writing a long post about modeling theory in different sciences, specifically with a focus on elegance/pragmatism tradeoffs and how to reason about them in the context of safety. It has ballooned (as these things tend to do), and I'm probably going to write it in a few installments as a sequence.

But before going in, it's worth explaining why I think building better models and a better language here is crucial.

First, let's answer the question. Is theory good or bad?

If I were to summarise my position on this in one paragraph, it would be “it’s complicated”, with a Fiddler on the Roof-style sequence of ‘on the other hands’ to follow.

And so on.

When I talk to my team at PIBBSS and my friends in AI safety, we have interesting, nuanced debates. My teammates have written about related things here, here and here. But when I look around, what dominates the discourse seems to be very low-context discussions of “THEORY GOOD” or “THEORY BAD”. Millions of dollars in funding are distributed on the premise of barely nuanced versions of one or the other of these slogans, and I don’t like it.

On the one hand, this isn’t an easily fixable situation where someone can just come in and explain what the right takes are. Questions about theory in AI are hard to reason about for a number of reasons.

But on the other hand, the really awful state of the debate and the low "sanity waterline" in institutional thinking about theory and fundamental science are surprising to me. There is extremely low-hanging fruit that is not being picked. There are useful things to say and useful models to build. And when I look around, I don’t see nearly as much effort as I’d like going into doing this.

What we lack here is not so much a "textbook of all of science that everyone needs to read and understand deeply before even being allowed to participate in the debate". Rather, we lack good, commonly held models of how to reason about theory, and good terms to (try to) coordinate around and use in debates and decisions.

The AI safety community, having much cultural and linguistic overlap with the LessWrong community (e.g. I am writing this here), has a lot of the machinery for building good models. I really liked the essays by Yudkowsky on science and scientists, like this one. I also really like the linked initiatives by Elizabeth Van Nostrand and Simon DeDeo's group on trying to think more rigorously about path-dependence and attribution in the history of science (and getting my favorite kind of answer: it's complicated, but we can still kinda build better models).

I think there should be more work of this type. But at the same time, as I mentioned before, I think this community has a bit of an issue with reductionism. This biases the community to reduce the core concepts in building theory to something mathy and precise -- "abstraction is description length" or "elegance is consilience". While these constitute valuable formal models and intuition pumps, they do not capture the fact that abstraction and elegance are their own kind of thing, like the notion of positional thinking in chess -- they are not equivalent to formal models thereof. Now I'm not about to say that there is some zen enlightenment that you will only attain once you have purified yourself at the altar of graduate school. These notions can be modeled well, I think, without the lived experience, in the same way that a chess player can explain how she balances positional and tactical thinking to someone who does not have much experience in the game. A good baseline of concepts to coordinate around here is possible; it just hasn't (to the best of my knowledge) been built or internalized.

I want to point at Lauren's post here in particular as a physics perspective on the notion of "something being physical" as a valuable, non-reducible notion that can contribute to better conceptualization here.

In the next couple of posts in this sequence I am hoping to build up a little more of such a language. I'm aware that I'll probably be reinventing the wheel a lot, and what I'll be giving is a limited take. The hope is that this will start a conversation where more people, perhaps with better ways of operationalizing this, will start coordinating on filling this gap with a bit of a consensus vocabulary. 


