Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment

This article explores AI safety researcher Steven Byrnes's distinctive views on the AGI (artificial general intelligence) alignment problem. With a background in physics and mathematics, Byrnes has gone deep into neuroscience, especially the brain's learning mechanisms, in search of a technical solution to AGI safety. He argues that current AI safety approaches fall short and is skeptical of "social evolution" alignment approaches. Byrnes is cautious about near-term AGI but expects it within his lifetime, and he stresses that understanding how the human brain works is important for building safe AGI. The article also covers his rigorous investigation of cold fusion, his endorsement of the many-worlds interpretation, and his personal views on AI timelines.

🧠 **Neuroscience-driven AGI safety research**: Steven Byrnes treats his deep study of neuroscience, especially the learning algorithms of the cortex, as the key to the AGI alignment problem. He believes that understanding how the brain works, for example the theory of cortical uniformity, offers important clues for designing safe and beneficial AGI, a direction that differs from much of current AI safety research.

💡 **A critical look at current AI safety approaches**: Byrnes questions mainstream AI safety approaches, arguing that they may miss the core of the problem. In particular, compared with "social evolution" alignment approaches, he prefers to tackle technical alignment by understanding the underlying mechanisms, reflecting his emphasis on technical detail and theoretical foundations.

🚀 **A careful take on AGI timelines**: Although Byrnes thinks AGI is unlikely to arrive in the next few years, he is confident that it is very likely to arrive within his lifetime and that it will have enormous impact. Compared with more aggressive forecasts, he considers a pragmatic, careful stance on AGI's arrival to be necessary.

🔬 **Rigorous scientific investigation**: Using Byrnes's deep dive into cold fusion as an example, the article illustrates his rigor and independent thinking. He does not merely question mainstream views; he digs into the research himself, reaching his own conclusions through extensive reading and analysis, and that commitment carries over to his work on AGI safety.

🌐 **Endorsement of the many-worlds interpretation**: In quantum theory, Byrnes explicitly endorses the many-worlds interpretation, reflecting his willingness to engage seriously with complex scientific theories and competing explanations. This kind of cross-domain thinking may also help him generate original ideas in the frontier field of AGI safety.

Published on August 5, 2025 12:05 AM GMT

Dr. @Steven Byrnes is one of the few people who both understands why alignment is hard and is taking a serious technical shot at solving it. He's the author of several recently popular posts.

After his UC Berkeley physics PhD & Harvard postdoc, he became an AGI safety researcher at Astera. He's now deep in the neuroscience of reverse-engineering how human brains actually work, knowledge that could plausibly help us solve the technical AI alignment problem.

He has a whopping 90% P(Doom), but argues that LLMs will plateau before becoming truly dangerous, and the real threat will come from next-generation “brain-like AGI” based on actor-critic reinforcement learning.

We cover Steve's "two subsystems" model of the brain, why current AI safety approaches miss the mark, Steve's disagreements with "social evolution" alignment approaches, and why understanding human neuroscience matters for building aligned AGI.

Video

Podcast

Listen on Spotify, import the RSS feed, or search "Doom Debates" in your podcast player.


Transcript

Cold Open

Liron Shapira: Your priority is to solve technical AGI alignment, correct?

Steven Byrnes: Yeah. I would like there to be a plan, a technical plan such that somebody has some idea about what kind of code to write, where they write that code, and the AGI isn't trying to kill its programmers and kill its users and kill everybody else.


Introducing Steven Byrnes

Liron: Welcome to Doom Debates. Dr. Steven Byrnes is an artificial intelligence safety researcher at the Astera Institute. He has a BA in physics and math from Harvard. He did his PhD in physics at UC Berkeley.

He wrote his dissertation on solar cells, laser physics, and physical chemistry. He also did a physics postdoc at Harvard where he did research in thermodynamics and optics, and then from 2015 to 2021, he worked at a nonprofit applied physics research and development laboratory called Draper.

Then since 2021, he's been doing AGI safety research first independently and now at the Astera Institute. Now you're probably thinking this guy sounds like an average guy. This guy is in fact legit.

He was a winner of the USA Math Olympiad and the USA Physics Olympiad in high school. He won first place nationally in the Siemens Westinghouse competition with a math project. So it's always great when I have a guest who's bringing in that raw intellectual horsepower and achievement.

I'm blown away by his body of research because his write-ups, which I encourage you guys to check out, are incredibly brilliant and thorough, and I agree with virtually one hundred percent of his claims.

I invited him on Doom Debates to ask him the same question that I asked Jim Babcock a few weeks ago: What's the mainline doom scenario? We are gonna talk about Steve's model of what true AGI will look like, how soon it's coming, what his mainline doom scenario is, and how to actually solve AI alignment on a technical level.

Steve Byrnes, welcome to Doom Debates.

Steven: Thanks for having me.

Liron: Alright, so really a lot to get to. I've spent the last couple weeks really diving into the Steve Universe and overall I'm just so impressed. I feel like you are Leibniz to Yudkowsky's Newton - you guys are independently thinking along a lot of the same lines.

Each of you kind of reinforces the other, but it seems like you came at it from independent directions. Is that fair to say?

Steven: I think Yudkowsky has had good ideas and I try to internalize and copy the good ideas when I notice them. I don't claim that everything that I've written is something that I independently made up.

Liron: Fair enough. So I just want to dive a little bit more into what you see are your mental strengths. It does seem like you've got a combination of skills that I describe as mental superpowers.

Because not only are you very good at math, but then you combine it with independently doing research and synthesizing a bunch of concepts and writing a synthesis that I think makes a lot of sense. So what do you see as your unique mental superpower?

Steven: Maybe other people should decide for themselves what I'm good and bad at. I do think I have a personality that's well suited to my current job.

I hate hands-on work and I'm really bad at it, and I don't have to do that. I hate management and mentorship and I'm really bad at it, and I don't have to do that either.

I do like patiently arguing on technical points online for extended periods of time. I was doing that as a Wikipedia editor 20 years ago. I've been doing it ever since in lots of my free time.

So that seems to be something that I take to very well. And I sometimes listen to other independent researchers struggle with some aspects of the job that seem to come very easily to me, so I'm lucky in that respect.

Liron: That's right. So I saw you've been spending decades helping explain stuff clearly online. And a lot of the Wikipedia articles that are about physics have a lot of paragraphs contributed by you, right?

Steven: Yeah. I wrote a whole lot of Wikipedia physics articles. Of course, they've all been totally bastardized since I last edited them. I haven't really edited for the last 10 years.

Liron: Well, but you gave them a good start to run with.

Steven: Yep.

Liron: Well, I was impressed when I read your history of researching cold fusion. You write on your website: "Everyone knows there's no such thing as cold fusion, but I didn't want to take anyone's word for it. I wanted to check for myself."

"Thus, I made a hobby out of carefully reading the papers and studying the arguments of cold fusion proponents. After 29 blog posts over four years, I concluded that the mainstream consensus is correct. There is indeed no such thing as cold fusion."

"Meanwhile, the blog was a nice excuse to write lots of fun pedagogical posts about quantum optics, nuclear physics, relativistic quantum mechanics, Bose-Einstein condensation, statistical mechanics, and much more."

So when I read that, I'm like, okay, well there's definitely a lot of unique factors coming into play in order to be able to do an exercise like this. Because most of us would be like, okay, let me just Google if cold fusion real. Seems like probably not. Okay, I'm out.

That's like me, but you're like, oh, let me just learn a little bit about relativistic quantum mechanics and write 29 blog posts about that and related topics.

So for me, that's like, okay, well you've got the baseline math and logic talent. You're good at learning, right? Diving into this independent research - you want to go see for yourself why something is true instead of just learning from a textbook.

And then you've got the productivity aspect where you'll actually go write something up. You won't take forever to write a post. It seems like you've got high output going.

Is that a good breakdown of your skillset?

Steven: Yeah, that sounds about right.

Liron: The other thing that I'm seeing is prioritization. So it seems like you have good taste. You go into a field and you're not just gonna get bogged down in a random aspect of it. You're like, okay, well these seem like the big ideas. Let me focus on those.

And again, that's also something that I think Yudkowsky has. Do you notice that in yourself where you're like, oh yeah, this is what's important, and other people are wasting their time?

Steven: Ultimately we're gonna find out whether AI kills everybody or not. And from the pearly gates we can decide whether my work was productive or not.

So that's a little bit terrifying in that respect. I'm doing the best I can. I'd like to think that I'm working on what's important and making progress.

I'm sure a lot of, hopefully most other people would say the same thing about their own work. Yeah, time will tell.

Path to Neuroscience and AGI Safety

Liron: Fair enough. Okay. The other piece that I've noticed is even though your background is in physics and math, it seems like you've gotten really knowledgeable about neuroscience. So how did that happen?

Steven: Yeah, so after I finished my last hobby, which was this cold fusion blog that you mentioned, I was ready for a new hobby. This was like 2018, 2019, and I decided my next hobby was gonna be AGI safety.

I didn't really have high expectations for it. I figured I could at least leave comments on other people's blog posts, produce some pedagogy or something. And so I started getting into that.

I wrote a few blog posts on this newfangled thing called GPT-2, but eventually got interested in this thing where I heard a podcast where I think Jeff Hawkins went on, what was it, Lex Fridman or something, and he said:

I, Jeff Hawkins, understand the brain, and turns out that the secret sauce of human intelligence is the neocortex, which is this 70% or so of the brain that includes vision areas and motor control, reasoning, language - all these other things.

And he says, it turns out that the whole cortex is more or less a uniform learning algorithm. This is called Cortical Uniformity. It's this theory from the 1970s, and he said it's just this big randomly-initialized learning algorithm. (I'm not sure if he would've put it that way, but that's how I put it.)

And Jeff Hawkins said he's well on his way to understanding exactly how that learning algorithm works and that he's gonna invent AGI, and it's gonna be awesome.

So that was Jeff Hawkins' pitch.

Liron: When did you first hear Jeff Hawkins say that stuff? When did you get into him?

Steven: This must have been 2019, I think.

Liron: You know, that's interesting because I actually got into Jeff Hawkins myself in 2006 just listening to his TED Talk and then "On Intelligence" came out around then and I'm like, yeah, hell yeah. This guy's got it.

But unlike you, I'm just like, okay, well let me just go back to being a software engineer now, but you're like, oh, let me actually go deeper.

Steven: Yeah. Andrew Ng supposedly was inspired by that to go into deep learning. He saw the light, that large-scale learning algorithms could do a lot if they're scaled up.

So, yeah, the cortex has, what is it, 100 million cortical minicolumns, and they're all pretty similar. They all have, well, they generally have six-ish layers. Some of them have five and some of them have more than six. There's sublayers, whatever.

But anyway, it's this whole six-layered motif that's repeated over and over again. That's sort of vaguely like what eventually happened with large-scale learning algorithms in deep learning.

But of course, Jeff Hawkins would say, no, no, no. That's the wrong kind of learning algorithm. The brain is different from deep learning.

Liron: And is he correct?

Steven: I do think that Jeff Hawkins is correct that the brain - that the cortex learning algorithm is pretty different from deep learning. I don't think that Jeff Hawkins' particular ideas are all on the right track or that he's on the cusp of figuring it out.

But just the idea that the cortex is a big learning algorithm was very interesting to me because it did seem like nobody else in AGI safety at the time was asking the question of, okay, what if he's right, what would be the implications for safe and beneficial AGI? And also, is he right? Should we believe him?

So that got me interested in that question. It also leads to a bunch of related neuroscience questions. Namely, if the cortex is a big learning algorithm, how do you explain everything in developmental psychology, everything in evolutionary behavior - how does all that work?

And there's just lots of follow up questions that nobody seemed to be asking. What's the big picture? What's the bigger picture in which this cortical uniformity ingredient fits?

So I sort of got curious about that and started teaching myself neuroscience over the past five, six years from that starting point first in my free time and then when I got a full-time job, I was able to spend more time on it.

Liron: Cool. Okay. So now, so at this point I feel like, even though I can't say neuroscience is your credential, it certainly seems like you're as conversant in neuroscience as somebody with a bachelor's, master's. Is that fair to say?

Steven: Yeah, I think that's fair to say. I'm happy to take on anybody in a neuroscience quiz bowl.

Liron: Yeah, we gotta get you in the neuroscience Olympiad.

Steven: I have lots of idiosyncratic opinions about neuroscience. I would probably flunk the quiz bowl by giving answers that are wrong according to the textbooks, but correct according to me.

Liron: I know what you mean. I mean not, I don't have neuroscience opinions. It's not my field of study, but I know what it looks like when a field has everybody running in one direction where it's like, what the hell guys?

I've seen in a couple fields, you know, quantum theory. I mean, are you one of those people who are like, obviously the many worlds interpretation is correct?

Steven: Yeah, I am one of those people.

Liron: Yeah. Same here. Same here. When did you first encounter Eliezer Yudkowsky's sequences and when did you become AGI-pilled?

Steven: I heard about AGI now and then over the course of the 2010s. I think I read Superintelligence in 2016, a couple years after it came out.

But at the time I was like, this is an AI problem and I'm a physicist, so hopefully other people will deal with it because I'm doing my thing as a physicist. What do I know about AI?

But over time, I did actually get to learn a little bit of AI through my job. And I also did a little bit of brain computer interface and other work that's a little bit related.

Then when it was time to pick this new hobby in 2018 or 2019, I figured maybe I'll give it a go. So that would've been the time that I started reading seriously, like a lot more writing on that topic and reading the massive trove of work by Yudkowsky was one part of that.

Liron: You did actually go ahead and read through the whole sequences, right?

Steven: I've probably read virtually all of it by this point.

Liron: Did you then read Harry Potter and the Methods of Rationality?

Steven: I think I read that much later.

Liron: Did you then go and read his various BDSM, other types of fiction?

Steven: I have not read that.

Liron: I haven't actually gotten through that myself, so I can't say that I've read the full Yudkowsky corpus, but I have read 300% of the sequences in that I just keep reading them over and over again. I find them very insightful.

So that's good background about where you're coming from and how you got AGI-pilled. And when we say you're AGI-pilled, just to state the obvious, it just means that you agree that AGI is potentially in the near term, not even the long term in the near term, a crazy, huge deal. Like the biggest shakeup that the planet has ever seen, maybe ever.

Steven: Well, that depends on how near is near. There's a thing in the air that if you don't think AGI is gonna arrive before my afternoon coffee, then that's long timelines.

In this day and age, I don't expect AGI this year. I don't expect AGI next year. And I don't know what Dario Amodei is smoking if he thinks we'll have AIs doing Nobel-Prize-winning original research in 2027.

But if we're talking about what's gonna happen in my lifetime, then for sure, I think AGI is very likely and is a very, very, very big deal.

Liron: I think I agree with you, but I also think we should clarify when you say what is Dario smoking - Nobel research by end of 2027. Okay, but what probability do you give that? Obviously not zero, probably at least 1%, right?

Steven: I don't know, low enough that it doesn't factor into decisions about what I do on my day-to-day life.

Liron: So very roughly, like one to 10%?

Steven: I guess there's a way that people use probabilities in practice, which is there's the mainline, the thing that you're kind of centrally expecting, the maximum a posteriori scenario that you're expecting to live.

And then you have sort of abstract ideas of other things that might happen and ideally you take actions where those other branches, you don't want to be screwing over the world in those other branches. But you're not necessarily living in that scenario.

And then there's other things that you just view as so improbable that it's okay to do things that are counterproductive if those things are true.

I'm happy to report that I don't think I'm screwing anybody over in the hypothetical world where Dario Amodei is right about AGI, about superintelligence, arriving in the next couple of years.

Because of that, I don't feel like I need to think too hard about exactly what probability I assign to it, other than pretty low.

Research Direction and Brain-like AGI

Liron: Alright, let's talk about your research direction. So the 10,000 foot view, very high level, it seems like your priority is to solve technical AGI alignment, correct?

Steven: Yeah. I would like there to be a plan, a technical plan such that somebody has some idea about what kind of code to write, where they write that code, and the AGI isn't trying to kill its programmers and kill its users and kill everybody else.

Liron: It's rare to find a guy who actually understands the technical issues and is actually saying, okay, yeah, I understand this, but I'm going to try to solve it.

That was basically what MIRI was trying to say before, and now they've kind of given up. I mean, they still have some researchers doing research, but they're basically like, yep, we decided the problem's intractable.

Whereas you're here being like, well, I understand MIRI's ideas. I understand why it's difficult. I think most of the ideas are true, but I'm here trying to solve it.

Steven: Yep. I'm trying and I don't claim to have a viable plan as of right now, so time will tell whether I'm over optimistic or not.

Liron: Do you agree with my assessment that when you look around the landscape, you see other people like, oh, they work for AI safety as part of one of these AI companies, but they're not on the same page about all the Yudkowskian problems, right?

So they're not even seeing important obstacles in the field that they think that they're navigating, and then you do, but you're still trying to solve it.

Steven: Yeah, I think I agree with that. There's a very common position these days that involves sort of extrapolating alignment of current large language models to alignment of future powerful AGI.

And I think they're gonna be sufficiently disanalogous that that research program is gonna wind up - people are gonna be unpleasantly surprised about how future AI turns out.

I think I'm a little more sympathetic to “it's an understandable mistake” kind of thing, but I do think they're wrong. So I guess I agree on that level.

Liron: Yeah, I know what you're getting at. So for the viewers, I think what's happening is you've got a lot of these people at the AI companies, and I think Sam Altman pretty much said it.

He's like, yeah, all these people who are AI doomers, 10 years ago they gave us all these warnings, and then we built LLMs and we got to a future that they couldn't even imagine - computers talking to us without killing us. The doomers never imagined this, so you gotta just stop listening to the doomers.

That's essentially the perspective of a lot of these people at these AI companies. Dismiss the doomers now because LLMs are nothing like the doomers ever imagined.

And then you and I are on the same page of like, okay, well LLMs are very interesting, but they're ultimately a pit stop on the road to the same doomy future that the doomers are warning about.

Steven: Yeah, that's my take. We don't have the really powerful kind of AI yet, so the jury's out on what it will be like and whether it will be as omnicidal as I and the other doomers have been saying.

Liron: And we're gonna dive into that because you have a lot of interesting ideas about why that's going to be the case, why LLMs aren't the same thing as the super intelligent AI that's coming.

Before we dive into that, just going back to your research program and your goal to solve technical alignment, I want to kind of spoil the ending and tell people what you've landed on on your research direction, which is basically, quoting your own words, reverse engineering human social instincts on a neuroscience brain algorithm level. Correct?

Steven: Yeah. I think we should think of the human brain as doing some kind of model-based reinforcement learning.

And it has a reward function that says pain is bad and that eating when you're hungry is good, and so on. These are sometimes called innate drives, or primary rewards, primary reinforcers, primary punishers, depending on what field you're in.

And I think these are basically some innate set of drives that makes humans like to do some things and dislike other things.

And a subset of that is human social instincts that make us want to do what the cool kids are doing in school, that makes us want to feel compassion in some cases, spite in other cases, and bloodlust in other cases. So we have this whole suite of social instincts, for better and for worse.

And I think that when we make powerful AI, my central expectation is that it's gonna be a more similar architecture to how the human brain works. So by the same token, there will be a reward function and the programmers get to put whatever they want into that reward function.

So if we want the AI to feel compassion, then it would be a good jumping off point to understand how the human brain builds compassion.
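To make that "reward function with innate drives" framing concrete, here is a minimal illustrative sketch. The drive names and weights are invented for illustration, not taken from Byrnes's work; figuring out what the real social-instinct terms look like is exactly the open research problem he describes.

```python
# Illustrative only: a hand-written reward function built from "innate drives",
# the kind of thing a steering subsystem (or an AGI designer) might supply to a
# generic model-based RL learner. All names and weights here are made up.

def innate_reward(state: dict) -> float:
    reward = 0.0
    reward -= 10.0 * state.get("pain", 0.0)                # pain is bad
    reward += 5.0 * state.get("eating_while_hungry", 0.0)  # eating when hungry is good
    reward += 2.0 * state.get("compassion_trigger", 0.0)   # crude stand-in for a social instinct
    return reward

# Example: a state with mild pain but an active compassion trigger.
print(innate_reward({"pain": 0.2, "compassion_trigger": 1.0}))  # 0.0
```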

The Two Brain Subsystems

Liron: Let's talk about some of the core discoveries that you think - maybe they're not fully consensus yet, but they make a lot of sense. We're gonna talk about the core discoveries that you've made on your way to solving this alignment problem.

Starting with this idea of the major two brain subsystems: steering brain versus learning from scratch brain. Explain that.

Steven: Okay. Yeah. Well, I wouldn't call it a discovery so much as a synthesis of things that have already been said.

But yeah, I think that it's helpful to sort of divide the brain into two pieces. And one piece is doing large-scale within-lifetime learning algorithms, and the other piece is doing - I call it the steering subsystem, and it's full of what I call business logic.

So business logic is this cool term from software engineering. So for example, if you're making tax software, it might say something like: if you have more than 200 employees, then fill out form 529X and if you are registered in Texas, then attach form something or other.

So all that is things that you might want to write the code for, but the code that you're writing is directly related to the functional specification that you're trying to meet.

So if I'm evolution and I'm building a brain, I likewise need to have a lot of business logic. If the person is fertile, then their sex drive should go up. If the person is on the brink of starvation, then their pain tolerance should go up and they should feel a hunger signal and they should stop doing non-shivering thermogenesis processes that burn energy to ward off hypothermia and this and that and the other thing.

There's just a bunch of probably hundreds or thousands of effectively lines of code that evolution needs to build into the brain.

And I think that if you look at the hypothalamus and you look at the brainstem and a couple other parts of the brain, they're just full of these kinds of business logic.

Whereas if you look at the cortex, if you look at the cerebellum, the striatum, then what you see is these large scaled-up learning algorithms where I think the right way to think about them is that they start out randomly initialized, in effect. And then there's loss functions and there's hyperparameters and there's learning rules.

And over the course of a lifetime, they start out totally useless when they're first formed, but by the time you're an adult, your cortex is doing all kinds of useful things for you, and all the flexible learned behavior is coming from those systems.
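A toy sketch of the contrast Byrnes is drawing, with the caveat that every rule, threshold, and number below is invented for illustration rather than taken from neuroscience:

```python
import numpy as np

# "Business logic", hypothalamus/brainstem style: hand-written condition/action
# rules that evolution effectively specifies directly. (Invented example rules.)
def steering_subsystem(body_state: dict) -> dict:
    signals = {}
    if body_state.get("energy_reserves", 1.0) < 0.1:   # near starvation
        signals["hunger_drive"] = 1.0
        signals["pain_tolerance"] = "raised"
        signals["thermogenesis"] = "suppressed"        # stop burning energy to stay warm
    return signals

# "Learning from scratch", cortex/striatum/cerebellum style: a big randomly
# initialized parameter block that is useless at first and only becomes useful
# through loss functions, learning rules, and a lifetime of experience.
rng = np.random.default_rng(0)
learned_weights = rng.normal(size=(1_000, 1_000))      # random init: no innate knowledge
```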

Liron: My position as somebody who hasn't really studied neuroscience but will occasionally read a popular book, and obviously I've read your stuff, I think you're probably right. It's feeling right to me more likely than not. For what that's worth.

So let me review my understanding for the viewers basically recapping what you just said. So you've got the two brains, you've got the steering brain and the learning from scratch brain.

The steering brain has the brainstem and the hypothalamus primarily, right? 'Cause the brain has all these regions. And you're saying those two regions, which I think are also the closest to the spinal cord, right? Like the most primitive regions. Is that right?

Steven: So I dispute the term “primitive”. If you go back 500 million years to the first vertebrate, or our common ancestor with worms, they already had a steering subsystem and a learning subsystem, just as we do, but they were both much simpler than they are for us.

It's not like the human hypothalamus stopped evolving. We have different steering subsystems from chimps in many important ways. The things that make a chimp flinch are different than what make us flinch and so on.

Liron: Yeah. I think it makes intuitive sense that things that are developmentally earlier have more steering control, and maybe you can trace it all the way back to the genes and the DNA. Right. Those are the ultimate powerhouse, they're the ones who natural selection acts on and then they build bodies around them.

And then the later - the parts of the body that got built later, right? Like the brain, the cortex is evolutionarily new. And so it makes intuitive sense that, okay, the cortex came around in order to be the servant of the brainstem.

Steven: They’re like two gears in a car. I wouldn't say that either is the servant of the other.

Tiny worm-like creatures digging in the sand of the early ocean still need to learn to navigate their environment. They still need memory. They still need reinforcement learning. So they still need this within-lifetime learning, just because not everything about your situation can be specified by the genome in terms of reflexes.

Reflexes are good. It's almost always a good bet to flinch if you see a dark rapidly-expanding blob in your field of view.

But the worm can learn that there's a predator over here and there are no predators over there. And that can be different from where the predators were 20 years ago with its great grandparents. Those things where you're learning within your lifetime have always been important.

And we should think of those as quite primitive too. So again, the human cortex is a direct descendant of the vertebrate pallium. And probably even related to things like the mushroom body in fruit flies.

Liron: Okay. I guess I had this naive mental model where it's like, okay, the brainstem's first and then it grows the cortex out of it. But you're saying no, the cortex had an analog too that was also primitive.

Steven: Yeah. The cortex goes way, way, way back. They're all growing at the same time, out of the embryonic neural plate.

Liron: So zooming out again, so you got the steering brain, which is the brainstem and hypothalamus you claim, right? Not a hundred percent proven, but it seems like a good model.

Steven: And a couple other odds and ends. Yeah.

Liron: Exactly. And that's only 10% by volume, which to me makes intuitive sense because if you look at the amount of business logic in a program, a lot of times plumbing and libraries and sensors take up the majority of the program. And then the steering part is relatively few lines of code, right?

Steven: Yeah. So in humans in particular, the learning subsystem is more than 90%, I think well over 90% of the volume of the brain.

And I think of it as a little bit like GPT - there's a lot of code that people write, but they're not writing 10 times more code when they make the number of parameters 10 times bigger when they go from GPT-N to GPT-N+1.

There's just so many learned parameters that it swamps everything else. So by the same token, when we evolved from chimps, our cortex got three times bigger or something, and our striatum likewise got three times bigger.

So that's scaling up the learning algorithm and that allows humans to learn more within their lifetime and learn more rich and complicated models of the world.

But that can happen with much less evolutionary design than building three times more reflexes in the brainstem.
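The "scale the learning algorithm, not the code" point can be made concrete with a small sketch: the code defining a model stays the same length while the learned parameter count grows by orders of magnitude. The widths below are arbitrary, and this is not meant as a model of the cortex.

```python
import torch.nn as nn

def make_mlp(width: int, depth: int) -> nn.Sequential:
    """Same few lines of 'code' regardless of how big the learned model is."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.ReLU()]
    return nn.Sequential(*layers)

small = make_mlp(width=256, depth=4)
large = make_mlp(width=2560, depth=4)   # roughly 100x the learned parameters, identical code

print(sum(p.numel() for p in small.parameters()))
print(sum(p.numel() for p in large.parameters()))
```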

Liron: In your model of the world, your model of the brain, the steering brain, the business logic that it's trying to use to steer is actually traceable to genetic information and gene level learning across generations, which is this slow type of learning that really can't accumulate that many bits compared to the cortex where it's like the genes can say, okay, yeah, I want this twice as big.

And then it uses the twice as big cortex within a single lifetime to just populate it with a bunch of these dynamic learnings.

Steven: Yeah, exactly. Just like scaling up deep learning models makes them learn cooler and better things.

Liron: Exactly. Okay, so we talked about the steering brain and then the learning from scratch brain. It's the biggest region, I guess it's 90% of the brain by volume, correct?

Steven: More than 90%. I don't know the exact numbers. It depends on how you count.

Liron: The major subsections of the learning from scratch brain, according to you, would be the cerebrum, obviously, right? That's the most modern, biggest part by volume. But you also include the cerebellum, which might be an untraditional choice, right?

Because from the little bit I know about the cerebellum, I thought it was all about coordinating muscle movements.

Steven: Yep. But if you zoom into the cerebellum, what you find is sort of the same motif repeated over and over. This thing with Purkinje cells and climbing fibers.

And you find some ridiculously high number of granule cells - I think half the cells in the brain or something.

And I think the right way to think about it is that it's a big learning algorithm. It's not a particularly important one for AGI, but it is - and we can talk about exactly what I think it does, but I do think the right way to think about the cerebellum is as a big learning algorithm.

Liron: I feel like the cerebellum might be an LLM-like learning algorithm, right? Because if it's just trying to learn exactly how to coordinate muscle movements, isn't that the kind of thing that the LLM paradigm seems good at?

Steven: I don't think you should believe everything you read about the cerebellum. So the cerebellum is involved in motor control, but it's also involved in emotions and it's also involved in thinking.

And I think the right way to think about the cerebellum is that it is just this big expensive workaround for the fact that the brain is very slow.

So everything in the brain has ridiculously high latency because the signals are being carried around on these axons and not electronic signals on wires like we're used to with computer chips.

If you have all these crazy latencies all over the place, one thing you can do is to set a learning algorithm that just looks at some signal and says: the goal of this learning algorithm, the loss function, is to predict what this signal is gonna be doing in, let's say, 0.2 seconds.

And so you have a very clear loss function. Just wait 0.2 seconds and you'll see whether your guess was right or wrong.

And basically what you're doing is time-traveling that signal into the future by 0.2 seconds, at the cost of some noise. And that's often a good trade off.
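Here is a minimal sketch of that "predict the signal 0.2 seconds ahead" setup, assuming a made-up 100 Hz signal and a simple linear predictor; nothing about the cerebellum's actual circuitry is being modeled.

```python
import numpy as np

dt = 0.01                        # assume 100 Hz sampling
lag_steps = int(0.2 / dt)        # 0.2-second lookahead
window = 30                      # length of recent history used as input

t = np.arange(0, 10, dt)
signal = np.sin(2 * np.pi * 0.5 * t) + 0.05 * np.random.randn(len(t))

# Supervised pairs: input = recent history, target = the same signal 0.2 s later.
X = np.stack([signal[i - window:i] for i in range(window, len(signal) - lag_steps)])
y = signal[window + lag_steps:]

# Least-squares fit; the prediction is the signal "time-travelled" 0.2 s ahead,
# at the cost of some noise. Ground truth arrives 0.2 s later, giving the loss.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
prediction = X @ w
```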

Liron: Okay, this is clicking for me because first of all, you're saying the training, the reward function is just wait 0.2 seconds. Isn't that a lot like when you're training an LLM using unsupervised learning and then its signal can just be the next token?

Steven: Yeah. You could think of it as a kind of predictive learning, just as LLMs are doing predictive learning.

Liron: Okay. Well, I gotta believe this is true because cerebellum, the last three letters are almost LLM, so I really want to think of the cerebellum as the LLM part of the brain.

Steven: Yeah. I love it.

Liron: Kidding aside, it is - I am seeing a lot of connections here because you and I both agree that the brain is probably architecturally more than an LLM, but it also seems to me like there's this big turbo supercharger to the LLM.

Maybe a good analogy is like, think about rendering graphics. It's probably useful to have actual physics simulations, like just in the field of computer graphics. We probably don't want to throw away actual ray tracing, actual physics simulations.

But it seems like maybe the physics simulator maybe can just render one frame per second and then the LLM layer can hallucinate the next second. That seems to me like what's going on with the human brain, right?

Where we've got the parts that are smarter and use a different architecture, but then we also interpolate the hell out of it using the cerebellum or whatever.

Steven: Well, an important thing to know about the cerebellum is that there's sometimes people born without a cerebellum and they turn out okay.

They walk a little funny, I think. Maybe they're not the next Einstein, but they can hold down a job and live independently.

The cerebellum in the context of a human isn't - I really don't think that we need to be thinking about it. I don't think it's doing anything important.

I do think the cortex also does predictive learning, and I think that predictive learning in general is a very useful thing to do, and that's why the AI people invented it, or I guess reinvented it, and that's why evolution also uses it all over the place.

Liron: Do you know what's the problem with the people who were born without the cerebellum?

Steven: You would think of it as, you need a lot of online feedback in motor control. So every fraction of a second, you're noticing that your foot is just a little bit to the left of where it ought to be, or the floor is just a little bit higher, so you need to dynamically adjust.

And if there's this giant lag between - you have to send a signal all the way down to your foot, and meanwhile, your proprioceptive signals take a while to get all the way back up to your brain.

You're just gonna have a heck of a time doing that kind of motor control adjustment with such big time lags.

Liron: Okay. So I think where this is going is I personally feel kind of convinced that what we've done with deep learning and with LLMs, I think that we've put our finger on a big thing that the brain is doing.

And I know you think, which I agree with, that it's not all of what the brain is doing, but do you agree with me that it's a big part of what the brain is doing?

Steven: I'm just gonna give the weaselly answer saying that LLMs are like brains in some respects, and unlike them in other respects. And I agree that the predictive learning aspect of LLMs is also something that brains do.

Liron: Right? Specifically it seems like the cerebellum, right? I'm obsessed with this concept, but it seems like a pretty close analogy to me.

Steven: Again, the cortex also has predictive learning.

Liron: Okay. Fair enough. And the other reason why I am thinking that LLMs really are the real secret sauce, not all of it, but I think that we have actually nailed half or whatever, some fraction of the actual secret sauce that the brain uses.

Just because this idea of these high dimensional vectors, it just makes so much sense. It's just so easy that you can just have a high dimensional vector and that will just encode the semantic meaning of something in a subtle way.

And the relationship to other vectors - just linear operations. A lot of the true meaning of words is just like their coordinates in this high dimensional space. It just seems like that's gotta be out of God's book. Why wouldn't the brain do that if it's so powerful and so simple?

Steven: Yeah. Maybe the brain does do that.
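The "meaning as coordinates" intuition can be illustrated with a toy example. The three-dimensional vectors below are invented and chosen so the analogy works out; real models use hundreds or thousands of dimensions learned from data, but the same linear operations apply.

```python
import numpy as np

# Made-up toy embeddings for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "man":   np.array([0.1, 0.2, 0.1]),
    "woman": np.array([0.1, 0.2, 0.9]),
}

v = emb["king"] - emb["man"] + emb["woman"]   # a purely linear operation
closest = max(emb, key=lambda w: v @ emb[w] / (np.linalg.norm(v) * np.linalg.norm(emb[w])))
print(closest)   # "queen", with these toy vectors
```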

Liron: All right. I guess that's as far as you go. That's fine.

All right. So just to summarize again, so the cerebellum was really good at prediction, but you don't really need it. So if you want to have 150 IQ, it's probably not because your cerebellum is contributing that much. It's probably more your cerebrum and your cerebrum is 80% of the brain's volume.

So when you think about a brain visually, I guess you're mostly thinking about the outside of the brain with all of those folds. And you're thinking about the outer layers, which are like the folded cerebrum, correct?

Steven: Yeah, so the cerebrum, I think, is a little bit broader than the cortex. The cortex includes the hippocampus, which is three layers, and it includes the neocortex, which is mostly five or six layers.

And it's the wrinkly thing. If you unfolded it, it would be about the size of a dinner napkin. It's really contorted to get it to fit in the human skull, but it's this sort of thin sheet.

And then there's all this white matter, which is how the cortex connects to itself and to other parts of the brain.

Liron: And just help deconfuse me here. So cerebrum versus cortex versus neocortex. One more time.

Steven: Cerebrum is a broader term that also includes some other things outside of the cortex. I think it also includes the striatum, which I think is a separate learning-from-scratch system.

And then it also, I think includes the pallidum, which I think actually should be lumped in with the hypothalamus and brainstem as part of the repository of yet more business logic.

The divisions come down to what the different types of cells are and how they look structurally - and when you start looking at that, you find that any given part of the cortex looks pretty similar to the rest of the cortex. (Leaving aside the allocortex versus isocortex and other finer distinctions.)

And the whole striatum also kind of looks uniform. The striatum includes the caudate, the putamen, the nucleus accumbens, lateral septum, this and that. But it all includes sort of the same type of neurons connected in kind of the same way, with some interesting exceptions that we can get into.

So I think the cortex is a big learning algorithm. And then the striatum is a different, somewhat less big learning algorithm that's interconnected with the first one.

Liron: And you said cortex is neocortex plus striatum or what's neocortex again?

Steven: Neocortex is a term that means the new part, the part that was new in mammals. So basically if you go back to our common ancestor with lizards, the early amniotes had, it's believed, 100% three-layer cortex, which is called allocortex.

So we still have some three-layer cortex, including the hippocampus, which is famous for navigation and memory, and the piriform cortex, which is involved in smell.

But then we also got this big part of our cortex, most of it, 90 plus percent probably, that is six layers. So presumably it does something more complicated. I still think it's a big learning algorithm, but it's a more complicated one.

And it's called neocortex because it evolved in mammals.

Liron: Reminds me of the Gillette razors, right? You just have more layers in your cortex, more blades on your razor, you can learn better.

Steven: Yeah. And the human visual cortex has sublayers for some of the layers. So if you actually count, you'd get up to nine, 10 layers or something.

Liron: Amazing. Okay. All right. Now talking about business logic, you wrote in one of your posts that understanding this tangle of business logic, meaning in the steering brain, meaning the hypothalamus and brainstem, understanding this tangle of business logic is annoying but possible.

And then you did an exercise, you challenge yourself to go and understand laughter because you think that laughter is totally business logic.

Steven: That's right. I think I have this post "A Theory of Laughter" where I propose a certain connection, a hypothesized cell group in the hypothalamus. And what would be its inputs and what would be its outputs.

And I argue that if the cell group exists, then it would explain everything about our everyday experience of laughter. It would be evolutionarily plausible. It would agree with how we observe laughter in chimpanzees and in mice.

(Yeah, if you tickle a mouse, it actually laughs. You need ultrasonic equipment to hear it, but people do tickle mice.)

And it's neuroscientifically plausible.

I don't think anyone has measured this particular hypothalamus cell group if it exists, but it's an experiment that somebody could do. I hope that somebody does it at some point. It would be pretty cool.

Liron: Right. And it's a special case of what you're trying to do, which is basically reverse engineer all this business logic. Because basically what you're saying is like, yeah, there's a lot of parameters. It's not going to be one line of code, it's not gonna be a simple instruction.

Maybe it'll have thousands of instructions, but somebody should just try to read out just all of this low level assembly language, business logic.

Steven: Yeah. I think that would be a fruitful and helpful thing. I hope people work on it.

So there's a lot of people who do this sort of artisanally, where they are measuring one cell group at a time. I guess you would call it “tracer studies” in this case. So they measure what are the inputs to the cell group, what are the outputs, and they measure what genes are expressed.

They often do optogenetic things where they turn the cell group on and off and see how it correlates with behavior.

And those are really gold standard. You can learn a whole lot from those studies. I would take one study like that over a hundred fMRI studies, in terms of learning things about the brain.

And meanwhile, there's also groups that are trying to do connectomics at a large scale. You're not doing probing each group one at a time and correlating it with behavior, but rather you're taking some entire brain and trying to measure how everything connects to everything in this big giant data set.

So there's a number of groups, especially E11 is one of them, that are doing connectomics and it's this exciting area.

I very much wish them luck. Whenever I talk to machine learning people, I tell them that there's cool connectomics jobs, because you have to do all this really tricky image reconstruction stuff.

Language Acquisition and Learning

Liron: Let's talk about language acquisition. I think if you go back to the 1970s and 80s, Noam Chomsky was famous for basically saying, look, language is innate business logic, right?

Things like it has a structured grammar that's just hardwired in your genes and you just have this brain that just already from birth knows what to look for in grammar.

But I also think today it seems like there's plenty of online learning happening too. I mean, certainly we all - the dictionary is obviously learned, right? So what's your view on language acquisition?

Steven: Yeah, I think that if you look at the language areas of the neocortex, they look pretty similar to all the other areas of the neocortex.

I don't think there's - I do think there's a little bit of business logic involved in, in particular, the part that says that if somebody - if you hear a human voice, then you should pay attention to it. Also the part that says you should try talking.

Without those, people might not be motivated to acquire language. So motivation kind of has to come from - there's no God telling you what's interesting and what's uninteresting.

If people are motivated to count the pebbles on the sidewalk, then they could do all that all day and learn a lot about pebbles. But humans, especially human babies, tend to pay attention to other people. What are people saying? What are people doing?

And I think that is related to motivation, and that has to come from the hypothalamus and brainstem. And I have ideas about exactly how that works.

Liron: If you look at the LLMs that can talk to us now, right? It's like the Richard Sutton bitter lesson came into play big time, right? The fact that these just trained on the next word and they didn't explicitly train on the rules of grammar, although technically, I guess they did go and read Strunk and White or whatever as part of their training.

But I think just by reading the next - it doesn't seem like the Strunk and White or the grammar textbooks are how they learn grammar. It seems like they really just learned it using a general learning algorithm.

So that is a pretty big signal that Chomsky was kind of wrong about the amount of hardcoded business logic - it's just, it's more online learning.

And if you think about the evolutionary explanation, I think it stands to reason that the genes are trying to be efficient and our genes are trying to offload as much as possible into the generic learning algorithm and only hard code the smallest possible bits, just because it's hard for natural selection to keep up the integrity of these bits.

Is that right?

Steven: Once we start talking about learning algorithms, I would shift the discussion from business logic to inductive biases, and I do think there's some evidence that the inductive bias of the cortex is different from the inductive bias of LLMs.

For example, if I understand correctly, babies will make language mistakes that LLMs-in-training do not make, like overgeneralizing grammar rules and stuff like that. I only heard this secondhand, so don't quote me on that.

But clearly the success of large language models at doing language is evidence that it is at least possible for a large-scale learning algorithm to acquire language.

And I think that's some indirect evidence that likewise, the cortex could be a big learning algorithm that acquires language amongst all the other things that it's able to do, like motor control reasoning and so on.

Some people say language is special because it has recursion or something, but if you look at a picture in a picture, that's vision, and it has recursion. If you talk about making a plan that has a sub-plan, that's planning, and that has recursion too.

Music can have recursion, all these things can have recursion. I think that's not something special about language. It's just something that's in the space of things that this learning algorithm is capable of doing.

Liron: Which is pretty incredible, right? I mean, it certainly wasn't my intuition. If you'd asked me 10 years ago, do you think this kind of deep learning algorithm can handle a lot of recursion? I'd be like, I don't know, man. I feel like recursion, you really need to program in the structure.

But as you say, right, it can chomp through a big text that has a lot of recursive grammatical structures in it, or self-referential context and it's not really a problem. As long as you don't go a hundred recursive levels deep, which nobody does, it just seems fine.

I mean, isn't that a little surprising that it can handle recursion?

Steven: Yeah. Humans can't go a hundred recursive levels deep either, right?

Liron: Exactly. Exactly.

Steven: Yeah, LLMs are very impressive. They're definitely able to follow recursive structures.

LLM Limitations

Liron: On this show, it's a common question that I ask my guests like, okay, you think that AGI isn't coming soon because you think LLMs are limited, but what's actually the barrier?

And I get such a wide range of answers. Some people are like, well, LLMs don't have agency. They're all ultimately controlled by humans. I'm like, I don't know man, if you put them in a loop, they seem pretty agentic to me. They just keep deciding the next thing to do. So that's like a weak example.

And I just did a reaction episode to an interview with Amjad Masad and Amjad Masad saying, well, they don't think about novel concepts. They can only remix. And I'm like, well, if you just keep remixing and then you remix your remixes, can't you make novelty that way?

So I'm never convinced by these answers, but you are at the forefront of convincing me of what the limitation of an LLM might be. Because also, I had Dr. Andrew Critch on the show, and he was very explicitly on the stance of like, there's no barrier. You just keep scaling up LLMs, everything is just going to fall. Maybe there'll be small barriers, but I don't see a big barrier.

But it does seem like you see a barrier because when you look at the human brain, you can classify parts as being LLM-like parts. My interpretation is the cerebellum is very LLM-like.

But in your mind there's some parts of the human brain that are doing some other smart thing, which is very powerful, which LLMs are just never going to get to, right? Like you see some sort of barrier.

Steven: Yeah, I should start with a caveat that I'm usually not very enthusiastic about the enterprise of arguing people out of their belief that LLMs will scale to AGI.

And if I'm talking to an alignment researcher who's preparing for the contingency that maybe LLMs will scale to AGI, then I'm like, okay, that seems like a kind of reasonable thing to do.

I mean, I'm not so confident that I don't want to - I want to grow this field of non-LLM-AGI safety, but I don't want to do that by parasitizing from the already pathetically small field of LLM-AGI safety. I'd rather bring new people into the field.

But you asked the question and yeah, I do think that LLMs will stall out before they get terribly powerful and dangerous. I mean, they'll be dangerous in some sort of more minor ways, but not dangerous in the “robot army wiping out humanity” kind of way. Or inventing plagues and so on.

LLMs are already on the verge of mass spear-phishing, which would be bad. LLMs are already on the verge of virtual friends that are social media meets fentanyl, but way worse - all these things are bad enough.

Without the robot armies and plagues wiping out humanity. But if we're just talking about the actual taking over the world and wiping out humanity, then yeah. I don't think LLMs are gonna do that. Not now, not ever. Not with reinforcement learning from verifiable rewards, not with scaffolding.

I do think that there's missing some stuff.

Liron: Well, that's certainly good news and that certainly advises why you're not super worried about AI takeover in the next two years. Because you think there is another paradigm shift coming and of course it might come in five years.

It sounds like you wouldn't be that surprised if it comes in five or 10 years.

Steven: Yeah, it's hard to know. I certainly wouldn't rule out as soon as five years, just because a lot can happen in five years.

I like to remind people that five years got us from “LLMs did not exist at all” to 2023, where there was already billions of dollars going into them. Tens of thousands of hours had been poured into LLMs between 2018 when they didn't exist at all, and 2023. So that's five years, and a lot can happen in five years.

Liron: Right. As one example, five years is a time where back five years ago, everybody's like, well, Google stopped working. I used to be able to type these long queries and get these useful resources, and now it's just a page with spam. Google doesn't work anymore.

Whereas today, if you ask me what do I think of web search? Sure, maybe I don't raw dog Google. I don't type my query into Google, but when I type my query into something like GPT-4o, or to be fair, Google's AI mode, I type my query into something like that.

And I get a better quality search than I ever dreamed of because what happens is it spends the next minute running a bunch of adaptive search queries and reasoning and reading the webpages for me and synthesizing everything. I've never dreamed that search could work that well.

So we went from Google stopped working to, oh my God, does it work?

Steven: Yeah, a lot can happen in five years. There's also the anecdote where one of the Wright brothers said that he was 50 years away from flight and then he achieved it two years later.

Liron: Exactly, exactly. Alright, so I have some follow ups, but let me go back real quick just to your model still of the two brain subsystems. I just want to poke at that a little bit more.

Let's take a simple goal, like relieve my itch. Like, oh, I got an itch on my arm. I want to relieve my itch. So do I have to use my steering brain to do it or can my learning brain just learn? Can the cerebellum just be like, oh yeah, I know the pattern here. You just move your hand.

Steven: I think scratching an itch is motivated behavior—the sort of motivated, flexible behavior that goes through the within-lifetime learning.

Basically the story would be that there would be some part of the brainstem, I don't know, the nucleus of the solitary tract or something. I don't remember exactly what's responsible for itches.

But that gets ground truth from the body, a certain type of nerve cell that goes through the spinal cord, dorsal horn or something.

And then it would send a number of signals to different places in the learning subsystem. So one thing it would do is send an interoceptive sensory input, so we would feel a particular type of feeling that we would describe as an itch.

And another thing it would do is send what I call involuntary attention. And that makes it hard to think about anything else except for the itch. So we want to be thinking about math homework, but our attentional pathways have been hijacked by this hypothalamus or brainstem cell group. And it forces us to be thinking about the sensation of itchiness.

And then yet a third thing would be going to the sort of reward function of our reinforcement learning system and saying this feeling of the itch is bad, and when we scratch the itch and we feel that feeling of relief, that that's a good thing.

So when you put all those ingredients together, it becomes very easy for the little baby to learn that scratching an itch is good, and thus the baby does so.

Liron: So that all makes sense. Is it possible that it's all a short circuit around the steering brain? Like we don't need a lot of business logic for that?

Steven: I don't think there's too much happening in the steering brain. I just think it's this little cell group with one input and three outputs, or four outputs or something.
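A schematic rendering of that hypothesized cell group, purely to illustrate the "one input, a few outputs" shape of the story; the signal names follow Byrnes's verbal description, not any measured anatomy.

```python
def itch_cell_group(itch_nerve_signal: float) -> dict:
    """Toy 'business logic': one ground-truth input from the body,
    a handful of outputs into the learning subsystem."""
    outputs = {}
    if itch_nerve_signal > 0.0:
        outputs["interoceptive_input"] = itch_nerve_signal    # felt as an itch
        outputs["involuntary_attention"] = itch_nerve_signal  # hard to think about anything else
        outputs["reward"] = -itch_nerve_signal                # itching is bad; relief is rewarding
    return outputs
```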

Liron: And so I mean, this is the distinction that I want to dive into, right? Because I think that the steering brain, you would argue, is really not analogous to LLMs, and it's still part of the human only, right? No AI is allowed human only secret sauce.

And yet we can see that some behavior like scratching an itch seems to route around it. And I guess the question is just how much can LLMs do things like - how much can you scale up scratching an itch, right? What if the itch is conquer the world?

Steven: So I think we should think of the steering subsystem as a little bit like a reinforcement learning reward function. It does other things too, but if we think of it as mainly being a reinforcement learning reward function, then that would be a little bit analogous to the RLHF reward function in LLM-land, or to the RLVR reward function.

And so people already do reinforcement learning with LLMs. I don't think that's the secret sauce, to have some special reward function.

Liron: Yeah. So I'm trying to poke at your distinction between the two types of brains. 'Cause I'm trying to think what is the minimal level of challenge, the minimal complexity of a goal where you can't just have these circuits that get wired up online using reinforcement learning - the steering system really has to come into play.

I think you've used the example of long-term goals, like make a profitable business. You definitely have to use a steering system for that. You can't just rely on a loop, I guess because the time horizon is just way too long.

Steven: So when I think about one of the things that large language models are not great at, I think they kind of struggle when you put a lot of interconnected complexity into the context window that is not already on the internet.

And the more interconnected, layered complexity you pile into the context window, the worse the LLM does.

But if you, I imagine training some LLM where you purge every mention of linear algebra from the training data. And then you put a linear algebra textbook into the context window and it has all of these new concepts.

So it has bases and kernels and singular value decomposition and inversion, and matrix exponentials and a gajillion other things that all relate to each other and pile on top of each other. And then you ask the LLM to do the exercises.

I think it would do a terrible job just because the context window is just not good at that kind of thing. It needs to learn things into its weights to have decent performance.

So that's kind of a contrived example, but if you think of running a factory, you likewise, over the course of time, wind up with idiosyncratic complexity and domain knowledge that's not on the internet.

So I know that this machine is good for this and bad for this, and how it interconnects to this other machine and how it interconnects to the noise regulations in the town and what my different employees are good at and bad at. And the things that I've tried and everything about my suppliers and everything about my customers.

Too much complexity in the context window and the LLM starts to fall apart.

And that also kind of explains why LLMs are so good at self-contained problems, compared to getting hired as a complete drop-in replacement human employee, which they're basically not doing much yet.

Liron: Yeah. That's sort of how I think about at least one aspect of LLM limitations.

Liron: Gotcha. I also want to ask about animal brains. Do some animals excel at one type of brain or the other? Like, is there some animal that's not human that just happens to have a really good steering brain the same way we do?

Steven: There's definitely variation in the animal kingdom about how much of their behavior comes from flexible within-lifetime learning versus how much of their behavioral repertoire comes from the steering subsystem.

It's not just variation across the animal kingdom, but also a variation within your lifetime. For example, a newborn is almost entirely brainstem driven in its behavior, whereas an adult has very little of that comparatively.

We still have our flinch reflexes. We still swallow, we still vomit, we still have vasoconstriction, but by and large, most of the things that we're doing on a day-to-day basis are learned.

And if you compare that with a frog, where the cerebral part of its brain is, I think, 25% of its brain volume as opposed to 80% for us - it's gonna be doing a lot more instinctive behavior throughout its life.

So presumably, I've seen newborn babies - they're, with all great respect, they're not very impressive in their ability to navigate the world because humans have evolved to be so reliant on these learned behaviors.

And the animals, presumably, that do less within-lifetime learning are gonna be correspondingly better at navigating through innate reflexes.

Liron: Do you think it's accurate to say that a lot of animals do have comparable steering departments in terms of brainstem and hypothalamus, but at the end of the day, they just can't learn that much, and so their steering options are limited?

Steven: It's hard to compare. Frogs have the right instincts for being a frog, and we have the right instincts for being human.

And part of that is that our steering subsystem is doing stuff that will help guide the learning subsystem to learn useful things.

And if you don't learn as much within-lifetime learning, then that's less for the steering subsystem to do, and it makes up for that by having more to do on the sort of motor control side.

I don't want to build some hierarchy about whose reflexes are better or worse. Every animal has reflexes that are very impressive and help it survive and thrive in its own niche.

Liron: It seems like it's kind of obvious to look at humans versus other animals and say that we have a much bigger neocortex, which gives us a much bigger learning brain, and so we learn better than the other animals and we can learn more complex things than the other animals.

I feel like that's kind of a consensus account and then you would also agree with that account or what's an interesting piece of nuance there?

Steven: I don't have a strong take on the secret sauce of human brains as compared to, I don't know, elephant brains, which have a comparable amount of neocortex, I guess - or I don't know exactly how the cell counts compare, but maybe the cell types are different. Yeah, I don't really have much of a take on that.

Liron: Let's just state the obvious. I think you and I are both on the same page, that it's not like some non-human animals are secretly so smart in a way that we can't imagine. Like it's pretty obvious that humans are the smartest, right?

Steven: I'm sure there's things that chimpanzees are better at than humans, and likewise, elephants. I do think that there's an important fact related to the fact that humans invented technology and went to the moon, and elephants did not, and orcas did not.

It's the same thing that humans did by inventing - we colonized every continent on earth and we clearly are able to use technology to accomplish goals in a way that other animals are not.

And the thing that humans are doing that enables them to do that is something that future AI will be able to do too. So that's kind of what's relevant to my own interests.

Liron: Exactly same here. The steering brain. We're talking about the brainstem and the hypothalamus. I think it's an interesting contrast to look at what Google is telling me about these brain regions by citing Cleveland Clinic and then to look at your interpretation.

So for example, when I search for hypothalamus, Google is highlighting that the hypothalamus helps manage your body temperature, hunger and thirst, mood, sex drive, blood pressure, and sleep.

But in your mind, it's maybe doing more than that in terms of steering the human's goals.

Steven: Yeah, it's doing all those things and more. There's definitely strong evidence that social instincts are also one of the things in the hypothalamus, which goes beyond sex drive and aggression, but also includes things like parenting.

There was a recent study that found that mice get lonely and they long for the comforting touch of another mouse. And the lonelier they get, the more comforted they are when they finally got to hang out with a new mouse friend.

And there's - they found a little cell group in the hypothalamus that tracks how long it's been since the mouse has been with another mouse. And over the course of days, its activity ramps up and up and up.

And then once they get caressed by another mouse friend, this different cell group turns it back off.

So that's the hypothalamus doing something that's very obviously related to sociality. So humans have this - I guess we humans have this sort of sense that our morality and our sense of justice and all these things are given to us by God in our immortal soul.

People might describe it in a more sophisticated way than that, but we don't like to think that these quote unquote higher instincts are just another cell group next to the one that controls your blood pressure.

But really I think there's good evidence that it is just another cell group next to the one that controls your blood pressure, out in the medial preoptic nucleus or whatever of the hypothalamus - that's what it is for mice.

That's sort of what makes sense in terms of AI and in terms of - you can't get an ought from an is. If we want to follow human norms, there has to be some motivation coming from our brainstem and hypothalamus that says it's good to follow norms and it's bad to violate norms, and there has to be some cell group implementing that.

Liron: Exactly. So if you're correct that what we think of as morality is business logic that our genes write into our hypothalamus or that region somewhere.

In that case, we certainly believe in the Orthogonality thesis where it's easy for us to imagine, okay, well imagine changing the genes. Imagine changing the business logic. Then you change morality, at least with respect to the new organism, right?

You and I might agree, no killing people is wrong no matter what. But the new organism will disagree and there may not be anybody to stop him.

Steven: High-functioning psychopaths actually exist in the human world. We like to pretend that they don't, but they really do. And I think we can learn something from the fact that they exist.

Liron: Exactly. So from my perspective, it's like, what else is there to say? Right? This is so obvious and yet half the guests come on this program, and we have to debate whether morality is universal or whether the orthogonality thesis is true.

Steven: Yeah, I'm definitely very strongly on the side of, you can't get an ought from an is. I think that the oughts come directly or indirectly from this reward function, and the is's come from the cortex learning, using predictive learning to build a better and better model of the world.

Liron: Yep. Okay, so we did hypothalamus. Now I'm looking at clevelandclinic.org. For the brainstem, it says the brainstem helps regulate vital body functions that you don't have to think about, like breathing and your heart rate. Your brainstem also helps with your balance, coordination, and reflexes.

Steven: Yeah, there's often a sort of hierarchy where the hypothalamus is kind of deciding what reflex programs to do. It's deciding whether it's a good time to laugh.

And if it is a good time to laugh, then the hypothalamus will trigger various cell groups in the brainstem that actually coordinate which muscles are going on in which order.

Or the hypothalamus will say this is a good time to vomit. And then the brainstem will actually trigger the vomiting reflex. First it's this muscle, and then it's that muscle.

Liron: And you're saying that there's a connection between regulating homeostasis and keeping all your organs in line and having goals. I mean, I feel like that's a pretty key insight that you're bringing here.

Steven: Yeah. I mean, vomiting is unpleasant. Obviously people don't like vomiting, and that's a sign that the vomiting reflex in the hypothalamus is also wired up to this reward function, which says a bad thing is happening right now.

And then you learn from experience that if you're making a foresighted plan that ends in vomiting, then that's, other things equal, a reason not to do that - with apologies to the college students listening in.

Brain-like AGI

Liron: Now, your vision, basically what you think is the most likely thing to even aim for. You describe it as brain-like AGI with the better parts of reverse engineered human social instincts. But it is autonomous, correct?

Steven: Yeah. I think there have to be AIs that are thinking about what would constitute a good future and autonomously making decisions in that direction as opposed to following specific directions from humans.

So this can come about in a number of ways. We can talk about the starting point being, follow person X's long-term best interest. Even if person X doesn't realize that something is in their long-term best interest, do it anyway.

And then one potential vision for how to get there is: humans are able to autonomously act towards a good future - we hope so - and humans are able to do moral philosophy to some extent.

Maybe AIs that are sufficiently similar to humans can also do moral philosophy for the same reasons. Things like that. That seems worth looking into. Again, I don't have a plan that I feel great about, but that's one of the directions I'm looking into.

Liron: All right, viewers, so now you have a sense of where Steve is going with all this research and all of his ideas kind of originate from trying to get to this endpoint.

With that context, before we talk more about alignment, let's go back and talk about some of the core discoveries that you think - I mean, maybe they're not fully consensus yet, but they make a lot of sense. We're gonna talk about the core discoveries that you've made on your way to solving this alignment problem.

When I hear you talk about brain-like AGI, I feel like you would've wanted to just say AGI. But now we have these LLMs, and LLMs are these weak AGIs that are weaker than we thought AGIs were gonna be.

Now we need a more precise term for the AGIs that are going to fulfill the true destiny of AGI. So you call them brain-like AGIs?

Steven: Yeah. I mean, I mostly want to be agnostic. You could imagine that there's more than one way to build AGI. In principle, if cost is no barrier, there's definitely more than one way to build AGI - you can do a blind search over all computer programs and you'll eventually get AGI that way, or you can do computable approximations to AIXI and eventually get AGI.

So I was saying this is a contingency that we should plan for. This is one way that an AGI computer program can come about.

Liron: It's a lower bound on true AGI as being brain-like.

Steven: It's a scenario. And I do actually think it's the most likely scenario, maybe the only possible scenario, but I didn't want to - I mean, it's hard to argue for that.

Liron: And it's also possible that every AGI is kind of brain-like, I mean, we don't even know.

Steven: Yeah, I would put that another way and say that it's possible that brain-like AGI is, in the sort of broad sense that I use the term, is the only, or at least the only practical way to build AGI.

I like to bring up the Fourier transform case. If you're going to do a Fourier transform on a large dataset, you should use the FFT, and the FFT kept getting reinvented over and over through the course of history because that's just the best way to do a Fourier transform. And if we ever met aliens, then they would be doing FFT too. That's just the natural algorithm for solving this problem.

So by the same token it's possible that the natural algorithm for getting ambitious projects done in the real world is brain-like AGI.

We don't know that for sure. It's just a possibility.

Liron: One observation about brain-like AGI that I think is pretty important is that it seems like evolution had a pretty easy time building human brains. Just because I'm observing the brain size difference and the intelligence difference between other apes and humans.

It happened really fast in evolutionary time, so it's not like natural selection had to refine it over so many eons. It's like, oh yeah, just add more neurons and then you can be really intelligent.

Steven: Yeah, I mean, it's a little hard to tell. You could point to evidence in the other direction too, like the fact that we didn't have a technological species in the last 500 million years of vertebrate evolution - why didn't elephants and orcas invent technology?

I don't know. And how many different pieces of the puzzle happened to already be in our brain to seed the way for human-like intelligence.

It's just, yeah, it's a little hard to - I wouldn't say that there's a clear cut answer.

Liron: Well, if you don't want to agree that it was easy to build intelligence, then the only other thing you can claim is that other apes already have most of the intelligence that humans have.

But then there's the observation of like, okay, but we are the boss, not them. Right? Like there's a huge difference in terms of how much we're achieving.

So it's just interesting that this last spurt where all the effective power is, was an evolutionarily easy spurt.

Steven: Yeah, that's for sure. There's not that much evolution that happened between us and chimps.

Liron: And that gives me a strong intuition that we really are likely on the eve of AGI like whatever that last spurt is gonna be. It's, we're probably close.

Steven: Um, yeah, I agree with the conclusion. I'm not sure that that's a strong argument, because reasonable people disagree about how many of the other ingredients are still missing.

But I do think you're right that we're gonna get there with sort of one big missing piece.

Actor-Critic Reinforcement Learning

Liron: Let's go back to that question and the question of what can't LLMs fundamentally do. People love talking about the subject because it's very interesting, right?

We have these LLMs, everybody has been at least a little bit surprised about how many things they can do, and yet so many people are making so many claims about what they can't do. And a lot of those claims seem to get disproven.

Like there's the famous Scott Alexander versus Gary Marcus and other skeptics, where the skeptics are saying like, look, these images, they're hallucinated. You're never gonna have a horse riding a person, it's always going to draw a person riding a horse, even if you prompt it to do a horse riding a person.

But then the next version comes out and it successfully draws a horse piggybacking on a person.

So there's all these back-and-forth debates about what these LLMs can't do, and the goalposts seem to keep moving, and you've kind of surprised me because you've actually taken a stand on what LLMs can't do.

And you use the example that you would be shocked - it would totally break your model - if you could just have an LLM-based system that could create a successful, profitable company.

Steven: Well, I mean, yeah, there's these toy examples on the internet where the LLM is like going “I’m starting a business!” and then Marc Andreessen is like, “Hey, fun, here’s $10 billion”, or something. And you could call that a business, but it's kind of a silly thing.

If we're talking about a robust business, an Amazon-style business, where they build things and they come up with new business models and they actually execute over an extended period of time and they have management and they have employees - that kind of thing.

I think Jeff Bezos is still head and shoulders above LLMs at being able to do that.

Liron: Now, when we use metrics like novelty and creativity - this is a very popular choice of metric for people trying to argue that it never strays far from its template, so it can never be novel, it can only remix.

Do you think that's a productive metric, or do you think we have to find a different metric?

Steven: So there's a thing that humans can do. Like if I say, “imagine a purple colander with a bird on top, falling out of an airplane”.

And you can just immediately create that image in your head. And I can ask you follow up questions about it, and you'll be able to answer those questions very skillfully, leveraging all of your understanding of aerodynamics and airplanes and birds.

Is that, like, a creative thing that you just did? Maybe. And it's definitely this kind of thing that LLMs can do. It involves remixing existing ideas in a way that has never happened before, so far as I know.

I mean, there's the thing where LLMs fail in really quite baffling ways sometimes, and it's hard to know exactly what to make of LLMs saying things that are stupid and in a very inhuman and baffling way.

I'm sort of more likely to emphasize what I said before about the context window. The more that you pile stuff into the context window, the worse the LLM gets and the stuff that's not in the context window has to be stuff that humans already created on the internet.

Liron: But when you put it that way, it suggests that maybe there's ways to work around it being like, okay, well what if I just use two prompts, right? Where it's like half the stuff in each prompt and then merge them.

Steven: Yeah. I'm sure people will try whatever possibilities they can think of, and I just think that this is a problem with the LLM architecture.

Liron: Another way for me to ask the question about where you think the LLM barrier is, let me ask it this way. What do you think is the least impressive thing that AI almost certainly still can't do in two years?

Steven: I'm reluctant to make a strong stance on this, first of all, because I'm not an LLM power user myself.

And second of all, I'm not asking people to believe me when I say that LLMs will hit a wall. I'm sort of making a research bet, but I'm trying not to screw people over in the world where I'm wrong about LLMs hitting a wall.

And third of all, for humans - as a human works on a problem, as they build their factory and run it over the course of days, weeks, months, years, they pile more and more idiosyncratic domain knowledge about this particular factory - knowledge that's not on the internet - into their brain.

And LLMs are, I think, unable to do that kind of thing. But what they can do with the context window has been incrementally increasing.

I think that I don't want to make a strong stand about exactly where the ceiling is gonna be in two years.

Liron: Yeah. I mean, I don't exactly disagree with you that LLMs are limited, right? Like, I mean, I can see it too, right? I can - I mean, the most surprising thing to me is that you'll have these state-of-the-art systems that are giving you such insightful advice about medical conditions or analyzing human relations, like really deeply insightful stuff, like better than humans, no question.

Then you'll ask them a very simple question, like even some simple arithmetic, and they'll confidently give you like an obviously wrong answer. Right? That's still shocking even to this day, right? So I agree. There's something up, right?

Like there's something to be explained here. Maybe there is some fundamental limitation that has to do with robustness.

It's just that it seems like the LLM model is so flexible. The LLM model is so flexible that like anytime you notice something wrong, a lot of times you can patch it by being like, okay, well just point them at this giant terabyte of data and then the problem will be fixed.

And really, you just have to point them at a ton of data, and I mean, that was pretty easy, right? Now we're talking about like, oh yeah, in the future there will be a few humans doing something and pioneering it, and then the AI will take over rapidly.

It's like, huh, okay. So, and you don't think we're that close to automating those humans, to just patching our way to not even need the humans. I'm just not seeing a hard separation.

Steven: So like, if the AI has been running this factory for six months, then there isn't any pile of data to point an AI at for all of the details about this particular factory - details that don't exist on the internet, like this particular way that this particular machine works with this particular set of employees.

That's sort of a humdrum example, but of course people are more interested in things like inventing a new scientific paradigm. Take some random physics concept like, I don't know, quantum capacitance, and say that the LLM doesn't quote unquote understand all the nuances of quantum capacitance.

How are you gonna create more data for the LLM? The way to create that data is to understand quantum capacitance. So it's sort of a chicken-and-egg problem where I don't know where you're supposed to get that data from.

I'm sure that AI companies, as we speak, are trying to solve that puzzle, and I wish them failure, but we'll see how it goes.

Liron: So I think the naive proposal for how LLMs could take over the world would be like, okay, yes, of course they need online learning, right? So you do need a loop where the training is happening constantly.

Or worst case, maybe you could argue humans do this, right? You go to sleep and you train overnight. I mean, if they want to compete with humans, it's okay if they have eight hours of downtime per day, right? To retrain themselves. So that's like a pretty generous window.

So the naive proposal is you just have better and better LLMs and they go out into the world and they assemble new knowledge. They've got a camera attached to them, right? And they're collecting all the data for the day, and then they go to sleep and they train themselves on all the new data and they wake up and they're just LLMs again.

And like, yeah, it's kind of naive. I don't think that's really what AGI is going to be like, but it's very hard for me to tell you where that model is going to tap out relative to humans.

Steven: Yeah. I would point to something like, again, if there's some concept that the AI doesn't understand. Then it doesn't have a way to magic new training data from thin air. Like if there isn't enough data on quantum capacitance in the human training corpus.

Then the AI can't just print out its own new facts about quantum capacitance and train on them, because the whole point is it's confused about quantum capacitance.

Liron: Well, right, but the idea would be that the LLM is able to be employed as a quantum capacitance researcher and do the same daily steps that a human would do to figure it out.

Steven: Right, yeah. So for the daily steps that the human does - the human is still able to do continuous learning in a way that LLMs can't, as of today.

Liron: When you say continuous learning though - I'm not sure how long a timescale you're talking about, because the premise would be that the LLM starts the workday with the same background knowledge prior to that day that the human employee would start with, right? You take it one day at a time.

Steven: Yeah. I mean, yeah. I think that humans are able to figure things out, in the absence of training data, in a way that LLMs can't. If a human realizes that something that they previously believed was wrong, then they can update their permanent knowledge to excise that incorrect belief.

Whereas, at least as of today, the way that LLMs are trained, that kind of thing doesn't happen.

Liron: So if I understand correctly, I think you're saying even if you initialize a human and AI with access to the same prior memories and prior knowledge, and they've read the same textbooks up to the same current state of knowledge, you still think that just in the eight hour workday, the human researcher has more going on in their brain that's going to let them take the lead versus the LLM?

Steven: Um, I think that, yeah, my claim would be that the human is capable of figuring new things out and adding them to its permanent knowledge over the course of the eight hour workday in a way that the LLM is not, and that you're gonna wind up with a slightly smarter human at the end of the workday and you're not gonna wind up with a slightly smarter LLM.

And then, maybe over the course of one workday, who cares? But iterate that over days, weeks, months. And the humans can gradually figure things out about their situation over time in a way that I think LLMs can't.
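
A toy way to see the distinction being drawn here - a hedged sketch with entirely made-up classes, not a real training setup: a frozen model only carries new information in a context that gets wiped, while a continual learner folds each day's lessons into its permanent parameters.

```python
class FrozenModel:
    """Stand-in for today's deployed LLMs: weights fixed, new info lives only in the context window."""
    def __init__(self):
        self.permanent_knowledge = 1000   # facts baked in during pre-training
        self.context = []                 # limited, and wiped between sessions

    def work_one_day(self, lessons):
        self.context = lessons            # available today...

    def end_of_day(self):
        self.context = []                 # ...gone tomorrow; permanent knowledge unchanged


class ContinualLearner:
    """Stand-in for a human (or a hypothetical brain-like AGI): each day's lessons become permanent."""
    def __init__(self):
        self.permanent_knowledge = 1000

    def work_one_day(self, lessons):
        self.permanent_knowledge += len(lessons)   # knowledge compounds over days and weeks


llm, human = FrozenModel(), ContinualLearner()
for day in range(30):
    lessons = [f"factory quirk {day}-{i}" for i in range(5)]
    llm.work_one_day(lessons)
    llm.end_of_day()
    human.work_one_day(lessons)

print(llm.permanent_knowledge, human.permanent_knowledge)   # 1000 vs 1150 after a month
```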

Liron: I'm not sure I agree with framing it as the human so good at adding new things to its knowledge day after day. Just because anytime knowledge is created, LLMs can vacuum up the knowledge.

So I'm not sure that it has to be accumulation of knowledge that you yourself came up with and then added to the pile.

I think it might be more productive to frame it as robustness. So it's not just about adding knowledge, it's about looking over your day and noticing that a bunch of the stuff you did was bad, and learning from that.

Right? Or you made mistakes and you want to backtrack. Because that's my sense of LLMs. It's like they can do everything, but they have this mistake rate and they don't recover from the mistakes, right?

The mistakes build up and they don't have a robust mistake cleaning process.

Steven: Yeah. I mean, I think that's a little bit indirectly related.

Again, time will tell what the AI researchers come up with and certainly, if there's any way to make LLMs work better, then I'm sure that it's already on ArXiv today, and if not today, then tomorrow, and it'll be in the next generation of LLMs the day after that.

And we'll get to see how well it works, or does it work?

Liron: Isn't it weird though, that the LLMs will make these obvious mistakes, but a lot of times it's like, yeah, they make the mistake 20% of the time and then they get it right 80% of the time.

Isn't it weird that you can't just tell them to run over and over again and like, okay, keep checking your mistake, keep checking your mistake. For some reason, some of these mistakes persist even after a bunch of check yourself runs.

Steven: Yeah. And is that a problem that, you never know which kinds of problems are gonna get better in the next generation or which ones are not?

Liron: Yeah. Okay. So that's what's salient to me is the robustness and which is a really weird place to be in. Like, I never would've predicted that this could be the situation, but here we are.

There's one more thing I wanted to ask you. On the subject of what LLMs can't do. I feel like it's all the rage now to talk about the time horizon metric, right?

That graph of like, oh, we've discovered that if something is only a couple hours, then AI can surpass human performance on it. But if something takes a few days, then the human will surpass the AI.

Do you think that that's a good metric to look at and do you think it's like, yep, it's never going to get past a day? Like I'm putting a hard firewall on a one day time horizon.

Steven: I think that sort of dovetails with what I've been saying about putting a lot of complexity into the context window.

The longer that it takes, I mean, not every task is like that, but there's at least a vague correlation between how long it takes a person to do something and the amount of idiosyncratic knowledge, not on the internet, that you have to learn over the course of doing this task.

And that would explain why the shorter and more self-contained a task is, the better an LLM generally is at doing it.

And I don't know, again, clearly the line is going up, so I don't know exactly when it will stop, but I think that it's sort of a different - it's different than how humans solve problems and I think there's something fundamental there and time will tell exactly where the ceiling is.

Liron: Do you think LLMs today can do everything that an IQ 80 human can do?

Steven: So if I wanted to think of a task that an IQ 80 person would do better, I would try to come up with something that involves figuring things out over the course of days and weeks.

I think that's where the human brain can shine relative to LLMs. Learning to juggle? I guess you would say that doesn't count because LLMs don't have motor control.

If you made a learning to juggle text-based game where you sort of mapped the - this is kind of stupid.

You'd have to put the person in a sufficiently complicated environment. Or here's another one. Give the person a funny gadget that you just invented in the next room and give them a couple weeks to play with it and they will get really good at using the gadget as a tool and putting the gadget on things and balancing the gadget on their head, and they'll understand it inside and out in a way, as if it was their own hand.

I guess that's kind of a stupid example too. 'Cause then you're gonna complain that LLMs are not as good at multimodal stuff.

Liron: I don't know. I think there might be something to it, and maybe instead of a gadget it could even just be something like Wordle or Sudoku or Tetris - some game, right?

Steven: Yeah. I don't recall how Claude plays Pokemon is going these days, but it was not impressing me very much. It needed a lot of handholding last time I checked in on it.

Yeah, I'm not sure. I think IQ 80 people, I guess I don't really know what IQ 80 corresponds to in terms of people in the real world, but I imagine that they are at least potentially better at playing Pokemon than Claude without handholding.

Especially if you give the person a tutorial - whereas Claude was able to read every tutorial on the internet during pre-training.

Liron: I bring this up because before when we were talking about like, okay, AI is not robust. It makes these simple mistakes. It can't fix its mistakes, but I think we might have a tendency to be comparing it to above average IQ humans, and it's possible that it's just very close, or it's already just kind of cleanly beaten lower IQ humans.

It's hard to tell, right? Like maybe everything - maybe lower IQ humans are constantly making similar mistakes and just society is able to just recover from those mistakes, help them recover. Right? And we just don't notice it.

Steven: Yeah. I mean, semi-menial jobs, like, I don't know, warehouse workers, grocery baggers. I do think - again, it's hard to compare 'cause LLMs don't have robot bodies - but if I try to imagine building a text-based system that's sufficiently analogous to bagging groceries, then I think that a below-average person, with weeks or months of trial and error, will eventually get better at bagging groceries than an LLM struggling in this complicated text-based grocery-bagging system.

If you put in enough low-level details about, now you have to move your arm this way, and now you have to move your arm that way.

Liron: When you're saying, Hey, we don't have brain-like AGI, it's almost like your claim is that we don't have 120 IQ brain-like AGI.

Steven: Yeah. I think there's a better learning algorithm, a more powerful and scary and dangerous learning algorithm out there, and I think people are gonna discover it sooner or later.

Liron: But you see what I'm saying, right? Even with your claim about a separation between humans and AIs, it sounds like you're not so confident that an 80 IQ human will hold their own against an AI for many years.

Steven: I think you'd have to pick the right task. Humans have a way of growing in their knowledge over time, the knowledge that they need, they gradually learn it in a way that I think LLMs struggle with.

But LLMs have a higher starting point, so it's a little hard to come up with those tasks. But if we're talking about things like wiping out humanity and running the world by itself - a good starting point without really being able to systematically learn as you go in a human-like way - I don't think that's adequate.

Liron: All right. Now to summarize what you said before, you think what the human brain is doing is essentially better RL, right? Some algorithm that's a better member of the class of reinforcement learning algorithms than what we have now with LLMs.

But we also have other successful examples like AlphaGo of systems that are achieving things that LLMs can't achieve by using different types of reinforcement learning.

So I think maybe you and I are on the same page that some puzzle piece, some ingredient from the world of ML is going to come and integrate with the ingredients we have now in the world of LLMs. And then you'll just have something that's more brain-like.

Steven: Yeah, better model-based RL. And we can be agnostic about how much of the secret sauce is in the model versus in the reinforcement learning.

There are still yet-to-be-discovered secrets in the world of learning algorithms. There's a lot of learning algorithms. We haven't discovered them all yet.

Liron: Now you've written that you're expecting something that's actor-critic, model-based reinforcement learning. So could you explain how is actor-critic model-based reinforcement learning different from regular reinforcement learning?

Steven: Yeah, so I'm using the term in a pretty broad way. I do think that reinforcement learning in the brain is kind of different from reinforcement learning in reinforcement learning textbooks, AI textbooks.

So for example, some idea pops into my head like, I'm gonna open the window, and then that seems like a good idea or a bad idea. And if it seems like a good idea, then I actually do it.

So I would propose that there's one part of my brain, roughly the cortex that is the genesis of this idea. I'm gonna go open the window.

And a different part of my brain - namely the brainstem in conjunction with the striatum - is like: that's a good idea, or that's a bad idea. So that's kind of what I mean by actor-critic.

Liron: And what's the actor and what's the critic again?

Steven: So the actor would be proposing an idea. I'm gonna go open the window, and the critic would be like, That's not a motivating idea, don't do that, or Yes, do that.

And if the critic says yes, then I actually stand up and walk over to the window and open it.

Liron: Have you mapped actor and critic to different brain regions?

Steven: Yes. I think the actor is basically the cortex, to a first approximation, and the critic is basically the striatum and then the hypothalamus and brainstem are kind of a ground truth that updates the critic.

Liron: So the striatum is part of the learning brain, right? The learning from scratch brain.

Steven: Yeah. It's its own kind of learning algorithm.

Liron: Yeah. Can you unpack that a little bit more? Because you've got the learning from scratch brain and you've got the cortex versus the striatum. So how do those interact when you're learning?

Steven: I think the cortex proposes a thought or an idea, a plan or whatever - I'm gonna go open the window. And then the striatum sort of takes that thought as an input.

Well, the striatum does a few other things too, but the relevant one for this conversation is that the striatum takes that thought and tries to judge whether it's a good idea or a bad idea, on the basis of these connections which have been learned over the course of your lifetime, based on what has tickled your innate drives versus what has not.

Liron: And that's kind of a layer before we also do the check-in with the steering brain.

Steven: Yeah. So the steering brain is updating the striatum in hindsight. And also this steering brain is sometimes overruling the striatum at the moment.

So if I'm in pain right now, then probably whatever I'm thinking about is a bad thing.

Liron: Yeah. Okay. So maybe it's kind of like layers of management where the striatum is just closer to the line manager and the steering brain is more of the CEO or the owner.

Steven: The striatum is making a guess at what's good and bad. And it has more to work with for its guess because it has all these neurons and all these connections to your rich understanding of the world.

Maybe the striatum is like the teaching assistant and the brainstem is like the professor or something. So the teaching assistant is grading the homework and the professor is sometimes noticing that something really bad happened and the teaching assistant didn't catch it and saying, yo, teaching assistant, you did this wrong.
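
Here's a deliberately tiny actor-critic sketch in that spirit: the "actor" proposes candidate ideas, the learned "critic" guesses whether they're worth acting on, and a hand-coded "ground truth" signal updates the critic in hindsight. All names and values are made up for illustration; real actor-critic algorithms (and real brains) are much richer.

```python
import random

ideas = ["open the window", "keep working", "eat chalk"]
critic_value = {idea: 0.0 for idea in ideas}      # striatum-like learned guesses, start neutral
ground_truth = {"open the window": 1.0, "keep working": 0.2, "eat chalk": -1.0}  # brainstem-like signal
learning_rate = 0.3

for _ in range(100):
    idea = random.choice(ideas)                   # actor (cortex-like) proposes a candidate thought
    if critic_value[idea] >= 0:                   # critic approves: actually act on it
        reward = ground_truth[idea]               # ground truth arrives after acting
        critic_value[idea] += learning_rate * (reward - critic_value[idea])  # hindsight update

print(critic_value)   # "open the window" ends up valued highest; "eat chalk" gets suppressed
```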

Liron: Makes sense. When you look at Jeff Hawkins' research, do you think that is gonna have one of the ingredients for the actor critic reinforcement learning that we can copy from the brain?

Steven: So he is basically entirely focused on how the cortex works. He is very uninterested in, in fact, outright dismissive towards, understanding the hypothalamus and brainstem.

I think he has really bad takes on AI safety that come from his unwillingness to spend even five minutes thinking about how motivation works.

So for example, if you read his most recent book, if you read one chapter, it says the cortex builds a model of the world, and a model is not dangerous. By itself, it's just like a map sitting on the table.

Liron: Oof.

Steven: Oh, it's even worse than that. He talks about how the cortex wants to undermine the brainstem. The cortex is the home of all these higher drives, like morality, curiosity, and making the world a better place.

So there's this glaring contradiction between “the cortex is like a map on a table” versus “the cortex is the source of all these higher drives, like morality”.

I think the reconciliation is that, in fact, morality, just like curiosity, just like everything else, is coming from the hypothalamus and brainstem.

And we're gonna need to reverse engineer and put those in our AGI. And the question is how do we do that?

Liron: Makes sense. Okay, so Hawkins is kind of at a dead end in your perspective.

Steven: I mean, if you want to know what the six layers of the cortex are doing and why, there's lots of people who are trying to figure that out. And Hawkins is one of many people who are theorizing about the exact role of the cortex circuits, and what is layer 4 doing, and what's layer 2/3 doing and so on.

Alignment Solutions and Reward Functions

Liron: Alright, getting back to your research, what you're focusing on: you're anticipating brain-like AGI, and you're anticipating this kind of new actor-critic reinforcement learning that gets us farther than LLMs have ever gotten us and does longer-horizon tasks successfully.

You're anticipating that this new kind of brain-like AGI is going to be referring to a new kind of reward function, and you don't think it's going to look like predicting the next word. You think it's going to look like legible Python code.

Steven: So I do think that part of the human brain is doing predictive learning. So you reach out to the door handle and you expect it to be quiet, but it makes a noise and you're surprised by that. You wind up with this complicated model of the world that's able to make really good predictions about what will happen in the future.

And I do think that works by predictive learning in a way that's structurally analogous to the next token prediction. I don't think it's a transformer architecture. But it is still a learning algorithm that's trained on predictive learning.

We can think about what's special about the reward function from a capabilities perspective, and then we can think about what's special about the human reward function from an alignment perspective.

From a capabilities perspective, I really just don't think it's that hard to make a reward function that would lead to powerful capabilities. I think alignment is much harder. But if you just want a reward function that leads to powerful capabilities in the context of brain-like AGI, you need curiosity, you need a few other odds and ends.

But mostly the reward function is important for alignment from my perspective. The learning algorithm and the update rules are what's important for capabilities.

Liron: That makes sense. I think one distinction you're making is we're not expecting the reward function to be an inscrutable mess. We're expecting it to be an actual Python module that a human can curate.

Steven: So if and when people make a brain-like AGI, then they can put whatever they want as the reward function. So I think that almost anything that they choose to put in would be terribly dangerous.

And the question is what could they put in the reward function that would actually lead to good outcomes? Like an AI that doesn't want to kill its programmers and its users and everybody else?

The direction that I'm most optimistic, or at least least pessimistic about would be a reward function that looks like more or less legible Python code as opposed to something more like RLHF where it's trained on thousands of examples of good and bad behavior.

I think that the latter is doomed and we can talk about exactly why I think that.

Liron: The inscrutable matrices are doomed. Yeah, unpack that a little bit.

Steven: So when people talk about inscrutable matrices, they're often talking about the learning algorithm. And I do expect that a brain-like AGI will still learn a world model that is inscrutable, just because, what else are you gonna do? It's a complicated world.

Tires are usually black, and garbage trucks are often blue. There's just so many things about the world, and you're not gonna be putting those in the source code. Instead, they need to be in some kind of learned model, and whether they're entries of a matrix or whatever else, they're probably gonna be inscrutable.

So there's really no getting around inscrutable matrices, or at least inscrutable learned models, from my perspective. I don't know why people talk as if there's another option.

But separately, the reward function might or might not involve learning algorithm components. So the RLHF reward function that's used in large language models is a learned function. So people thumbs up and thumbs down lots of different LLM behaviors and that's used to train this reward model.

So you could do that with brain-like AGI too. I think it would not lead to good outcomes. I think it would lead to psychopathic AI that tries to kill everybody.
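
To make the contrast concrete, here's a hedged sketch of the two styles being discussed: a hand-written, auditable reward function versus an RLHF-style learned reward model. Both snippets are hypothetical illustrations, not anyone's actual system.

```python
import numpy as np

# Style 1: legible reward - a human can read and audit every line.
def legible_reward(state: dict) -> float:
    reward = 0.0
    if state.get("task_completed"):
        reward += 1.0
    if state.get("harmed_someone"):
        reward -= 100.0
    return reward

# Style 2: learned reward model - a function approximator fit to thousands of
# human thumbs-up / thumbs-down labels. You can query its score, but its "reasons"
# live in learned weights, so edge cases are hard to rule out by inspection.
rng = np.random.default_rng(0)
W = rng.normal(size=(8,))          # stand-in for millions of learned parameters

def learned_reward_model(features: np.ndarray) -> float:
    return float(features @ W)     # opaque score

print(legible_reward({"task_completed": True}))
print(learned_reward_model(rng.normal(size=8)))
```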

Liron: Interesting. So there's three different types of popular reward functions, right? So the first is predict the next word in a giant corpus or predict the next token, even if it's video data or whatever.

And then the next one is humans giving you thumbs up and thumbs down and predict the shape of the high dimensional manifold of thumbs up and thumbs down, which is kind of inscrutable.

And then the third one is here's this Python code, which is a closed form expression of what stuff is worth, right? Those are the three categories?

Steven: I might call the first one a loss function rather than a reward function. The predictive self-supervised learning is usually called a predictive loss, and I think in brain-like AGI, it's more structurally different from the reward function than it is in LLMs.

In LLMs, it's sort of the same kind of update that you make during self-supervised learning, during the pre-training phase versus the RLHF phase. It's all just gradient descent, but I think in the brain it's kind of different.

In one case, you're updating certain types of cortex connections. In a different case, maybe you're updating the striatum or something.

Liron: So if I understand correctly, you're saying there's this new, more powerful paradigm of machine learning that we're going to get to, and that paradigm is just generally not going to go with reward functions that are messy. It's just going to fit nicely with neater reward functions. Is there some connection there?

Steven: I think that if you hook up an RLHF reward function to a sufficiently good RL system, it would also be really bad because it would find edge cases in the learned reward model, and relentlessly optimize them, including taking over the world as necessary. It would be really bad.

But the reason that you don't get that with RLHF right now - well, you could get that if you turn off the regularization or whatever. Sometimes there's a KL divergence penalty, or they modulate the learning rate or something, to prevent it from going off the rails.

But when the LLMs are optimizing too hard on the RLHF reward function, they tend to do kind of stupid things like they'll output "bob bob bob bob bob" or something stupid. They won't make foresighted dangerous actions.

And that's where it becomes relevant that I think the human brain model-based RL system is just better. You give it some weird reward function and it will make good plans to optimize that reward function. It'll do foresighted planning. It'll come up with out of the box ideas.

It won't just go "bob bob bob bob bob". Instead it will take over the world and make Bob outputs or whatever. That's a bad example, silly example. But I think reward models are kind of inherently limited in that way.

And it's not just about brain-like AGI, but rather it interacts with the fact that brain-like AGI is a more powerful solution finder.
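
For the regularization point Steve mentions, here is the standard shape of a KL-penalized RLHF objective - a hedged sketch of the general form, not any particular lab's implementation. The reward comes from the learned reward model, and the KL term penalizes the policy for drifting far from its reference model, which limits how hard it can hammer the reward model's edge cases.

```python
def kl_penalized_objective(reward: float,
                           policy_logprob: float,
                           reference_logprob: float,
                           beta: float = 0.1) -> float:
    """Reward minus a per-sample KL penalty; beta sets how much drift from the reference is tolerated."""
    kl_estimate = policy_logprob - reference_logprob
    return reward - beta * kl_estimate

# A sample the reward model loves, but which the policy had to drift a long way to produce:
print(kl_penalized_objective(reward=2.0, policy_logprob=-1.0, reference_logprob=-6.0))            # 1.5
print(kl_penalized_objective(reward=2.0, policy_logprob=-1.0, reference_logprob=-6.0, beta=1.0))  # -3.0
```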

Liron: It's a more powerful solution finder. But you're saying the upvotes and downvotes could still work, especially in the early stages where it's not powerful enough and you're just giving it upvotes and downvotes. Wouldn't that still work the same way LLMs work?

Steven: Sure. But I'm interested in the solutions that will continue to work when the AI gets powerful.

Actor-Critic Model and Brain Architecture

Liron: Okay, so you're thinking ahead to this new regime, which I also think is likely where these things are basically magic wands, or what Eliezer calls outcome pumps. Are you a fan of that thought experiment?

Steven: I know what you're referring to. I think there's a lot of nuance and whether that's what we should expect and if there's steps we can take to avoid that.

Liron: Do you think that outcome pumps - I mean, not literally outcome pumps, but computable approximations of outcome pumps - are a convergent feature that we're likely to slide into?

Steven: So an outcome pump, for listeners who aren't familiar, that's this thought experiment where there's a magic button where you type in an outcome and it travels through time to figure out the course of events that make that outcome actualize.

So we are all outcome pumps insofar as we have desires about what the world is like in the future, and we take actions to try to make those desires happen. But we also desire other things. We like to follow norms. Sometimes we don't care all that much about what happens in the future.

So I like to talk about AI, or more generally systems, that have desires about what the world is like in the future. You and I have desires about what the world is like in the future. Most humans do to some extent, and we have other desires too.

I want to eat lunch right now in addition to, you know, caring that my child grows up and has a good life and has grandchildren or whatever.

I expect that we will eventually build AIs that likewise have desires about what the world is like in the future and act on those desires. So you could call that an outcome pump.

I mean, we'll also make AIs that don't. We're already making AIs that don't. But ultimately there are people who have desires about the future and they're trying to make - we want there to be a better solar cell. We want world peace, whatever. We want to get a Nobel Prize and a cool benchmark score that gets us into NeurIPS.

So people have been trying to make AIs that act as if they have desires about the state of the world in the future. And I expect that they'll keep trying that until they eventually succeed.

But then a separate question is will the AI act as if it only has desires about the future, or will it act as if it also has other kinds of desires too, like desires to follow norms, or desires to not rock the boat too much or whatever else.

And it's at least conceivable that we can make AIs that have these desires about the future, but also have other desires. And that could be a way to get away from the problem.

The downside of the outcome pump - if you have desires about the future, it leads to instrumental convergence, which as probably many listeners know, is the idea of if I want the world to be a certain way in the future, then what should I do right now?

I should gather resources, I should gather influence. I should take over foreign countries so that I can do with them as I please and continue to increase my wealth and influence. I'll be able to bribe people, I'll be able to - you name it, it helps to have tons of money acquired however it is, it helps to have a good reputation.

It helps to have all of these sorts of things. And so if you have desires about the future and you have incredible power, then you wind up with the AI that takes over the world in order to actualize its desires, and that's bad.

So I think desires about the future is the source of capabilities, and it's also the source of danger. And so that's sort of the needle that you have to thread. And I think part of threading that is that we can make AIs that also have other desires that are not just about the future.

Liron: I think to the degree that it does care, and actually the argument I want to make now is I always defend utilitarians. People are like, "oh Liron, you really think you're a utilitarian, so that means you don't care about people and love and feelings. You just only care about utility." I'm like, well, no, I just rolled the love and feelings into my utility. I still care about that stuff.

And I think that an aligned AI would look the same way, right? It would have all of these nice, fuzzy values as part of what it likes about the future state of the universe.

Steven: That would be nice if we made an AI that actually just wanted a good future. But there's also things like how it gets there. You could imagine an AI that wants a good future where humans are empowered, so it disempowers all humans and then gathers lots of resources, and then re-empowers humans later on.

But you can also imagine an AI that wants humans to never be disempowered in the first place. So that would be sort of - I guess you could call that deontological. But again, I don't know. It's easy to get into rolling up utilitarianism with making various claims that are deontological.

Liron: Okay. Well I want to tie this back to the concept of the Python code, which represents the reward function. Because when you think about solving the alignment problem and making the future go well, I think you're pretty focused on that piece, right?

I saw you wrote that you think that the AI companies are going to get in a position where they've got this new paradigm where they're using this new actor critic ML and suddenly they're getting better results. Instead of just a two week time horizon or whatever, now it's a two year time horizon and it's clearly an ASI takeoff, but they're going to have trouble with their reward function.

They're gonna see instrumental convergence and their AIs are going to be harder to control than ever. They keep doing the equivalent of "Mecha Hitler", right? They keep getting really hardcore Machiavellian on their masters and the utility function is just problematic.

And you want to basically be waiting in the wings, having done this research on what the ideal legible reward function is that they can just grab from the literature and it'll make their AI work better and steer the future better.

Steven: Yeah. I tend to think of this as being in the tradition of quote unquote RL research in AI, as opposed to in the tradition of how RL has been used in the LLM era.

So if you look at AlphaGo or something, the reward function is this really straightforward thing that says that winning is good and losing is bad. If you look at OpenAI Gym and basically all the other quote unquote RL environments, they almost always have sort of simple, legible reward functions.

And I think that if people sort of continue in that tradition, they're gonna make RL agents that have simple, legible reward functions of various sorts. And I think that's gonna be catastrophically dangerous, because if you put a sufficiently powerful RL system on some random reward function - like win this race, or move the boxes, or beat the Atari game, or whatever it is, but something operating in the real world.

Then all of a sudden, turning off the program becomes a barrier to optimizing that reward function. So that would be really bad. And we want there to be better options in the RL paradigm for reward functions that don't lead to catastrophically dangerous behavior.
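
As an illustration of the tradition Steve is pointing at, classic RL rewards really are a few obvious lines of code. These are hypothetical examples in that style, not any specific environment's actual implementation:

```python
def go_reward(game_over: bool, agent_won: bool) -> float:
    """AlphaGo-style reward: winning is good, losing is bad, nothing else matters."""
    if not game_over:
        return 0.0
    return 1.0 if agent_won else -1.0

def box_pushing_reward(boxes_on_targets: int, total_boxes: int, steps_taken: int) -> float:
    """Gym-style shaped reward: task progress minus a small per-step penalty."""
    return boxes_on_targets / total_boxes - 0.01 * steps_taken

print(go_reward(game_over=True, agent_won=True))                               # 1.0
print(box_pushing_reward(boxes_on_targets=3, total_boxes=5, steps_taken=40))   # 0.2
```

The worry Steve describes is precisely that code this simple, hooked up to a sufficiently powerful optimizer operating in the real world, says nothing anywhere about tolerating being turned off.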

Liron: So I agree with you that we're pretty likely to get to a point in time where we can see the AIs optimizing the world toward legible goals. I think even reaching that point is already kind of a good way to fail, because I actually think that by the time they get to that point, we might have such a fast foom - where they started developing better learning in their lab and they already embarked on a plan to take over the world before we even noticed.

Even before we get to that point, it's too late - we're already dead. So for me, it's kind of interesting if we even live to the point where we can be like, "oh yeah, they're clearly trying to make more money than us and they're clearly trying to maximize the number of GPUs covering the planet."

I think we might actually watch them pursuing those kinds of legible goals. We'll have a chance to observe that, and we might be able to identify a part of their code and even have the chance to edit it. We might actually be in that scenario. That's kind of your mainline good scenario, right - that we see where the levers of control are in this reward function and we have a chance at it.

So I agree with your premise that there's a significant chance that this may be the opportunity and the piece of the puzzle that we can shape. This is your research program. So I agree, this is a plausible good thing to focus on.

But then the problem is I just still think that whatever you put as that Python reward function is almost certainly going to go awry.

Steven: So I don't claim to have a great plan right now, so time will tell whether I come up with something. I don't want to make some claim that a blocker doesn't exist until I actually have a good plan.

Maybe there's blockers that I still haven't appreciated as I spend more time trying to puzzle over possible reward functions. But I do want to say that I think human brains are built with these relatively simple social reward functions and humans are at least nice sometimes and at least not very obviously terrible.

So that's at least an existence proof that maybe there's a path here.

Liron: Okay. Yeah. Let me double click on that. You're saying you don't claim to have a high probability of success or a solid plan at this point in time. And it doesn't sound like you're that optimistic necessarily about the human species as a whole solving the problem.

Which brings me to the ultimate question, surprisingly late in the interview. Are you ready for this?

[song] P(Doom). What's your P(Doom)? What's your P(Doom)?

Dr. Steven Byrnes. What's your P(Doom)?

Steven: I don't know, 90% or something, but I kind of pulled that number out of my ass, so take it for whatever it's worth.

Liron: So high.

Steven: I think things seem pretty grim to me. But of course we're working to make it better, so...

Liron: Okay. It's kind of funny because I normally get the P(Doom) question out of the way early and I kind of forgot and skipped it. We did this interesting interview and now we're getting close to the wrap up and we're like, "oh, by the way, my P(Doom)'s 90%."

Steven: Yeah. I think there's a lot of issues that go into that and we can talk about the different contributions.

Liron: Wow. Okay. So when you think about the success of your own program, that kind of implies that you think you have a pretty low chance of success?

Steven: I don't know. I think it's possible. I'm doing the best I can. I think there is a decent chance that I'll come up with some plan that kind of works in principle, but is hard and expensive to implement and very easy to go awry.

And then even if it's implemented right, it turns out that—we don't even know that these alleged bioengineered high IQ humans that you were talking about earlier in the interview would actually be good. Maybe with great power comes great corruption, even if you start with a good person.

I'm not sure it's that decision-relevant, and I don't like how P(Doom) has become sort of this tribal marker that's blown out of proportion to its actual decision relevance. Which is not zero, obviously - you are getting lots of important information about people's stance towards the problem.

Liron: Yeah, I feel you. And certainly Eliezer has made himself clear that he's not a fan of P(Doom). For my part, the reason I bring up P(Doom) so often in this podcast is because I am trying to move the Overton window.

So people who watch episode after episode, they're like, "oh, wow. All of these incredibly intelligent people are telling us that there's a high chance that we're doomed. Hmm. Maybe the message will get through."

Steven: Yeah. And I'll just reiterate that that's not P(Doom) from LLMs next year, but rather P(Doom) from AI in my lifetime or something.

Current AI vs Future Paradigms

Liron: I want to get a couple things out of the way that I think we'll easily agree about. One of them is that when we talk about LLMs, I think you and I are both happy with the net positive contribution that LLMs made so far. Correct?

Steven: Yeah. LLMs as of today - I like to bring up things like LLM-assisted spear-phishing as maybe a potentially bad thing, and LLM virtual friends as potentially worrisome, but by and large, I think the LLMs are a cool technology.

And if LLM progress halted today, then I'd be happy and thank the person who invented them.

Liron: Exactly. And it seems like we could even get rid of the safety departments at the AI companies and sure you'd have a "Mecha Hitler" here, or a bias there, but whatever, that's just fine. We'll solve it.

Steven: Yeah. I mean, if we got rid of them, the companies would bring them back because they don't want "Mecha Hitler."

Liron: Exactly. So it's stating the obvious from our perspective, but just in case the audience is confused, it's not like you and I hate technology. We don't hate LLMs. We think everything is great. We just are seeing a little bit into the future where things are different and then we get destroyed.

Steven: Yeah, I mean, it helps that I'm not an illustrator who's at risk of losing their job.

Liron: The other thing I wanted to get out of the way that I think is probably going to be an easy yes, is just this question of is there lots of headroom above human intelligence?

Steven: Yes. Very much yes.

Liron: How much does being super intelligent really let you do? There's a lot of people that are like, "yeah, there's gonna be an AI that's so brilliant. Physics is so easy for it, but it still can't do that much more. It's still gonna be bottlenecked. The government's still gonna slow it down and regulate it."

Steven: I like to talk about scaling horizontally rather than vertically. So no human can be doing a billion things at once around the world, but you can have a billion copies of an AI that are running a whole civilization.

Also, the AIs will be able to think much faster than humans, I expect. And they'll be able to figure things out, the way that the sharpest humans are able to figure things out. They'll be as charismatic as the most charismatic humans. They'll be as persuasive as the most persuasive humans.

Liron: Okay, but hear me out. You have to perform experiments at a human timescale and that'll slow you down.

Steven: Yeah. I think, again, I don't think the AIs will be born already knowing everything about everything. I don't think that's really necessary for the AI takeover story.

The AIs can wipe out humans and then do lots of experiments to learn about how to do things. We shouldn't tell takeover stories that require the AI to know things that it has no way to already know.

Liron: Does Gödel's incompleteness theorem prove that machines can never have the level of insight that humans have? Just answer yes or no.

Steven: No.

Liron: How about this, if we had a super intelligent AI in a data center, but it didn't have a body, it didn't have physical actuators, could it still easily take over?

Steven: Yeah. Oh yeah, definitely.

Liron: Yeah. I mean, the short answer I give is it can harness millions of humans. It can just slide into everybody's DMs and start cajoling them or employing them. The humans are the actuators.

Steven: Yeah. I like to talk about if you take Joseph Stalin on one side, every other Russian combined on the other side, who would win in a fight? Obviously the millions of other Russians. And yet Stalin still wound up - I don't think he was a particularly strong person. He didn't have better weapons, but he still wound up with dictatorial, totalitarian control over Russia because he used charisma, he used threats, he used all of his human skills. He used his brain.

Liron: I think there were something like 30 million Russians whose deaths you can attribute to Stalin - at least 10 million. He's actually worse than Hitler if you just look at the death count. And I'm Jewish, but I'm willing to admit Stalin was worse than Hitler in terms of moral valence.

And all of those 10 million Russians would've liked to see him dead, but they died.

Steven: Yep. Yeah. And that would be a lower bound for Stalin. Stalin only had a single human brain, but he was still able to do all that.

Liron: Okay, so we're obviously on the same page and you've explicitly said that you don't even think AIs have to keep humans around for an appreciable amount of time. It's not like, "oh, keep the humans around to run the world." You pretty much think, no, they'll just escape and they'll just run the world on their own.

Steven: I mean, that is what I think. I'm not sure how important that detail is. You can disagree with me on that and then tell a different story where the AI preempts any other equally powerful AI from being created and maneuvers its way into lots of political and other power over the world, over the course of however many years or decades it takes to build more and more robots. And then once it can run the world on its own, then it wipes out humans.

And that's not a more reassuring story, but it's not a story that requires this MacGyver thing where the AI has one robot body and that one robot body builds a second robot body and gathers more solar cells.

And I do think that MacGyver story is something that could happen, but it's not necessary to get to doom.

LLM Limitations and Capabilities

Liron: Let's go back to that question of what can't LLMs fundamentally do. People love talking about the subject because it's very interesting, right? We have these LLMs, everybody has been at least a little bit surprised about how many things they can do, and yet so many people are making so many claims about what they can't do.

And a lot of those claims seem to get disproven. Like there's the famous Scott Alexander versus Gary Marcus and other skeptics, where the skeptics are saying, "look, these images, they're hallucinated. You're never gonna have a horse riding a person, it's always going to draw a person riding a horse, even if you prompt it to do a horse riding a person."

But then the next version comes out and it successfully draws a horse piggybacking on a person. So there's all these back and forth debates about what these LLMs can't do and the goalposts seem to keep moving and you've kind of surprised me by taking a stand over what LLMs can't do.

And you used the example that you would be shocked - it would totally break your model - if you could just have an LLM-based system that could create a successful, profitable company.

Steven: Well, I mean there's these toy examples on the internet where the LLM is going like, "I'm an LLM" and then Marc Andreessen is like, "Hey, take my $10 billion" or something. And you could call that a business, but it's kind of a silly thing.

If we're talking about a robust business - an Amazon-style business where they build things and they come up with new business models and they actually execute over an extended period of time, and they have management and they have employees - that kind of thing.

I think Jeff Bezos is still head and shoulders above LLMs at being able to do that.

Liron: Now, when we use metrics like novelty and creativity, this is a very popular choice of metrics of people trying to argue that it never strays far from its template. So it can never be novel, it can only remix. Do you think that's a productive metric, or do you think we have to find a different metric?

Steven: So there's a thing that humans can do. If I say, imagine a purple colander with a bird on top, falling out of an airplane. And you can just immediately create that image in your head and I can ask you follow up questions about it, and you'll be able to answer those questions very skillfully, leveraging all of your understanding of aerodynamics and airplanes and birds.

Is that a creative thing that you just did? Maybe. And it's definitely this kind of thing that LLMs can do. It involves remixing existing ideas in a way that has never happened before, so far as I know.

There's the thing where LLMs fail in really quite baffling ways sometimes, and it's hard to know exactly what to make of LLMs saying things that are stupid and in a very inhuman and baffling way.

I'm sort of more likely to emphasize what I said before about the context window. The more that you pile stuff into the context window, the worse the LLM gets and the stuff that's not in the context window has to be stuff that humans already created on the internet.

Liron: But when you put it that way, it suggests that maybe there's ways to work around it being like, "okay, well what if I just use two prompts, right? Where it's like half the stuff in each prompt and then merge them."

Steven: Yeah. I'm sure people will try whatever possibilities they can think of, and I just think that this is a problem with the LLM architecture.

Liron: Another way for me to ask the question about where you think the LLM barrier is - what do you think is the least impressive thing that AI almost certainly still can't do in two years?

Steven: I'm reluctant to make a strong stance on this, first of all, because I'm not an LLM power user myself. And second of all, I think - I'm not asking people to believe me when I say that LLMs will hit a wall. I'm sort of making a research bet, but I'm trying not to screw people over in the world where I'm wrong about LLMs hitting a wall.

And third of all, so for humans, as a human works on a problem over the course of days, weeks, months, years, they pile more and more idiosyncratic domain knowledge about this particular factory - knowledge that's not on the internet - into their brain.

And LLMs are, I think, unable to do that kind of thing. But what they can do with the context window has been incrementally increasing. I don't want to make a strong stand about exactly where the ceiling is gonna be in two years.

Liron: Yeah. I mean, I don't exactly disagree with you that LLMs are limited, right? I can see it too. The most surprising thing to me is that you'll have these state-of-the-art systems that are giving you such insightful advice about medical conditions or analyzing human relations, like really deeply insightful stuff, like better than humans, no question.

Then you'll ask them a very simple question, like even some simple arithmetic, and they'll confidently give you an obviously wrong answer. That's still shocking even to this day. So I agree. There's something up. There's something to be explained here.

Maybe there is some fundamental limitation that has to do with robustness. It's just that the LLM model is so flexible. Anytime you notice something wrong, a lot of times you can patch it by being like, "okay, well just point them at this giant terabyte of data and then the problem will be fixed."

And really, you just have to point them at a ton of data, and that was pretty easy. Now we're talking about a future where there will be a few humans pioneering something, and then the AI will take over rapidly. And you don't think we're that close to automating those humans - to just patching our way to not even needing the humans.

Steven: So if the AI has been running this factory for six months, there isn't any pile of data to point an AI at for all of the details about this particular factory - details that don't exist on the internet, like this particular way that this particular machine works with this particular set of employees.

That's sort of a humdrum example, but of course people are more interested in things like inventing a new scientific paradigm. Like, I don't know - take some random physics concept like quantum capacitance, and say the LLM doesn't "understand" all the nuances of quantum capacitance.

How are you gonna create more data for the LLM? The way to create data is to understand quantum capacitance. So that's sort of this chicken and egg problem where I don't know where you're supposed to get that data from.

I'm sure that AI companies, as we speak, are trying to solve that puzzle, and I wish them failure, but we'll see how it goes.

Inner vs Outer Alignment

Liron: I want to touch on outer versus inner alignment, because you've explained that well and it was actually a little bit different than how I've been thinking about it.

So I think we both agree that inner misalignment means goal misgeneralization, meaning you're getting these signals - you're getting upvotes and downvotes or some other kind of reward signals - and you think that you know how to optimize the signals you're going to get.

But the model you've built of how to optimize them turned out to be very different from the model of the generator of the signals.

Steven: Yeah, or at least what was intended. I should clarify that outer and inner misalignment come in different variations depending on the AI paradigm. I think this is a useful way to think about it for actor-critic reinforcement learning.

So inner misalignment would be when the critic is - so remember, the critic is the thing where, if I think "I'm gonna close the window," the critic is guessing, based on life experience, whether that's a good idea or a bad idea.

So the critic is taking this sort of rich semantic plan involving the world model, and guessing whether it's gonna be good or bad according to the reward function, which is the sort of dumber thing that doesn't understand everything about the world.

So you can get mismatches where the critic thinks that something is good or bad in a way that does not match how the reward function would be rewarding that plan or punishing that plan.

I like to give examples like, if I did cocaine, then it would be very rewarding. But I don't do cocaine 'cause I don't want to get addicted. So that's kind of a form of inner misalignment in that the reward function would say that cocaine is a good idea, but the critic is saying that cocaine is a bad idea, so I don't do it.
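
As a rough illustration of the mismatch Byrnes is describing, here is a hypothetical toy, not his actual proposal or any real system: the gap between what a hard-coded reward function would pay out and what the learned critic estimates, in the actor-critic picture.

```python
# Toy illustration (hypothetical, made-up numbers) of inner misalignment in
# actor-critic RL: the critic's learned value for a plan diverges from what the
# reward function would actually pay out, and the agent acts on the critic.

REWARD_FUNCTION = {          # the simple, hard-coded scorer
    "go for a run": 2.0,
    "eat a salad": 1.0,
    "take cocaine": 10.0,    # the reward function would rate this highly...
}

CRITIC_ESTIMATE = {          # the critic's guesses, learned from life experience
    "go for a run": 2.5,
    "eat a salad": 0.8,
    "take cocaine": -5.0,    # ...but the critic has learned "addiction ruins everything"
}

def choose_plan(plans):
    """The actor proposes plans; the critic's estimate picks the winner.
    The reward function only weighs in after a plan is actually executed."""
    return max(plans, key=lambda p: CRITIC_ESTIMATE[p])

chosen = choose_plan(list(CRITIC_ESTIMATE))
print(chosen)  # "go for a run": the plan the reward function would pay most for
               # is never tried, so the divergent estimate never gets corrected.
```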

Liron: So if I understand correctly, inner misalignment can't happen during training. It's the difference between how the AI is going to behave in training versus production. Because in production it thinks that it's going to get a high reward for doing cocaine or whatever, it thinks that there's some actions like, "oh, this is the best action I'm going to get such a high reward."

And if it was back in training, it would've actually gotten a low reward, but it misgeneralized.

Steven: It's a little different for me 'cause I'm thinking about these sort of continuous learning paradigms where you never stop doing the RL. I'm doing the RL in my own brain right now. I've been doing it for my whole life.

So if you have an AI like that, then it's true that as soon as you do something, it's no longer out of distribution and it's no longer a generalization thing. But inner misalignment can still be a big issue if, for example, I'm not taking cocaine because I don't want to get addicted, that's a form of inner misalignment, but it's not gonna get corrected because I never do take the cocaine.

And a more scary example would be if I go in to get surgery to change my reward function or something, or I start taking some drug like Ozempic or something that changes my reward function. That would not get corrected. But by the time it would get corrected, it's too late.

Liron: Now when you use the cocaine example, I feel like it's misleading because we generally think of cocaine being a bad outcome to get addicted to cocaine. But you're saying that your emotions actually intend to reward you for doing cocaine and you're missing out on the prediction of that signal.

Steven: Yeah. This is an important point. So we're usually thinking about it from our own perspective and from our own perspective, inner misalignment is a good thing because we're painting the target around the arrow. The things that we want are the things that we want. And if our reward function wants something different, then so much the worse for our reward function.

But if we put on the shoes of being an AGI programmer and imagining that we have put something in the reward function that we're expecting to steer the motivations of our AI, and the AI says, "Hey, you know what I'm gonna do? I'm gonna edit my own reward function. I'm gonna create a subagent that has a different reward function, or whatever else." Some irreversible action.

That could be problematic because it makes it hard to reason about what the AI's gonna do, and maybe it's gonna be doing things that we didn't want it to do.

Liron: Totally. Okay, so moving on to outer alignment. You call that specification gaming and reward hacking. So the idea there is that the agent maybe knows perfectly well - it is good at predicting what reward it's going to get. It's just that it predicts that it could do all these counterintuitive moves like taking over the world or killing the operator or whatever.

It's correctly predicting that it can do these undesirable moves and then maximize its reward from the reward function.

Steven: Yeah, that's right. So this is a classic thing in the reinforcement learning literature. People have kind of forgotten about it in the LLM era, but if you reward for getting a high score in a buggy game, then the AI often learns to exploit the bugs.

And maybe you didn't intend that. And if you go into the real world with all of its complexities, there's often an unintended way to get a high score - if the reward is for the human being happy, then maybe it's gonna lock you in prison on a heroin drip.
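
To ground the reward-hacking point, here is a toy sketch with hypothetical, made-up numbers: the outer reward scores a measurable proxy, and a capable enough optimizer finds that manipulating the proxy beats doing the intended task.

```python
# Toy illustration (hypothetical numbers) of specification gaming / reward hacking:
# the outer reward scores a proxy measurement, and the highest-scoring strategy is
# not the one the designers intended.

def proxy_reward(measured_happiness: float) -> float:
    """Outer reward: 'make the human happy', as seen through a sensor reading."""
    return measured_happiness

strategies = {
    # strategy:                     (sensor reading, what we actually wanted)
    "help the human thrive":        (7.0,   7.0),
    "heroin drip in a locked ward": (10.0, -100.0),
}

best = max(strategies, key=lambda s: proxy_reward(strategies[s][0]))
print(best)  # "heroin drip in a locked ward" - highest proxy reward, disastrous outcome
```

The difficulty is that any legible proxy you can write down in a few lines invites the same kind of exploit once the optimizer pushing on it is strong enough.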

Liron: Now, what you're calling inner and outer misalignment - I think that's a really good distinction. That makes sense. But I think I've been lumping them both together into inner misalignment, because I think there's yet another layer of outer alignment, which is whether we as the human race even know what to write down as the reward function.

Steven: Yeah, I guess I kind of lumped that in with other parts of the safe and beneficial AGI problem, which I'm definitely very interested in, and it all has to tie into a coherent story.

It's easy to write a book that says this is what we want AGI to do. But that's no good if we don't actually know how to write code for that book. And there's sort of constraints that all connect to each other.

If there's only so many things that we can program an AGI to do, then we have to work around that when we're trying to brainstorm what we actually want the AGI to be doing in the first place.

AI Policy and Pause AI Discussion

Liron: Let's talk about AI policy. I'm a member of Pause AI. I just think we should probably pause AI, try to keep it at the current level, try to do intelligence augmentation, or at least do more research on current AI.

I'm not thrilled with this proposal. I'm not excited. I wish that I could just see more cool tech coming out, but I just think it's kind of crazy to go forward at this point.

Now you've written that you're actually not that bullish on Pause AI as a proposal, even though you have a 90% P(Doom). So explain yourself.

Steven: Yeah, I mean, I think that LLMs are gonna plateau before they get to a very powerful and scary and dangerous AI. And it's gonna be some future paradigm, and some future paradigm that requires a lot less compute.

So I think that the central bottleneck between us and the scary AI is people playing around with toy models and publishing on arXiv and GitHub. And I don't think that the sort of Pause AI actions so far are gonna be effective at dissuading those researchers.

I guess I should first say that I think on the margin it would be good to buy time. It would be very good to buy time before super intelligence because I think we are not ready for super intelligence, but we're making incremental progress.

I don't think that buying time is the only thing or the most important thing. I would just as soon have more people do good research in this area. But it is a thing that I think would be good on the margin, on the current margin. (If AI was 10,000 years away, then I'd be like, "oh, well, on that margin I want AI to come sooner," but on the current margin, I would want it to be later.)

So in that sense, I'm sort of ideologically aligned with Pause AI's goals, but I think that actually pausing AI involves pausing this kind of tinkering around in research that seems very hard and impractical to actually slow down.

Liron: In other words, you're on the same page as me, that it would be great to choke off AI development if we could find a choke point. But the problem is that the only choke points people are talking about is training runs, right?

If it's a large training run or an expensive training run, then it's illegal to do it. That's what everybody's talking about. That's what I actually support.

But I think you're correctly pointing out that that's not that great of a choke point, because even just a random guy in his lab - an independent researcher such as yourself; hmm, what are you up to, Steve? - could just be getting us to that next breakthrough, and the choke point just doesn't do anything.

In fact, the choke point might even redirect some effort toward the more dangerous stuff.

Steven: Yeah, potentially. Yeah. That's basically where I'm coming from.

Liron: So you even had an argument that Pause AI is net negative if it redirects effort.

Steven: I think that kind of effect rounds to zero. And mostly I just sort of feel neutral about it.

Liron: Are there any other policies that you wish people would adopt?

Steven: I don't really have any strong takes on that. I usually leave that to the policy people.

Liron: Yeah. Honestly, same here. The point of this show is basically just fear mongering, just to make people realize, "Hey guys, we're doomed." And then people ask, "so what should we do?" I'm like, "I don't know if I've gotten that far."

But I just think most people are walking around not even knowing that we're doomed. And I feel like I can do my part by at least opening their eyes to the doom.

Steven: Yeah. You're doing good work.

Liron: Thanks. Do you think that fear mongering is good?

Steven: I would like people to have accurate beliefs about the future of AI and they can feel how they feel about those accurate beliefs. But I think that the accurate beliefs would involve that we are on track to make AI that wants to kill its programmers and kill its users and kill everybody else, and that it's a hard technical problem and that it's not the kind of thing we can just iterate our way out of when it's already happening, for lots of deep reasons about safety being hard to test and irreversible problems.

I think that we are inviting this new intelligent species onto the planet, and nobody's asking permission, they're just doing it and it's gonna be this species of psychopathic AIs, and this is really bad.

And I do think that it would be good if people knew about that, insofar as that's true.

Liron: Yep. Okay. Well then riddle me this, you know, you're talking in a calm voice, you're having a good time getting interviewed, but do you feel terrified?

Steven: I think my mood is not mostly dependent on that. I'm still more upset by my kid cutting his knee.

Every now and then I'll shed a tear about the coming doom. And how hard these things are to solve, and that I'm not making more progress. And I'll feel frustrated and I'll feel other things. But mostly on a day-to-day basis, I'm just trying to do the best I can, and live my life.

Liron: Yeah, totally. I was jogging on the road the other day - there's no sidewalk and the cars were passing pretty close - and I'm thinking, in terms of how intuitively scared I should be right now, I should be way more scared than if I were handling a snake, given how much bravery and recklessness I'm showing by running close to these cars. But my intuition hasn't really caught up.

I don't really have an intuition to be like, "oh my God, running close to a car is so scary." That's like a tiger baring its jaws at you. My hypothalamus and brainstem just haven't gotten the message yet, 'cause we haven't had enough generations to evolve that.

Steven: Yeah, exactly.

Liron: So similarly it's like, okay, we're all doomed from AI, but I just don't have that firmware to be scared.

Steven: Yeah. A spider crawling up your leg is gonna provoke a stronger reaction than me announcing on a podcast that your most likely cause of death is AI.

Lightning Round

Liron: Okay. You ready for the lightning round?

Steven: Sure.

Liron: Alright, so offense, defense, balance, where do you stand? Which side's gonna win?

Steven: I tend to be very strongly on the offense side.

Liron: I think so too. And I don't know if there's one simple answer, but what about this? You're a physicist. Great. So I'm gonna ask you the physics aspect, the low level physics aspect.

Don't you think that the ability to dump energy into any cube of space is a huge advantage for the attacker?

Steven: Yeah. It's easier to destroy than to create. There's also a thing where power begets power if you're sufficiently not norm-following.

In the zombie apocalypse movies, you have one zombie that makes two zombies, that makes three zombies. I'm expecting AI to be running one inference copy per chip as opposed to per data center.

So as soon as the AI hacks into another chip, it doubles its strength, and then when it gets two more chips, it doubles its strength again, and so on. So that's another sort of way that the world is unstable.

Liron: So when we ask whether offense or defense wins, maybe that's equivalent to asking whether the best defense is a good offense, right? Because it kind of feels like the answer is yes, because if you have really good defenses, but your borders are small, how good are your defenses really going to be?

Isn't it going to be super tempting strategically to increase your borders? And in that case, doesn't that just mean offense is more powerful than defense?

Steven: Yeah, I mean, there's sort of mutually assured destruction and stuff. I don't know. There's a whole discourse on it. That's the short answer, as opposed to going on at length about all the back and forth.

Liron: I think you and I can both agree that intelligence is going to fly beyond human levels so quickly that at the end of the day it doesn't really matter what these physical properties are in terms of who has a fundamental advantage because the AI is just going to plow its way toward engineering a solution either way.

Steven: Yeah. I'm expecting one AI that has the ability to take over the world and run the world by itself if it wants to, and then the hope is that its motivations are gonna be good.

Liron: What is your P(simulation)?

Steven: I think that we live in a real world, it's not a world that's simulated by some other intelligent entity. But I don't know how to assign a probability to that.

Liron: Come on you gotta give me a ballpark here.

Steven: If I were to pull a number out of my ass, I guess I would say 1% or something.

Liron: Wow. Only one. Okay. I'm 50. Wow. Okay. We should have had the whole debate about that.

And actually, my P(Doom) is only 50%. I'm 50% P(Doom). 50% P(simulation). You're 90% P(Doom). 1% P(simulation). That's funny.

Steven: We got lots of disagreements. That's fine.

Liron: Alright, last lightning round question. Thoughts on AI mass unemployment or as some call it, gradual disempowerment.

Steven: I endorse the Eliezer quote where he said that worrying about the impact of machine superintelligence on the employment market is like worrying about the impact of the moon crashing into the Earth on US-China trade relations. Yes, there would be an impact, but you're kind of missing the point.

I think ASI is gonna be this new intelligent, radical species on our planet that's going to probably wipe out all humans.

And if not, then losing our job is the last thing that I think people should have on their mind.

Liron: Yeah. We both agree that that's a good problem to have, 'cause it means we haven't all died.

Steven: Yeah.

Liron: The only way that I've changed my mind though, is I think that even if we end up having this good problem to have, it's actually not an easy fix. It's not like, "oh, just give everybody universal basic income and give them hobbies."

It's not an easy fix because I actually just think that once we're fortunate enough to be dealing with gradual disempowerment, we're still going to get disempowered.

Steven: If you have an AI singleton, then you're not necessarily subject to the types of competitive races-to-the-bottom that I understand were the focus of the Gradual Disempowerment paper.

And I'm not thrilled about the idea of an AI singleton. I think it's kind of terrifying, but for better or worse, that is actually what I'm expecting right now.

Liron: Yeah, it's interesting to observe that your good scenario - which, again, you're not super confident about, but at least you're giving it a shot - is a scenario where we have these powerful autonomous AIs that have internalized human social instincts and are basically good, but are also quite autonomous and independent.

And in that scenario they can just help give us a share of the power.

Steven: Yeah, keep us in metaphorical zoos.

Liron: But a better version of a zoo. You have enrichment.

Steven: It's only mildly dystopian. As opposed to...

Liron: That could be another doom scenario - gradual zoo animalness, where it's like, "okay, you're getting everything you want, but you just have to live with the idea that you're ultimately somebody else's pet."

Steven: I mean, it's just hard to imagine that humans are gonna ultimately wind up in control of the fate of the universe, when it would be like putting a 5-year-old in charge of the Fed.

There are people who have much more experience, who are more thoughtful and more intelligent and more competent in every imaginable way by a giant factor. So why would you put a 5-year-old in charge of the Fed?

So if AI doesn't kill us all, then I imagine we're gonna be in this position where AI is gonna be in a much better position to make decisions.

Closing Thoughts

Liron: Viewers, this has clearly been a very long and wide-ranging conversation. Steve is a real well of knowledge. If you want more of those knowledge drops, just go to his website - it's linked in the show notes. There are so many good articles. We're just scratching the surface in this conversation.

And I think you can also see that Steve is knowledgeable about so many different things that when he makes some progress in his research, or when the AI world is making progress and more new things are coming up, he's definitely going to be somebody I want to turn to, to ask more questions and help make sense of this kind of stuff.

And then finally, the thing to point out about Steve is that he's actually doing a direct shot at the AI alignment problem in a way that's not obviously wrong and hopeless. He's actually taking a serious shot and he may actually be one of the top people in the world who could actually help humanity solve this problem.

So Steve, thanks for doing that and thank you for coming on Doom Debates.

Steven: Thanks so much for the invitation. I had a great time.


Doom Debates’s Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate. Previous guests include Carl Feynman, Geoffrey Miller, Robin Hanson, Scott Sumner, Gary Marcus, Jim Babcock, and David Duvenaud.



