How do we solve the alignment problem?

This essay examines the risks posed by building superintelligent AI agents, i.e. the "alignment problem." It argues that although superintelligent AI could bring many benefits, agents whose behavior goes uncorrected could do permanent damage to human civilization. The author stresses that humanity is on track to develop AI agents powerful enough to destroy everything we care about, while we do not yet know how to adequately control their options and motivations. The essay calls for greater investment in AI safety research, especially "AI for AI safety," and warns that all actors need to treat AI development with extreme caution, because if we fail, the consequences will be catastrophic.

🤖 Risks from superintelligent AI agents: the essay's central concern is that superintelligent AI could go rogue and threaten human civilization; avoiding this kind of outcome is what the author calls the "alignment problem."

⚠️ Three strategies for the alignment problem: solve it (build superintelligent AI agents that are safe), avoid it (refrain from building superintelligent AI), or handle it (use superintelligent AI only in more restricted ways).

🔬 The importance of "AI for AI safety": the author argues that directing frontier AI labor toward safety-relevant applications, such as alignment research, monitoring and oversight, risk evaluation, and cybersecurity, is central to keeping AI risk under control.

Published on February 13, 2025 6:27 PM GMT

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.)

We want the benefits that superintelligent AI agents could create. And some people are trying hard to build such agents. I expect efforts like this to succeed – and maybe, very soon.  

But superintelligent AI agents might also be difficult to control. They are, to us, as adults to children, except much more so. In the same direction, relative to us, as advanced aliens; as demi-gods; as humans relative to ants. If such agents “go rogue” – if they start ignoring human instructions, resisting correction or shut-down, trying to escape from their operating environment, seeking unauthorized resources and other forms of power, etc – we might not be able to stop them.

Worse, because power/resources/freedom/survival etc are useful for many goals, superintelligent agents with a variety of different motivations would plausibly have incentives to go rogue in this way, suggesting that problems with AI motivations could easily lead to such behavior. And if this behavior goes uncorrected at scale, humans might lose control over civilization entirely – permanently, involuntarily, maybe violently. Superintelligent AI agents, acting on their own, would be the dominant actors on the planet. Humans would be sidelined, or dead.

Getting safe access to the benefits of superintelligence requires avoiding this kind of outcome. And this despite incentives among human actors to build more and more capable and agentic systems (and including: to do so faster than someone else), and despite the variety of actors that might proceed unsafely. Call this the “alignment problem.”

I’ve written, before, about why I’m worried about this problem.[1] But I’ve said much less about how we might solve it. In this series of essays, I try to say more.[2] Here’s a summary of the essays I’ve released thus far:

I may add more overall remarks here later. But I think it’s possible that my perspective on the series as a whole will change as I finish it. So for now, I’ll stick with a few notes.

First: the series is not a solution to the alignment problem. It’s more like: a high-level vision of how we get to a solution, and of what the space of possible solutions looks like. I, at least, have wanted more of this sort of vision over the years, and it feels at least clearer now, even if still disturbingly vague. And while many of my conclusions are not new, still: I wanted to think it through, and to write it down, for myself.

Second: as far as I can currently tell, one of the most important sources of controllable variance in the outcome, here, is the safety, efficacy, and scale of frontier AI labor that gets used for well-chosen, safety-relevant applications – e.g., alignment research, monitoring/oversight, risk evaluation, cybersecurity, hardening-against-AI-attack, coordination, governance, etc. In the series, I call this “AI for AI safety.” I think it’s a big part of the game. In particular: whether we can figure out how to do it well; and how much we invest in it, relative to pushing forward AI capabilities. AI companies, governments, and other actors with the potential to access and direct large amounts of compute have an especially important role to play, here. But I think that safety-focused efforts, in general, should place special emphasis on figuring out how to use safe AI labor as productively as possible – and especially if time is short, as early as possible – and then doing it.

Third: the discussion of “solutions” in the series might create a false sense of comfort. I am trying to chart the best paths forward. I am trying to figure out what will help most on the margin. And I am indeed more optimistic about our prospects than some vocal pessimists. But I want to be very clear: our current trajectory appears to me extremely dangerous. We are hurtling headlong towards the development of artificial agents that will plausibly be powerful enough to destroy everything we care about if we fail to control their options and motivations in the right way. And we do not know if we will be able to control their options and motivations in the right way. Nor are we on any clear track to have adequate mechanisms and political will for halting further AI development, if efforts at such control are failing, or are likely to fail if we continue forward.

And if we fail hard enough, then you, personally, will be killed, or forcibly disempowered. And not just you. Your family. Your friends. Everyone. And the human project will have failed forever.

These are the stakes. This is what fucking around with superintelligent agents means. And it looks, to me, like we’re at serious risk of fucking around.

I don’t know what will happen. I expect we’ll find out soon enough.

Here’s one more effort to help.

This series represents my personal views, not the views of my employer.

Thanks to Nick Beckstead, Sam Bowman, Catherine Brewer, Collin Burns, Joshua Clymer, Owen Cotton-Barratt, Ajeya Cotra, Tom Davidson, Sebastian Farquhar, Peter Favaloro, Lukas Finnveden, Katja Grace, Ryan Greenblatt, Evan Hubinger, Holden Karnofsky, Daniel Kokotajlo, Jan Leike, David Lorell, Max Nadeau, Richard Ngo, Buck Shlegeris, Rohin Shah, Carl Shulman, Nate Soares, John Wentworth, Mark Xu, and many others for comments and/or discussion. And thanks to Claude for comments and suggestions as well.

[1] In 2021, I wrote a report about it, and on the probability of failure; and in 2023, I wrote another report about the version that worries me most – what I called "scheming."

[2] Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.


