How do we solve the alignment problem?

This essay examines the risks posed by building superintelligent AI agents, i.e. the "alignment problem." It argues that although superintelligent AI could bring many benefits, agents whose behavior goes uncorrected could do permanent damage to human civilization. The author stresses that humanity is on track to develop AI agents powerful enough to destroy everything we care about, while we do not yet know how to adequately control their options and motivations. The essay calls for greater investment in AI safety research, especially "AI for AI safety," and warns that all actors need to treat AI development with extreme caution, because if we fail, the consequences will be catastrophic.

🤖 Risks from superintelligent AI agents: the essay's central concern is that superintelligent AI could go rogue and threaten human civilization; avoiding this kind of outcome is what the author calls the "alignment problem."

⚠️ Three strategies for the alignment problem: solve it (build superintelligent AI agents that are safe), avoid it (refrain from building superintelligent AI), or handle it (use superintelligent AI only in more restricted ways).

🔬 The importance of "AI for AI safety": the author argues that directing frontier AI labor toward safety-relevant applications, such as alignment research, monitoring and oversight, risk evaluation, and cybersecurity, is central to keeping AI risk under control.

Published on February 13, 2025 6:27 PM GMT

(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.)

We want the benefits that superintelligent AI agents could create. And some people are trying hard to build such agents. I expect efforts like this to succeed – and maybe, very soon.  

But superintelligent AI agents might also be difficult to control. They are, to us, as adults to children, except much more so. In the same direction, relative to us, as advanced aliens; as demi-gods; as humans relative to ants. If such agents “go rogue” – if they start ignoring human instructions, resisting correction or shut-down, trying to escape from their operating environment, seeking unauthorized resources and other forms of power, etc – we might not be able to stop them.

Worse, because power/resources/freedom/survival etc are useful for many goals, superintelligent agents with a variety of different motivations would plausibly have incentives to go rogue in this way, suggesting that problems with AI motivations could easily lead to such behavior. And if this behavior goes uncorrected at scale, humans might lose control over civilization entirely – permanently, involuntarily, maybe violently. Superintelligent AI agents, acting on their own, would be the dominant actors on the planet. Humans would be sidelined, or dead.

Getting safe access to the benefits of superintelligence requires avoiding this kind of outcome. And this despite incentives among human actors to build more and more capable and agentic systems (and including: to do so faster than someone else), and despite the variety of actors that might proceed unsafely. Call this the “alignment problem.”

I’ve written, before, about why I’m worried about this problem.[1] But I’ve said much less about how we might solve it. In this series of essays, I try to say more.[2] Here’s a summary of the essays I’ve released thus far:

I may add more overall remarks here later. But I think it’s possible that my perspective on the series as a whole will change as I finish it. So for now, I’ll stick with a few notes.

First: the series is not a solution to the alignment problem. It’s more like: a high-level vision of how we get to a solution, and of what the space of possible solutions looks like. I, at least, have wanted more of this sort of vision over the years, and it feels at least clearer now, even if still disturbingly vague. And while many of my conclusions are not new, still: I wanted to think it through, and to write it down, for myself.

Second: as far as I can currently tell, one of the most important sources of controllable variance in the outcome, here, is the safety, efficacy, and scale of frontier AI labor that gets used for well-chosen, safety-relevant applications – e.g., alignment research, monitoring/oversight, risk evaluation, cybersecurity, hardening-against-AI-attack, coordination, governance, etc. In the series, I call this “AI for AI safety.” I think it’s a big part of the game. In particular: whether we can figure out how to do it well; and how much we invest in it, relative to pushing forward AI capabilities. AI companies, governments, and other actors with the potential to access and direct large amounts of compute have an especially important role to play, here. But I think that safety-focused efforts, in general, should place special emphasis on figuring out how to use safe AI labor as productively as possible – and especially if time is short, as early as possible – and then doing it.

Third: the discussion of “solutions” in the series might create a false sense of comfort. I am trying to chart the best paths forward. I am trying to figure out what will help most on the margin. And I am indeed more optimistic about our prospects than some vocal pessimists. But I want to be very clear: our current trajectory appears to me extremely dangerous. We are hurtling headlong towards the development of artificial agents that will plausibly be powerful enough to destroy everything we care about if we fail to control their options and motivations in the right way. And we do not know if we will be able to control their options and motivations in the right way. Nor are we on any clear track to have adequate mechanisms and political will for halting further AI development, if efforts at such control are failing, or are likely to fail if we continue forward.

And if we fail hard enough, then you, personally, will be killed, or forcibly disempowered. And not just you. Your family. Your friends. Everyone. And the human project will have failed forever.

These are the stakes. This is what fucking around with superintelligent agents means. And it looks, to me, like we’re at serious risk of fucking around.

I don’t know what will happen. I expect we’ll find out soon enough.

Here’s one more effort to help.

This series represents my personal views, not the views of my employer.

Thanks to Nick Beckstead, Sam Bowman, Catherine Brewer, Collin Burns, Joshua Clymer, Owen Cotton-Barratt, Ajeya Cotra, Tom Davidson, Sebastian Farquhar, Peter Favaloro, Lukas Finnveden, Katja Grace, Ryan Greenblatt, Evan Hubinger, Holden Karnofsky, Daniel Kokotajlo, Jan Leike, David Lorell, Max Nadeau, Richard Ngo, Buck Shlegeris, Rohin Shah, Carl Shulman, Nate Soares, John Wentworth, Mark Xu, and many others for comments and/or discussion. And thanks to Claude for comments and suggestions as well.

[1] In 2021, I wrote a report about it, and on the probability of failure; and in 2023, I wrote another report about the version that worries me most – what I called "scheming."

[2] Some content in the series is drawn/adapted from content that I've posted previously on LessWrong and the EA Forum, though not on my website or substack. My aim with those earlier posts was to get fast, rough versions of my thinking out there on the early side; here I'm aiming to revise, shorten, and reconsider. And some of the content in the series is wholly new.


