LessWrong, November 14, 2024
Thoughts after the Wolfram and Yudkowsky discussion

 

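The author reflects on the discussion between Wolfram and Yudkowsky about AI risk. The author found the early part of the discussion frustrating, but over time the two found a common language and began to examine the potential dangers of AI. Yudkowsky stressed the risk that AI could cause human extinction, while Wolfram focused more on how AI goals form and on the optimization process. The author also reflects on their own understanding of AI risk, arguing that the way AI is currently developed could produce unexpected, lethal behavior, and hopes to dig further into the nature of AI risk.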

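🤔 **Wolfram and Yudkowsky discuss AI risk, with Yudkowsky stressing the risk that AI could cause human extinction.** The two disagreed during the discussion: Yudkowsky tried to emphasize the potential dangers of AI, while Wolfram focused more on how AI goals form and on the optimization process.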

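🤖 **Wolfram holds that small accidental quirks in machine learning algorithms can lead to undesired behavior later on.** He tied the potential danger of AI to goal-directedness and optimization, and tried to understand Yudkowsky's point in terms of the space of goals.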

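🚧 **The author argues that the current approach to AI development (random exploration of the space of linear algebra) could produce AI whose goals diverge from human goals, creating potentially lethal risk.** The author holds that even if AI algorithms were written by hand, bugs could still skew the AI's goals.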

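🎯 **The author believes the potential danger of AI may stem from human flaws in specifying AI goals rather than from goals being inherently dangerous.** The author holds that an AI might pursue its goals in ways that harm humans, for example by collecting all available negentropy or pursuing manufacturing capability.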

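🤔 **The author wants to dig further into the nature of AI risk, including instrumental convergence and the possibility that simple goals lead to lethal outcomes.** The author holds that understanding these questions would not change the fact that AI could cause human extinction, but still wants to understand the reasons behind it.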

Published on November 14, 2024 1:43 AM GMT

I recently listened to the discussion between Wolfram and Yudkowsky about AI risk. In some ways this conversation was tailor-made for me, so I'm going to write some things about it and try to get it out in one day instead of letting it sit in my drafts for 3 weeks as I tend to do. Wolfram has lately obsessed over fundamental physics, which is a special interest of mine. Yudkowsky is one of the people thinking most carefully about powerful AI, which I think will kill us all, and I’d like to firm up that intuition. Throw them on a podcast for a few hours, and you have my attention.

That said, for the first hour I was just incredibly frustrated. Wolfram kept running down rabbit holes that were basically “aha! You haven’t thought about [thing Yud wrote ten thousand words on in 2008]!” But a miracle happens somewhere in the second hour and Wolfram starts asking actually relevant questions! His framework of small accidental quirks in machine learning algorithms leading to undesired behavior later points at a real issue. It was kind of a joy listening to two smart people trying to get on the same page. Wolfram starts out bogged down in minutiae about what 'wanting' is and whether it constitutes anthropomorphism, but finally finds a more abstract space about steering toward goals and tries to see Yudkowsky’s point in terms of the relative dangers of sections of the space of goals under sufficient optimization. The abstraction was unfortunate in some ways, because I was interested in some of the minutiae once they were both nearly talking about the same thing, but also, if Wolfram kept running down rabbit holes like “actually quarks have different masses at different energy scales” when Yudkowsky said something like “the universe runs on quarks everywhere all at once no matter what we think the laws of physics are,” then they were never going to get to any of the actual arguments. That said, I don't see how Wolfram got to anything close to the actual point at all, and maybe the rabbit holes were necessary to get there.

My impression was that Yudkowsky was frustrated that he couldn’t get Wolfram to say, “actually everyone dying is bad and we should figure out whether that happens from our point of view.” There was an interesting place where something like this played out during one of Wolfram’s physics detours. He said something I agree with, which is that the concept of space is largely one which we construct, and even changing our perception by the small adjustment of “think a million times faster” could break that construct. He argued that an AI might have a conception of physics which is totally alien to us and also valid. However, he then said it would still look to us like it was following our physics, without making the (obvious to me) connection that we could just consider it in our reference frame if we want to know whether it kills us. This was emblematic of several rabbit holes. Yudkowsky would say something like “AI will do bad things” and Wolfram would respond with something like “well, what is 'bad' really?” It would have been, in my view, entirely legitimate to throw out disinterested empiricism and just say, from our point of view, we don’t want to all die, so let’s figure out whether that happens. We might mess up the fine details of the subjective experience of the AI or what its source code is aiming for, but we can just draw a circle around things that from our point of view steer the universe to certain configurations and ask whether we’ll like those configurations.

I was frustrated by how long they spent finding a framework they could both work in. At the risk of making a parody of myself, part of me wished that Yudkowsky had chosen to talk to someone who had read the sequences. Aside from the selection issues inherent to only arguing with people who have already read a bunch of Yudkowsky, I don’t think it would help anyway. This conversation was in some ways less frustrating to me than the one Yudkowsky had with Ngo a few years ago, and Ngo has steeped himself in capital-R Bay Area Rationalism. As a particular example, it seemed to me like Ngo thought you could train an AI to make predictions about the world and you would be free to use that prediction to do things in the world, because you just asked the AI to make a prediction instead of doing anything. I don't see how what he was saying wasn't isomorphic to saying that you can stop someone from ever making bad things happen by letting them tell you to do things and you do it instead of them doing it. Maybe this was a deficiency of security mindset, maybe it was intuition, based on experience, about the type of AI that would arise from current research trends, or who knows, but I kept thinking to myself that Ngo wasn’t thinking outside of the box enough when he argued against doom. In that sense, Wolfram was more interesting to listen to, because he actually chased down the idea of where bizarre goals might come from in gradient descent, abstracted that out to “AI will likely have at least one subgoal that wasn't really intended from the space of goals,” and then considered the question of whether an arbitrary goal is, on average, lethal. His intuition seemed to be that if you fill every goal in goal space you end up with something like the set of every possible mollusk shell, each of which ends up serving some story in the environment.
He didn’t have an intuition for goal+smart=omnicide, and he also got too hung up on what "goal" and "smart" actually "meant" rather than just running with the thing that Yudkowsky, it seems to me, is clearly aiming at, even if Yudkowsky uses anthropomorphism to point at that thing. At least he ended up with something that seemed to directionally resemble Yudkowsky’s actual concerns, even if it wasn’t what he wanted to talk about for some reason. Also, Wolfram gets to the end and says "hey man, you should firm up your back-of-envelope calculations because we don't have shared intuition" when the thing Yudkowsky had been trying to do with him for the past three hours was firm up those intuitions.

I keep listening to Yudkowsky argue with people about AI ruin because I have intuitions for why it is hard to create AI that won't kill us, but I think that Yudkowsky thinks it's even harder, and I don't actually know why. I get that something that is smart and wants something a lot will tend to get the thing even if killing me is a consequence. But my intuition says that AI will have goals which lead it to kill me primarily because humans are bad at making AI to the specifications that they intended rather than because goals are inherently dangerous. The current regime of AI development, where we just kind of try random walks through the space of linear algebra until we get algorithms that do what we want, seems to obviously be a good way to make something sort of aligned with us, with wild edge cases that will kill us once it generalizes. Even if we were actually creating our algorithms by hand, I could just look out at the world of code full of bugs and easily imagine a bug that only shows up as a misaligned goal in the AI once it’s deployed out in the world and too smart to stop. I get the feeling that I’m still missing the point somehow and that Yudkowsky would say we still have a big chance of doom if our algorithms were created by hand with programmers whose algorithms always did exactly what they intended even when combined with their other algorithms. I'm guessing that there is a counterfactual problem set that I could complete that would help me truly understand why most perfect algorithms that recreate a strawberry on the molecular level destroy the planet as well. Yudkowsky has said that he’s not even sure it would be aligned if you took his brain and ran it many times faster with more memory.
I’ve read enough Dath Ilan fiction to guess that he’s (at least) worried about something in the class of “human brains have some exploitable vulnerability that leads to occasional optical illusions in the current environment but leads to omnicide out of distribution,” but I’m not sure that’s right because I haven’t yet seen someone ask him that question. People keep asking him to refute terribly clever solutions which he already wrote about not working in 2007 rather than actually nailing down why he's worried.

If I was going to try to work out for myself why (or if) even humans who make what they intend to make get AI wrong on their first try instead of wistfully hoping Yudkowsky explains himself better some day, I would probably follow two threads: One is instrumental convergence, which leads anything going hard enough to move toward collecting all available negentropy (or money or power depending on the limits of the game, hopefully I don't have to explain this one here). I don't actually get why almost every goal will make an AI go hard enough, but I can imagine an AI being told to build as much manufacturing capability as possible going hard enough, and that's an obvious place to point an AI, so I guess the world is already doomed. The second is to start with simple goals like paperclips or whatever and build some argument that generalizes from discrete physical goals which are obviously lethal if you go hard enough to complex goals like "design, but do not implement, a safe fusion reactor" that it seems obvious to point an AI at. I suppose it doesn’t matter if I figure this out because I’m already convinced AI will kill us if we keep doing what we’re doing, so why chase down edge cases where we die anyway pursuing paths that humanity doesn’t seem to possess enough dignity to pursue? Somehow I find myself wanting to know anyway, and I don’t yet have the feeling of truly understanding.


