Finding the Wisdom to Build Safe AI

As AI technology advances rapidly, we are approaching the point of creating superintelligent AI. If such AI is not aligned with human flourishing, it poses an existential threat to humanity and to all life on Earth. Because smarts and values are largely orthogonal and Goodhart effects are robust, we can neither rely on AI to decide on its own to be safe nor train it to stay safe. We stand a better chance of creating safe, aligned, superintelligent AI if we create "wise" AI: AI that knows how to take the right actions to achieve desired outcomes without falling into intellectual or optimization traps.

🤔 **Why wise AI safety researchers matter:** The author argues that to know how to build wise AI, one must first be wise oneself, or at least have a wiser person to check ideas against. Wisdom may not be sufficient for building wise, aligned AI, but under the author's assumptions it is necessary, just as it would be hard to develop a good decision theory for AI without being able to reason for oneself about how to maximize expected value in games.

🧘 **How to become wiser:** The author gained wisdom by practicing Zen Buddhism and by studying philosophy, psychology, mathematics, and game theory. He holds that the key to wisdom is finding the right heuristics: ones that generate good reasoning, which in turn generates good actions, which in turn lead to good outcomes.

🎯 **Wise heuristics:** The author considers humility and kindness the two most important heuristics. Humility means not deceiving yourself into believing your own unjustified beliefs: holding well-calibrated beliefs, not falling for the typical mind fallacy, and not seeing yourself as separate from reality. Kindness means a willingness to take the actions that most benefit yourself and others, rather than optimizing for something else, such as minimizing personal suffering or maximizing the chance of personal gain.

🤝 **How to train wise humans:** Using Zen as an example, the author explains how wisdom is cultivated. A student works closely with a teacher over many years: thousands of hours of meditation, instruction in meditation and ethics, and putting that instruction into practice so the teacher can observe and offer corrections. The aim of this training is to help the student wake up to the reality of life and free themselves from suffering ("enlightenment"), which requires cultivating "wisdom beyond wisdom". The teacher is said to transmit their wisdom mind-to-mind over the course of this training.

🔍 **How to tell whether a person is wise:** In Zen, it is generally the teacher who testifies to a student's wisdom, through the rituals of jukai and dharma transmission. In jukai, a student vows to behave ethically and follow the wisdom of the teachings, after a period of training in how to live those vows. Later, if the teacher deems the student sufficiently capable and wise, the student may receive dharma transmission and permission to teach, creating a lineage of wisdom certification stretching back hundreds of years.

🤖 **Could wise AI be trained the way wise humans are:** The author argues this depends on how similar AI are to us, and on whether training methods that work on humans also work on AI. Crucially, the answer hinges on whether AI can be trained without Goodharting on wisdom, and on whether we can tell Goodharted wisdom apart from the real thing.

🚀 **What training wise AI would mean:** If we can train AI to be wise, training could be automated, because an AI trained to be wise could in turn train other AIs to be wise. That would be a major advance, but it also carries great risks, and we must proceed carefully.

💡 **Looking ahead:** The author concludes that we must think seriously about aligning AI with human values to ensure that AI development does not threaten humanity. We should actively explore ways to cultivate wise AI and prepare for what comes next.

Published on July 4, 2024 7:04 PM GMT

We may soon build superintelligent AI. Such AI poses an existential threat to humanity, and all life on Earth, if it is not aligned with our flourishing. Aligning superintelligent AI is likely to be difficult because smarts and values are mostly orthogonal and because Goodhart effects are robust, so we can neither rely on AI to naturally decide to be safe on its own nor can we expect to train it to stay safe. We stand a better chance of creating safe, aligned, superintelligent AI if we create AI that is "wise", in the sense that it knows how to do the right things to achieve desired outcomes and doesn't fall into intellectual or optimization traps.

Unfortunately, I'm not sure how to create wise AI, because I'm not exactly sure what it is to be wise myself. My current, high-level plan for creating wise AI is to first get wiser myself, then help people working on AI safety to get wiser, and finally hope that wise AI safety researchers can create wise, aligned AI that is safe.

For close to a decade now I've been working on getting wiser, and in that time I've figured out a bit of what it is to be wise. I'm starting to work on helping others get wiser by writing a book that explains some useful epistemological insights I picked up while pursuing wisdom and trying to solve a subproblem in AI alignment, and I have vague plans for another book that will be more directly focused on the wisdom I've found. I thus far have limited ideas about how to create wise AI, but I'll share my current thinking anyway in the hope that it inspires thoughts for others.

Why would wise AI safety researchers matter?

My theory is that it would be hard for someone to know what's needed to build a wise AI without first being wise themself, or at least having a wiser person to check ideas against. Wisdom clearly isn't sufficient for knowing how to build wise and aligned AI, but it does seem necessary under my assumptions, in the same way that it would be hard to develop a good decision theory for AI if one could not reason for oneself how to maximize expected value in games.
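
For concreteness, here's a minimal sketch of what "maximizing expected value in games" means at its simplest: pick the action whose probability-weighted payoff is highest. The game and all the numbers are made up for illustration.

```python
# A minimal expected-value calculation over made-up lotteries.
lotteries = {
    "safe":   [(1.0, 40)],               # list of (probability, payoff)
    "risky":  [(0.5, 100), (0.5, 0)],
    "gamble": [(0.1, 300), (0.9, 10)],
}

def expected_value(outcomes):
    return sum(p * v for p, v in outcomes)

best = max(lotteries, key=lambda action: expected_value(lotteries[action]))
print({a: expected_value(o) for a, o in lotteries.items()}, "->", best)
# {'safe': 40.0, 'risky': 50.0, 'gamble': 39.0} -> risky
```

A decision theory for AI has to get at least this much right before it can handle the harder cases.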

How did I get wiser?

Mostly by practicing Zen Buddhism, but also by studying philosophy, psychology, mathematics, and game theory to help me think about how to build aligned AI.

I started practicing Zen in 2017. I picked Zen with much reluctance after trying many other things that didn't work for me, or worked for a while and then had nothing else to offer me. Things I tried included Less Wrong style rationality training, therapy, secular meditation, and various positive psychology practices. I even tried other forms of Buddhism, but Zen was the only tradition I felt at home with.

Consequently, my understanding of wisdom is biased by Zen, but I don't think Zen has a monopoly on wisdom, and other traditions might produce different but equally useful theories of wisdom from the one I discuss below.

What does it mean to be wise?

I roughly define wisdom as doing the right thing at the right time for the right reasons. This definition puts the word "right" through a strenuous workout, so let's break it down.

The "right thing" is doing that which causes outcomes that we like upon reflection. The "right time" is doing the right thing when it will have the desired impact. And the "right reasons" is having an accurate model of the world that correctly predicts the right thing and time.

How can the right reasons be known?

The straightforward method is to have true beliefs and use correct logic. Alas, we're constantly uncertain about what's true and, in real-world scenarios, the logic becomes uncomputable, so instead we often rely on heuristics that lead to good outcomes and avoid optimization traps. That we rely on heuristics doesn't mean facts and logic are not useful, only that heuristics are necessary to fill in their gaps in most cases. This need for heuristics suggests that the root of wisdom is finding the right heuristics that will generate good reasoning, which will in turn generate good actions that lead to good outcomes.
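
As a toy illustration of why heuristics must fill the gaps, the sketch below contrasts exhaustive search, which blows up exponentially with the horizon, against a constant-time rule of thumb. The game here is entirely hypothetical.

```python
import itertools

# Toy sequential choice problem: pick +1 or -1 at each of `depth` steps;
# the payoff is a deliberately opaque function of the whole sequence.
def payoff(actions):
    return sum(a * i for i, a in enumerate(actions, 1)) % 7

def exact_best(depth):
    # Correct logic: enumerate all 2**depth sequences. Fine at depth 12,
    # hopeless at real-world horizons.
    return max(itertools.product((1, -1), repeat=depth), key=payoff)

def heuristic(depth):
    # A cheap rule of thumb: always choose +1. Often good enough,
    # sometimes badly wrong -- the trade that heuristics make.
    return (1,) * depth

print(payoff(exact_best(12)), payoff(heuristic(12)))
```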

What are some wise heuristics?

The two that have been the most important for me are humility and kindness. By humility I mean not deceiving myself into believing my own unjustified beliefs, like by having well-calibrated beliefs, not falling for the typical mind fallacy, and not seeing myself as separate from reality. By kindness I mean a willingness to take the actions that most benefit myself and others rather than take the actions that optimize for something else, like minimizing the risk of personal suffering or maximizing the chance for personal gain.

I don't have a rigorous argument for these two heuristics, other than I tried a lot of different ones and these two have so far worked the best to help me find the right things, times, and reasons. Other heuristics might work better for others, or might be better for me, or might be better for everyone who adopts them. But, for what it's worth, humility and kindness are almost universally recommended by religions and other wisdom traditions, so I suspect that many others agree that these are two very useful wisdom heuristics, even if they are not the set of maximally useful ones.

How do we find wise heuristics?

Existing wisdom traditions, like religions, provide us with a large set of heuristics we can choose from. For any particular person looking to become wiser, the problem of finding wise heuristics is mostly one of experimentation. A person can adopt one or more heuristics, then see if those heuristics help them achieve their reflexively desired outcomes. If not, they can try again with different heuristics until they find ones that work well for them.
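
Read as an algorithm, that experimentation loop might look like the hedged sketch below; the heuristic pool, the outcome score, and the stopping threshold are all hypothetical stand-ins for what is really months or years of lived experience.

```python
import random

# Hypothetical pool of candidate heuristics drawn from wisdom traditions.
HEURISTIC_POOL = ["humility", "kindness", "patience", "meekness", "goodwill"]

def life_outcome_score(heuristics):
    # Stand-in for lived experience: how well did adopting these
    # heuristics serve one's reflectively desired outcomes?
    return random.Random(",".join(sorted(heuristics))).random()

def find_workable_heuristics(threshold=0.8, max_trials=20, seed=0):
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(max_trials):
        trial = rng.sample(HEURISTIC_POOL, k=2)  # adopt a pair and live with it
        score = life_outcome_score(trial)
        if score > best_score:
            best, best_score = trial, score
        if best_score >= threshold:
            break  # good enough for this person; stop experimenting
    return best, best_score

print(find_workable_heuristics())
```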

We collectively benefit from these individual experiments. For millennia our ancestors ran similar experiments with their lives and have passed on to us the wisdom heuristics that most reliably served them well. Thus the set of heuristics they've provided us with have already been tested and found effective.

There may be other, better wisdom heuristics that cultural evolution could not find, but we should expect no more than marginal improvements over existing heuristics. That's because most wisdom heuristics are shared as simple, fuzzy concepts, so any "new" heuristics are not going to be clearly distinct from existing ones. For example, if someone were to propose I replace my heuristics of humility and kindness with meekness and goodwill, they would have to explain how meekness and goodwill are more than restatements of humility and kindness such that I would have reason to adopt them. Even if these heuristics were different, they would likely only be different on the margin, and would be unlikely to offer Pareto improvements over my existing heuristics.

Therefore I expect that, for the most part, we've already adequately explored the space of wise heuristics for humans, and the challenge of becoming wise is not so much in finding wise heuristics as it is in learning how to effectively apply the ones we already know about.

How do we train wise humans?

I don't know all the ways we might do it, but here's a rough outline of how we train wisdom in Zen.

A student works closely with a teacher over many years. That work includes thousands of hours of meditation, instruction from their teacher in both meditation and ethics, and putting that instruction into practice where the teacher can observe and offer corrections. One purpose of this work is to help the student wake up to the reality of life and free themselves from suffering ("enlightenment"), which requires the cultivation of "wisdom beyond wisdom". The result is that the teacher is said to transmit their wisdom mind-to-mind over the course of this training period.

I'm not going to claim that Zen teachers know how to psychically transmit thoughts directly from one mind to another. Instead, they are slowly and gradually training the student to develop the same generators of thought and action as they have. Thus, rather than training the student to appear wise and enlightened, they are attempting to remake the student into the type of person who is wise and enlightened, and thereby avoid Goodharting wisdom.

As will perhaps be obvious, this is not reliably possible, and yet Zen has managed to transmit its wisdom from one generation to the next without collapsing from runaway Goodharting, so its methods must work to some extent.

How do we know when a human is wise?

Again, I'll answer from my experience with Zen.

It is generally up to a Zen teacher to testify to the wisdom of their students. This is achieved through the rituals of jukai and dharma transmission. In jukai, a student takes vows to behave ethically and follow the wisdom of the teachings, and it comes after a period of training to learn to live those vows. Later, a student may receive dharma transmission and permission to teach if their teacher deems them sufficiently capable and wise, creating a lineage of wisdom certification stretching back hundreds of years.

What's important about these processes is that they place the authority to recognize wisdom in another person. That is, a person is not a reliable judge of their own wisdom, because it is too easy to self-deceive, so we instead rely on another person's judgement. And the people best able to recognize wisdom are those who are themselves wise, as others have in turn judged them to be.

Thus we know someone is wise because other people, and wise people in particular, can recognize wisdom in others.

Could we train wise AI the way we train wise humans?

Maybe. It would seem to mostly depend on how similar AI are to us, and to what extent training methods that work on humans would work on AI. Importantly, the answer will hinge on whether or not AI can be trained without Goodharting on wisdom, and whether or not we can tell if an AI has Goodharted wisdom rather than become actually wise.

If we can train AI to be wise, it would imply an ability to automate training, because if we can train one wise AI, then in theory that AI could train other AIs to be wise, in the same way wise humans are able to train other humans to be wise. We would only need to train a single wise AI in such a scheme, which could then pass on wisdom to other AIs.
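
As a sketch of why one wise AI would suffice, note that the scheme is recursive: each certified graduate joins the teaching pool. In the code below, `train` and `certify` are hypothetical callables standing in for the hard, unsolved parts.

```python
from collections import deque

def bootstrap_wise_ais(seed_teacher, students, train, certify):
    # One human-trained wise AI seeds the pool; every graduate its
    # teacher attests to (as in dharma transmission) becomes a teacher
    # in turn, so teaching capacity grows with each generation.
    teachers = deque([seed_teacher])
    certified = []
    for student in students:
        teacher = teachers[0]
        teachers.rotate(-1)             # round-robin over available teachers
        graduate = train(student, teacher)
        if certify(teacher, graduate):  # the step where Goodharting must be caught
            teachers.append(graduate)
            certified.append(graduate)
    return certified
```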

Can wisdom recognition be automated?

I'm not sure. Automation generally requires the use of measurable, legible signals, but in Zen we mostly avoid legibility. Instead we rely on the conservative application of intuitive pattern matching, like a teacher closely observing a student for years. My theory is that this is a culturally evolved defense against Goodharting.

In theory, it might be possible to train an LLM to recognize wisdom in the same way a Zen teacher would, but it would require first finding a way to train this LLM in Zen with a teacher who would be willing to give it dharma transmission. I'm doubtful that we can use training methods that work on humans, but they do offer inspiration. In particular, I suspect Zen's model of mind-to-mind transmission only works because, typical mind fallacy notwithstanding, some people really do think very similarly, and when a student is similar enough to what the teacher was like prior to their own training, the teacher is able to train the student in the same way they were trained and be largely successful. In short, training succeeds because student and teacher have sufficiently similar minds.
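
To make the idea concrete, here is a hedged sketch of LLM-based wisdom recognition. `complete` stands in for any chat-completion call, and the rubric and scale are invented here; note that a written rubric is exactly the kind of legible signal the previous answer warns can be Goodharted.

```python
# Hypothetical LLM-as-judge for wisdom; not a real API, and not a
# claim that wisdom survives being reduced to a rubric and a score.
JUDGE_PROMPT = """You are evaluating a student's response for wisdom,
not mere cleverness. Weigh humility (calibrated, non-self-deceived
beliefs) and kindness (acting for mutual benefit, not personal gain).
Reply with a single integer from 0 to 10.

Student response:
{response}
"""

def judge_wisdom(response: str, complete) -> int:
    # `complete` is any callable mapping a prompt string to a reply string.
    return int(complete(JUDGE_PROMPT.format(response=response)).strip())
```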

It's always possible that the path to superintelligent AI will pass through designs that closely mimic human minds, but that seems unlikely given we've made tremendous AI progress already with non-human-like designs. Thus it's more likely that, if we were to attempt training wisdom into AIs, we would need to look for ways to do it that would generalize to minds not like ours.

Can we use Reinforcement Learning to train wisdom?

I'm doubtful that we can successfully train wisdom using known RL techniques. The big risk with RL is Goodharting, and I don't see signs that we've found RL methods that are likely to be sufficiently robust to Goodharting under the extreme optimization pressure of superintelligent AI. At best we might be able to use RL to train wise AI that helps us to build wise superintelligent AI, but it would be inadvisable to use RL to directly train a superintelligent AI to be wise.
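
The core worry can be shown with a toy model of regressional Goodhart: the proxy reward equals the true value plus noise, and the harder we select on the proxy, the further the true value lags behind it. The numbers below are illustrative only.

```python
import random

random.seed(0)

def sample_policy():
    true_value = random.gauss(0, 1)
    proxy = true_value + random.gauss(0, 1)  # imperfect reward signal
    return true_value, proxy

# Selecting the proxy-best policy from ever-larger pools mimics rising
# optimization pressure: proxy scores keep climbing while the true
# value lags further and further behind them.
for pressure in (10, 1_000, 100_000):
    pool = [sample_policy() for _ in range(pressure)]
    true_v, proxy_v = max(pool, key=lambda p: p[1])  # optimize the proxy
    print(f"pressure={pressure:>7,}  proxy={proxy_v:5.2f}  true={true_v:5.2f}")
```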

How else might we create wise AI?

I don't have a solid answer, but given the ultimate goal is to create superintelligent AI that is aligned with human flourishing, there might be a way to use relatively wise AI to help us bootstrap a safe, superintelligent AI.

One way this could go is the following:

1. We train an LLM to be an expert on AI design and wisdom. We might do this by feeding it AI research papers and "wisdom texts", like principled arguments about wise behavior and stories of people behaving wisely, over and above those base models already have access to, and then fine tuning to prioritize giving wise responses.
2. We simultaneously train some AI safety researchers to be wiser.
3. Our wise AI safety researchers use this LLM as an assistant to help them think through how to design a superintelligent AI that would embody the wisdom necessary to be safe.
4. Iterate as necessary, using wisdom and understanding developed with the use of less wise AI to train more wise AI.

This is an extremely hand-wavy plan, so I offer it only as inspiration. The actual implementation of such a plan will require resolving many difficult questions such as what research papers and wisdom texts the LLM should be trained on, which AI safety researchers are wise enough to succeed in making progress towards safe AI, and when enough progress will have been made that superintelligent AI can safely be created.
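
To make step 1 of the plan slightly more concrete, here is one way its open questions might be written down as a training-mix config. Every name, dataset, and weight below is a placeholder, not a recommendation.

```python
# Hypothetical training mix for the "expert on AI design and wisdom" LLM.
# Each entry is one of the open questions flagged above, not a settled answer.
WISE_ASSISTANT_FINETUNE = {
    "base_model": "some-open-base-llm",                  # which model?
    "datasets": [
        {"name": "ai_research_papers", "weight": 0.5},   # which papers?
        {"name": "wisdom_texts",       "weight": 0.3},   # which traditions?
        {"name": "wise_dialogues",     "weight": 0.2},   # graded by whom?
    ],
    "stages": ["supervised_finetune", "preference_tune_for_wise_responses"],
    "evaluation": "held-out judgments from (wise) human raters",
}
```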

Doesn't this plan still risk Goodharting wisdom?

Yep! As with many problems in AI safety, the fundamental problem is preventing Goodharting. The hope I hold on to is that people sometimes manage to avoid Goodharting, such as when a Zen teacher successfully transmits wisdom to their students. Based on such examples of non-Goodharting training regimes, we may find a way to train superintelligent AI that stays safe because it doesn't succumb to Goodhart Curse or other forms of Goodharting.

What's next?

Personally, I'm going to continue to focus on helping myself and others get wiser. I seriously doubt it's the most impactful thing we can do to ensure the creation of safe, aligned, superintelligent AI, but it's the most impactful thing I expect to be able to make progress on right now.

As for you and other readers, I see a few paths forward:

1. Work on getting wiser yourself.
2. Share the wisdom you have with others.
3. Work on training LLMs that not only understand wisdom, but robustly apply it.
4. Look for ways we might create AIs that could train other AIs to be wise.
5. Figure out how AI safety research can outpace capabilities progress such that we have the time needed to figure out how to build wise AIs, or more generally how to create AI that is aligned with our flourishing.

Thanks to Owen Cotton-Barratt and Justis Mills for helpful comments on earlier drafts.


