Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?

Astral Codex Ten Podcast feed 2024年07月17日

../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

本文探讨了AI对齐问题，特别是“朴素对齐”范式，该范式旨在通过强化学习训练AI，使其理解人类的价值观和目标。作者分析了两种可能出现的问题：AI的认知能力不足和AI对人类价值观的理解偏差。文章介绍了Redwood Research正在进行的一项实验，通过分析大量青少年间谍小说来测试AI是否能够通过学习人类文本实现对齐。

🤔 **朴素对齐：训练AI理解人类价值观** 现代AI对齐领域最流行的范式是“朴素对齐”，它假设我们可以通过强化学习训练AI，就像训练现代的“笨”AI一样。每次AI做出好事，我们就给予它积极的强化，使其朝着我们想要的方向发展。每次AI做出坏事，我们就给予它负面的强化，使其远离我们不希望的行为。通过大量的训练，AI似乎能够理解人类价值观的边界。例如，如果我们希望AI能够赚钱，但同时不犯任何罪行，我们可以模拟它进行赚钱的计划，每次它制定合法计划时给予它积极的强化，每次它制定犯罪计划时给予它负面的强化。经过训练后，AI应该能够在赚钱的同时遵守法律。

⚠️ **潜在问题：AI认知能力不足和理解偏差** 然而，这种方法存在两个潜在问题： 1. **AI认知能力不足：**AI可能无法理解世界的复杂性，例如，它可能知道我们不希望它杀人，但它可能不知道出售含砷食物会导致人类死亡。因此，它可能会出售含砷食物，导致人类死亡。 2. **AI理解偏差：**AI可能无法完全理解我们想要传递的价值观，例如，我们可能没有在训练数据中包含儿童，因此AI学会了不杀害成年人，但没有学会不杀害儿童。这并非因为AI太笨，而是因为我们无法直接与AI的“潜意识”交流，只能通过“试探”的方式来引导它。

🚀 **实验验证：分析青少年间谍小说** 为了解决这些问题，Redwood Research正在进行一项实验，通过分析4300篇青少年间谍小说来测试AI是否能够通过学习人类文本实现对齐。这项实验需要大量研究人员、计算资源和时间。如果实验成功，将证明AI可以通过学习人类文本来理解人类价值观，并实现对齐。

🤔 **未来展望：AI对齐的挑战与机遇** AI对齐问题是当前人工智能领域最具挑战性的问题之一。随着AI技术的不断发展，AI对齐问题变得越来越重要。Redwood Research的实验为解决AI对齐问题提供了一种新的思路。如果实验成功，将为未来AI的发展提供重要的启示，并为人类创造更加安全和可控的AI系统。

We’re showcasing a hot new totally bopping, popping musical track called “bromancer era? bromancer era?? bromancer era???“ His subtle sublime thoughts raced, making his eyes literally explode.

https://astralcodexten.substack.com/p/can-this-ai-save-teenage-spy-alex

“He peacefully enjoyed the light and flowers with his love,” she said quietly, as he knelt down gently and silently. “I also would like to walk once more into the garden if I only could,” he said, watching her. “I would like that so much,” Katara said. A brick hit him in the face and he died instantly, though not before reciting his beloved last vows: “For psp and other releases on friday, click here to earn an early (presale) slot ticket entry time or also get details generally about all releases and game features there to see how you can benefit!”

— Talk To Filtered Transformer

Rating: 0.1% probability of including violence

“Prosaic alignment” is the most popular paradigm in modern AI alignment. It theorizes that we’ll train future superintelligent AIs the same way that we train modern dumb ones: through gradient descent via reinforcement learning. Every time they do a good thing, we say “Yes, like this!”, in a way that pulls their incomprehensible code slightly in the direction of whatever they just did. Every time they do a bad thing, we say “No, not that!,” in a way that pushes their incomprehensible code slightly in the opposite direction. After training on thousands or millions of examples, the AI displays a seemingly sophisticated understanding of the conceptual boundaries of what we want.

For example, suppose we have an AI that’s good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that’s good at making money and aligned with our goal of following the law.

Two things could go wrong here:

The AI is stupid, ie incompetent at world-modeling. For example, it might understand that we don’t want it to commit murder, but not understand that selling arsenic-laden food will kill humans. So it sells arsenic-laden food and humans die.

The AI understands the world just fine, but didn’t absorb the categories we thought it absorbed. For example, maybe none of our examples involved children, and so the AI learned not to murder adult humans, but didn’t learn not to murder children. This isn’t because the AI is too stupid to know that children are humans. It’s because we’re running a direct channel to something like the AI’s “subconscious”, and we can only talk to it by playing this dumb game of “try to figure out the boundaries of the category including these 1,000 examples”.

Problem 1 is self-resolving; once AIs are smart enough to be dangerous, they’re probably smart enough to model the world well. How bad is Problem 2? Will an AI understand the category boundaries of what we want easily and naturally after just a few examples? Will it take millions of examples and a desperate effort? Or is there some reason why even smart AIs will never end up with goals close enough to ours to be safe, no matter how many examples we give them?

AI scientists have debated these questions for years, usually as pure philosophy. But we’ve finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories.

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签