Clearer Thinking with Spencer Greenberg, July 16, 2024
How can AIs know what we want if *we* don't even know? (with Geoffrey Irving)

This article explores how to align AI systems with human values, and how a powerful AI could behave in the ways humans actually want. It covers the distinction between an AI acting as an assistant versus as an agent, whether debate between AIs could serve as an alignment strategy, and how to ensure AIs learn the full range of human values rather than only those of Western societies. It also discusses whether today's state-of-the-art language models lean politically left, and how alignment strategies should adapt as values change over time.

🤖 Aligning AI systems with human values means that an AI's decisions and behavior reflect human morals and preferences. The article presents several possible approaches, including using debate between AIs to capture the diversity of human values.

🧠 AIs acting as assistants differ from AIs acting as agents in both their responsibilities and their freedom of action. The article analyzes the ethical challenges each role may face when carrying out tasks.

🎨 The article questions whether current AIs can learn human values from across the globe rather than only those of Western societies, framing this as a matter of diversity and inclusion.

📈 On the question of whether AIs lean politically left, the article argues for deeper analysis of the political stances of existing language models.

🔄 The article emphasizes that individual and collective human values change over time, and explores how to design alignment strategies that can adapt to such change.

Read the full transcript here.

What does it really mean to align an AI system with human values? What would a powerful AI need to do in order to do "what we want"? How does being an assistant differ from being an agent? Could inter-AI debate work as an alignment strategy, or would it just result in arguments designed to manipulate humans via their cognitive and emotional biases? How can we make sure that all human values are learned by AIs, not just the values of humans in WEIRD societies? Are our current state-of-the-art LLMs politically left-leaning? How can alignment strategies take into account the fact that our individual and collective values occasionally change over time?

Geoffrey Irving is an AI safety researcher at DeepMind. Before that, he led the Reflection Team at OpenAI, was involved in neural network theorem proving at Google Brain, cofounded Eddy Systems to autocorrect code as you type, and worked on computational physics and geometry at Otherlab, D. E. Shaw Research, Pixar, and Weta Digital. He has screen credits on Ratatouille, WALL•E, Up, and Tintin. Learn more about him at his website, naml.us.

