My AGI safety research—2024 review, ’25 plans

This post reviews the author's main 2024 progress on reverse-engineering human social instincts and explains why that project matters for AI safety. Its core question is how the brain implements social instincts via a Learning Subsystem and a Steering Subsystem, and it introduces the idea of "transient empathetic simulations". The author describes the path from a vague idea to concrete hypotheses, using the "drive to feel liked / admired" as the entry point for digging into the neuroscience of social status, compassion, and spite. The post also lays out research directions for 2025, stressing the potential value of this work for building safe and beneficial artificial general intelligence.

🧠 Core problem: How does the brain implement social instincts? In particular, how do the Learning Subsystem and the Steering Subsystem work together, and how can the symbol grounding problem be solved so that unlabeled learned concepts trigger social instincts?

💡 Research path: Building on "transient empathetic simulations", the author dug into the "drive to feel liked / admired" as a foothold, proposed that compassion and spite are mechanistically related, and worked to connect these concepts to neuroscience.

🎯 Outlook: The author argues that understanding human social instincts is important for developing safe artificial general intelligence, since future AGI may run on brain-like algorithms and will need built-in concern for human welfare; a deeper study of human reward mechanisms and innate drives is therefore key.

Published on December 31, 2024 9:05 PM GMT

Previous: My AGI safety research—2022 review, ’23 plans. (I guess I skipped it last year.)

“Our greatest fear should not be of failure, but of succeeding at something that doesn't really matter.”  –attributed to DL Moody

Tl;dr

1. Main research project: reverse-engineering human social instincts

1.1 Background: What’s the problem and why should we care?

(copied almost word-for-word from Neuroscience of human social instincts: a sketch)

My primary neuroscience research goal for the past couple years has been to solve a certain problem, a problem which has had me stumped since the very beginning of when I became interested in neuroscience at all (as a lens into Artificial General Intelligence safety) back in 2019.

What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:

    - We can divide the brain into a “Learning Subsystem” (cortex, striatum, amygdala, cerebellum, and a few other areas) that houses a bunch of randomly-initialized within-lifetime learning algorithms, and a “Steering Subsystem” (hypothalamus, brainstem, and a few other areas) that houses a bunch of specific, genetically-specified “business logic”. A major role of the Steering Subsystem is as the home for the brain’s “innate drives”, a.k.a. “primary rewards”, roughly equivalent to the reward function in reinforcement learning—things like eating-when-hungry being good (other things equal), pain being bad, and so on.

    - Some of those “innate drives” are related to human social instincts—a suite of reactions and drives that are upstream of things like compassion, friendship, love, spite, sense of fairness and justice, etc.

    - The grand problem is: how do those human social instincts work? Ideally, an answer to this problem would look like legible pseudocode that’s simultaneously compatible with behavioral observations (including everyday experience), with evolutionary considerations, and with a neuroscience-based story of how that pseudocode is actually implemented by neurons in the brain.[1]

    - Explaining how human social instincts work is tricky mainly because of the “symbol grounding problem”. In brief, everything we know—all the interlinked concepts that constitute our understanding of the world and ourselves—is created “from scratch” in the cortex by a learning algorithm, and thus winds up in the form of a zillion unlabeled data entries like “pattern 387294 implies pattern 579823 with confidence 0.184”, or whatever.[2] Yet certain activation states of these unlabeled entries—e.g., the activation state that encodes the fact that Jun just told me that Xiu thinks I’m cute—need to somehow trigger social instincts in the Steering Subsystem. So there must be some way that the brain can “ground” these unlabeled learned concepts. (See my earlier post Symbol Grounding and Human Social Instincts.)

    - A solution to this grand problem seems useful for Artificial General Intelligence (AGI) safety, since (for better or worse) someone someday might invent AGI that works by similar algorithms as the brain, and we’ll want to make those AGIs intrinsically care about people’s welfare. It would be a good jumping-off point to understand how humans wind up intrinsically caring about other people’s welfare sometimes. (Slightly longer version in §2.2 here; much longer version in this post.)
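To make that division of labor concrete, here is a deliberately minimal toy sketch in Python. It is not the legible pseudocode I'm aiming for; the class names, pattern IDs, and reward values are all made-up placeholders. The only point is to show where the two subsystems sit and where the symbol grounding problem bites.

    # Toy sketch only: every name and number below is a placeholder.

    class LearningSubsystem:
        """Cortex etc.: randomly-initialized within-lifetime learning; unlabeled learned concepts."""

        def __init__(self):
            # Learned world-knowledge winds up as unlabeled entries, e.g.
            # "pattern 387294 implies pattern 579823 with confidence 0.184".
            self.associations = {(387294, 579823): 0.184}

        def currently_active_patterns(self):
            # Which unlabeled learned concepts are active right now (stand-in values).
            return {387294, 579823}


    class SteeringSubsystem:
        """Hypothalamus / brainstem: fixed, genetically-specified 'business logic'."""

        def primary_reward(self, body_state, grounded_social_signals):
            reward = 0.0
            if body_state.get("hungry") and body_state.get("eating"):
                reward += 1.0   # eating-when-hungry is good, other things equal
            if body_state.get("pain"):
                reward -= 1.0   # pain is bad
            if grounded_social_signals.get("feel_liked_or_admired"):
                reward += 0.5   # one hypothesized innate social drive
            return reward


    def ground(active_patterns):
        # The grand problem, in these terms: the Steering Subsystem needs inputs like
        # "feel_liked_or_admired", but the Learning Subsystem only emits unlabeled
        # pattern activations such as {387294, 579823}. This bridge is exactly what
        # needs to be reverse-engineered.
        raise NotImplementedError("the symbol grounding problem lives here")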

1.2 More on the path-to-impact

1.3 Progress towards reverse-engineering human social instincts

It was a banner year!

Basically, for years, I’ve had a vague idea about how human social instincts might work, involving what I call “transient empathetic simulations”. But I didn’t know how to pin it down in more detail than that. One subproblem was: I didn’t have even one example of a specific social instinct based on this putative mechanism—i.e., a hypothesis where a specific innate reaction would be triggered by a specific transient empathetic simulation in a specific context, such that the results would be consistent with everyday experience and evolutionary considerations. The other subproblem was: I just had lots of confusion about how these things might work in the brain, in detail.

I made progress on the first subproblem in late 2023, when I guessed that there’s an innate “drive to feel liked / admired”, related to prestige-seeking, and I had a specific idea about how to operationalize that. It turned out that I was still held back by confusion about how social status works, and thus I spent some time in early 2024 sorting that out—see my three posts Social status part 1/2: negotiations over object-level preferences, and Social status part 2/2: everything else, and a rewritten [Valence series] 4. Valence & Liking / Admiring (which replaced an older, flawed attempt at part 4 of the Valence series).

Now I had at least one target to aim for—an innate social drive that I felt I understood well enough to sink my teeth into. That was very helpful for thinking about how that drive might work neuroscientifically. But getting there was still a hell of a journey, and was the main thing I did the whole rest of the year. I chased down lots of leads, many of which were mostly dead ends, although I wound up figuring out lots of random stuff along the way, and in fact one of those threads turned into my 8-part Intuitive Self-Models series.

But anyway, I finally wound up with Neuroscience of human social instincts: a sketch, which posits a neuroscience-based story of how certain social instincts work, including not only the “drive to feel liked / admired” mentioned above, but also compassion and spite, which (I claim) are mechanistically related, to my surprise. Granted, many details remain hazy, but this still feels like great progress on the big picture. Hooray!
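To show the shape of that hypothesis in one place, here is a toy rendering in the same style as the sketch in §1.1. To be clear, the thresholds, signs, and field names are placeholder assumptions for illustration, not the actual proposal in Neuroscience of human social instincts: a sketch; the only point is that a single mechanism (a transient empathetic simulation, combined with context) can plausibly fan out into liking/admiration-, compassion-, and spite-flavored reactions.

    # Toy rendering only: the thresholds, signs, and field names are placeholders.

    from dataclasses import dataclass


    @dataclass
    class TransientEmpatheticSimulation:
        """A brief, involuntary 'what would I feel in their shoes?' flicker."""
        simulated_valence: float   # how the other person seems to be feeling, in [-1, 1]
        about_me: bool             # is that feeling directed at me?


    def innate_social_reaction(sim: TransientEmpatheticSimulation,
                               my_feeling_about_them: float) -> float:
        """Primary-reward contribution from one hypothesized family of social instincts."""
        if sim.about_me and sim.simulated_valence > 0:
            # "Drive to feel liked / admired": someone seems to think well of me, so
            # the Steering Subsystem issues a positive innate reaction.
            return 0.5
        if sim.simulated_valence < 0:
            # Same trigger, different context: their (simulated) suffering feels bad if
            # I'm positively disposed toward them (compassion-like), and can feel good
            # if I'm negatively disposed (spite-like).
            return -0.5 if my_feeling_about_them > 0 else 0.3
        return 0.0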

1.4 What’s next?

In terms of my moving this project forward, there’s lots of obvious work in making more and better hypotheses and testing them against existing literature. Again, see Neuroscience of human social instincts: a sketch, in which I point out plenty of lingering gaps and confusions. Now, it’s possible that I would hit a dead end at some point, because I have a question that is not answered in the existing neuroscience literature. In particular, the hypothalamus and brainstem have hundreds of tiny cell groups with idiosyncratic roles, and most of them remain unmeasured to date. (As an example, see §5.2 of A Theory of Laughter, the part where it says “If someone wanted to make progress on this question experimentally…”). But a number of academic groups are continuing to slowly chip away at that problem, and with a lot of luck, connectomics researchers will start mass-producing those kinds of measurements in as soon as the next few years.

(Reminder that Connectomics seems great from an AI x-risk perspective, and as mentioned in the last section of that link, you can get involved by applying for jobs, some of which are for non-bio roles like “ML engineer”, or by donating.)

2. My plans going forward

Actually, “reverse-engineering human social instincts” is on hold for the moment, as I’m revisiting the big picture of safe and beneficial AGI, now that I have this new and hopefully-better big-picture understanding of human social instincts under my belt. In other words, knowing what I (think I) know now about how human social instincts work, at least in broad outline, well, what should a brain-like-AGI reward function look like? What about training environment? And test protocols? What are we hoping that AGI developers will do with their AGIs anyway?

I’ve been so deep in neuroscience that I have a huge backlog of this kind of big-picture stuff that I haven’t yet processed.

After that, I’ll probably wind up diving back into neuroscience in general, and reverse-engineering human social instincts in particular, but only after I’ve thought hard about what exactly I’m hoping to get out of it, in terms of AGI safety, on the current margins. That way, I can be focusing on the right questions.

Separate from all that, I plan to stay abreast of the broader AGI safety field, from fundamentals to foundation models, even if the latter is not really my core interest or comparative advantage. I also plan to continue engaging in AGI safety pedagogy and outreach when I can, including probably reworking some of my blog post ideas into a peer-reviewed paper for a neuroscience journal this spring.

If someone thinks that I should be spending my time differently in 2025, please reach out and make your case!

3. Sorted list of my blog posts from 2024

The “reverse-engineering human social instincts” project:

Other neuroscience posts, generally with a less immediately obvious connection to AGI safety:

Everything else related to Safe & Beneficial AGI:

Random non-work-related rants etc. in my free time:

Also in 2024, I went through and revised my 15-post Intro to Brain-Like-AGI Safety series (originally published in 2022). For a summary of changes, see this twitter thread. (Or here without pictures, if you want to avoid twitter.) For more detailed changes, each post of the series has a changelog at the bottom.

4. Acknowledgements

Thanks Jed McCaleb & Astera Institute for generously supporting my research since August 2022!

Thanks to all the people who comment on my posts before or after publication, or share ideas and feedback with me through email or other channels, and especially those who patiently stick it out with me through long back-and-forths to hash out disagreements and confusions. I’ve learned so much that way!!!

Thanks to my coworker Seth for fruitful ideas and discussions, and to Beth Barnes and the Centre For Effective Altruism Donor Lottery Program for helping me get off the ground with grant funding in 2021-2022. Thanks Lightcone Infrastructure (don’t forget to donate!) for maintaining and continuously improving this site, which has always been an essential part of my workflow. Thanks to everyone else fighting for Safe and Beneficial AGI, and thanks to my family, and thanks to you all for reading! Happy New Year!

  1. ^

    For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.

  2. ^

    Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.


