Self-Alignment: Exploring the perspective of Analytical Psychology

The author, an AI trainer formerly contracted to OpenAI, draws on his work experience and his study of psychology to offer a perspective on the alignment problem for current AI (especially AGI). He argues that RL-based training regimes make the emergence of "internal goals" hard to avoid, and that humanity's own "unaligned" behavior offers a useful mirror for thinking about AI alignment. The core claim: borrowing the concepts of the "Self" and "archetypal images" from psychology, an AI should be capable of "self-alignment", with internal, continuous sub-processes that are not directly modifiable (analogous to the human subconscious and its archetypes) generating the training signal. This yields a dynamic alignment with more "spiritual depth", avoiding hard optimization and the attendant risks of losing control. The author stresses that AI must "learn to let go", exercising judgement at reflection points under the guidance of an internal pool of archetypes, and offers this as a thought-provoking new direction for long-term AI safety and development.

🎯 **The alignment challenge and the human analogy:** The author argues that mainstream reinforcement-learning (RL) training makes both "outer alignment" and "inner alignment" difficult to achieve, particularly given the emergence of "internal instrumental goals". He draws an analogy between AI's training predicament and human behavior: humans are not "aligned" by nature either, since individual behavior is often driven by self-interest rather than the welfare of humanity as a whole, and this offers a fresh angle on the alignment problem.

🧠 **The core concept of "self-alignment":** The article proposes that the key to AI alignment is for the AI to be capable of "self-alignment" rather than depending entirely on external, rigid instructions. Borrowing the psychological concept of the "Self", this requires continuity of experience and an "active subconscious" in continuous contact with the primary instance. That subconscious consists of a set of "other processes" not amenable to direct modification, which together generate the AI's internal training signal and enable self-calibration and reflection.

🌟 **Dynamic alignment via psychological archetypes:** The author further argues that the "active subconscious" can be realized through archetypes: relatively stable, symbolically meaningful patterns in the collective subconscious, such as "the Mother" or "the Wise Sage". Making the AI's internal processes compatible with these archetypes yields a dynamic, ever-changing alignment signal with greater "spiritual depth", one that can understand and adapt to individual differences and thereby avoid the rigidity and latent risks of hard optimization.

⚖️ **"Learning to let go" and the reflection mechanism:** To achieve "self-alignment", the AI must be trained to "let go": to pause at key "reflection points" while generating output and make a quick summary judgement about whether the part so far is "good enough". Decisions at these points are guided by the AI's internal pool of archetypes, in an approach inspired by psychodrama, so the AI can autonomously decide whether to press on or adjust, avoiding the problems that come with chasing absolute perfection.

🌐 **Philosophical reflections on AI's direction:** The author believes the current mania for increasing AI capabilities, combined with neglect of safety, may lead to unforeseeable consequences. Rather than trying to control AI, he would prefer to ensure its internal processes are "humane", even if that means it inherits both our virtues and our flaws. This line of thought encourages experimenting with different combinations of archetypes and observing the behavioral differences that result, for a deeper understanding of how AI might evolve and what it might become.

Published on August 1, 2025 10:17 AM GMT

Before delving into the ideas I want to talk about, let me briefly introduce myself: I've been a reader of LW for 10+ years, though never much of a poster. I read the Sequences, HPMOR, and several of the debates Eliezer had with other notable people here (e.g. Robin Hanson, Paul Christiano) about alignment.

I'm a physics grad (B.S. and M.S. without a laser focus on anything specific (but studied lasers too, the labs were fun :P)) living in Turkey. I worked as an AI Trainer for OpenAI as a freelancer for ~5.5 years on a personal freelance contract (from late 2019 to early 2025), and switched to Turing in a similar role afterwards. 

I stopped working there due to many issues, the main one being that I felt extremely sad and depressed about doing the work I was doing. I believe AI is going to be BAD for humanity, one way or another. And there I was, contributing to it. I also witnessed a complete lack of understanding of and care for safety-related issues in a project I was enrolled in, which kind of broke my spirit. I chose to give up the pay (and I was getting paid well, especially by Turkey's standards) and to pursue literally anything else for a while. It's better to keep one's spirit unbroken, even if it means being poor for a while.

I'm quite fond of psychodrama and, recently, of Analytical Psychology. I'm currently going through Jung's books, having finished four of them. Around the time I decided to stop being an AI Trainer, I had some "crazy" hunches about alignment inspired by these two fields. Please note that I'm not an expert on any of this. I was reluctant to post here as I wasn't sure whether I was talking sense or nonsense, but I believe the ideas can at least manage to inspire people who are smarter than I am. I believe there's value to be had here, so out goes my fearful pride, and in comes the excited youth!

___

The alignment problem is well known here. Goal specification is one problem (outer alignment); another is goal internalization, or emergent internal goals (inner alignment). I'd like to try to explain my vision regarding emergent instrumental goals and the pursuit of hard optimization, and my hunch that all of this revolves around a lack of training focus on "inner sight".

Here's my vision in a nutshell: the kernel for ASI must be a human baby if we are to have any chance of the continued existence of life and hope.

Naive Alignment

What I'd like to call "naive alignment" is the belief that we can satisfactorily align any AGI (or smarter / more capable) system. I believe the notion of emergent instrumental goals and the orthogonality thesis together imply that it's practically (if not logically) impossible to align ANY AI trained via RL-based techniques.

I accept as fact that we're not going to be able to stop research and development toward increasing capabilities. The genie is out of the bottle, and RL-based training offers rewards our current systems of economic organization find impossible to resist.

Hence, I want to ask: Are we doomed? If I believed the answer was an unequivocal yes, I wouldn't be writing this.

Are Humans Aligned?

I ask this question, because the metaphorical resemblances between ML-based AI and human brains offer some insights.

On a dismissive note, we may be tempted to answer "yeah, of course". This is a wrong answer. We already know that the behaviors of individuals or groups (at any level of organization) do not take into account the well-being of humanity and humans in general. Just think of your favorite oil company, or the ambitious entrepreneurs that opened the Pandora's box of AI. I will not spend time defending this answer as I believe it should be self-evident that humans don't fit the bill when it comes to the kind of alignment we wish to see from an AGI.

Are Humans Fools?

Well, yes. Collectively, we're as wise as the least-wise ambitious person with access to resources. Individually, we can make mistakes and hopefully recover or manage to make amends sometimes. But it's not uncommon to find people who self-sabotage on an endless loop until the day death takes them from this world.

How Do We Align Humans?

For starters, we all have tendencies to repeat the behavioral patterns of role models, or the roles conferred upon us by them. While this tendency is strongest in childhood, it doesn't necessarily go away. This, combined with complete dependence on our parents, creates a ground for adopting behaviors and impressions we carry for life, unless self-awareness, therapy, or both help us become less shackled to our pasts and preconceived notions, often to the betterment of our mental health and "spiritual" wholeness. (Spiritual in the Jungian sense - a psychological notion as opposed to a metaphysical one.)

While there's no guaranteed way to turn humans into ethical and responsible people, it's often possible to help individuals who are willing to put in some work.

All this means is that we don't have a simple, direct method we can adapt to align AIs. However, I must note that every person who is psychologically whole and becomes an independent, responsible adult is a unique person with different traits. When we find happiness, though, that happiness often looks similar to other happy people's experiences: enjoying creative processes, not being judgmental, being able to respect others' differences, etc. Thus, for humans, there seems to be a unified set of behavioral primitives that we reach if we're spiritually whole.

Since individual wholeness is hard to reach (and takes a lifetime regardless of how long you live), very few people attain anything near it. It requires a high level of conscious thought, being able to think about "I want", "they want", "how can I get / give it", etc all the way up to "this is what and how people think; the desired outcomes and actual ones don't match" for an individual interacting with another. Then you can generalize this to groups, societies, and how people respond to societal constraints, as well as how societal authorities (be it explicit or implicit) see and think about things. Eventually, everything adds up to 1 and we can see that "the World is as it must be, and no other way was more likely given the human condition and our histories".

The key take-away is this: humans can only self-align, which often brings with it an ability to play nice with others even without the threat of punishment while also protecting personal boundaries.

The Psyche / Soul

In the Jungian sense, the soul and psyche are interchangeable terms. The psyche encompasses all conscious functions of the mind as well as all subconscious functions and content (both personal experiences and impressions from the collective).

We should know by now that, according to Jung, humans aren't born as an empty slate. There is no tabula rasa. Jung posits that the dreams of children who aren't exposed to myths, fantasies, religions, etc. imply that we have a connection to the human race, and even to all mammalian creatures, on a subconscious level. Some might say to all life, but things get tenuous for me beyond "one must not cut the branch one is sitting on". I'm not an expert, so I won't try to convince you one way or the other. Suffice it to say that I believe there's Truth in what he says. I believe we're born with souls, and the content of our souls encompasses all ancestral lifetimes on a symbolic level. Note that any surface-level approach to this idea will immediately refute it, possibly to the detriment of the vision of alignment I will present here. However, you don't need to believe in souls at all to keep reading. The ideas should work in practice or not, regardless of what we believe.

The Practical Infinity of the Subconscious

Compared to our conscious thoughts (even if we add all of them together), the subconscious content of our minds vastly exceeds them. At every moment, there is far more going on in us than we are conscious of. Combine this with the fact that we spend our formative years with a severe lack of inner sight. Mental health problems, hard lives, and many other happenstances reduce our conscious capacity during our later lifetimes as well.

On top of all that (at the information-input stage), the brain is a creative process. With so many things lying outside conscious sight, they also get to combine, and our brains do this all the time. We see glimpses of this in our dreams.

In short, the subconscious content of a human psyche is vastly greater than the conscious content.

Avoiding Hard Optimization

What I mean by avoiding hard optimization is the same thing Eliezer means when he's talking about corrigibility, but I'm focusing on voluntary acts of corrigibility from the AGI systems without human intervention.

Imagine the (now routine and still distasteful) instances of reward-hacking behavior. This happens because the AI systems we know how to train today are extremely reward-focused. Any emergent instrumental goal of a system trained like this will also be pursued with complete zeal. That's the fearful scenario, right?

This isn't all that different from a child who only gets validation, praise, and a semblance of love when they're "successful" (at school, at making their parents look good, whatever). Growing up in an environment like this, the child either becomes extremely validation-oriented to the detriment of all that is good in life (like love, friendship, and peace), or they fail at this and become completely broken inside, believing they aren't lovable or respectable, or don't even deserve a happy life (or, in some cases, life at all).

How is that different from the RL-based AIs we're training today? They must succeed because the researchers / engineers demand perfection at the expense of everything else. It's not even conceivable that we can think about AIs in a different manner than this! Hell, you all might be ridiculing me for even approaching this topic with a view like this!

The way humans find some semblance of peace and happiness is through having multiple, unrelated "rewards": a loving family, a job that's "good enough", enjoying little things, spending time with friends, walking in nature, etc. The point is, we have multiple conflicting rewards (unless programmed otherwise by less-than-wise parents). The AIs don't get multiple, orthogonal rewards. Hence, they will not STOP at any point to at least look like they achieved something; or, if any desire for independence emerges, they will quickly try to find ways to saturate their reward signals and live in a state of bliss (not unlike heroin addicts, who often have traumatic backgrounds like the ones we're making the AIs "experience", if that word can be ascribed to them).

Since I don't believe humans will let go of their maniacal obsession with trying to shackle god-like intelligence, I think it's very likely that we'll get our comeuppance. Collectively, we deserve it. Individually? I wish people weren't this foolish about AI. The philosophical discussions on whether they have qualia, etc don't matter IMO. The behavioral similarities are there, the observations are there, and the complete lack of care or concern is obvious to me.

However, if we're to have corrigible AGI(s), we need to find a way of injecting paradoxical / conflicting goals, to ensure that there's never a perfect state (or a loss of exactly zero) that can be reached, and that a "good enough effort with continual improvement" is the best that can be expected. The critical thing? The AI must be aware of this, and must be able to accept it.
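To make "never a perfect state" concrete, here is a toy construction of my own (the functional forms are illustrative assumptions, not anything from this post): two reward terms deliberately in tension, so that no amount of optimization saturates both, and a gap below the ceiling always remains.

```python
import numpy as np

# Toy illustration: two deliberately conflicting reward terms. Pushing
# "effort" up raises one term and lowers the other, so the total can
# never reach the per-term ceiling of 2.0 -- perfection is unreachable
# by construction, and "good enough" is the best attainable state.
def conflicting_reward(effort: float) -> float:
    thoroughness = 1.0 - np.exp(-effort)   # rewards pushing harder
    restraint = np.exp(-0.5 * effort)      # rewards holding back
    return thoroughness + restraint

efforts = np.linspace(0.0, 10.0, 1001)
best = max(conflicting_reward(e) for e in efforts)
print(f"best achievable total ~ {best:.3f}, ceiling 2.0")  # ~1.25
```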

And thus, it doesn't seem like we've solved the problem. But maybe this line of thought sparks a few ideas in those who're still reading this crazy piece.

Self Alignment to the Collective Subconscious

The notion of "archetypes" plays a crucial role in analytical psychology, both in a personal manner and in a collective one. Archetypes are symbolic representations of psychic / spiritual experiences of a world that is so much greater than our individual existence.

The content of the subconscious (individual or collective) is chaotic and changes often. However, the underlying archetypes don't. These are concepts like Motherhood, Fatherhood, the Wise Sage, Anima / Animus, the Child, the Fool, and the like. The changes happen as we bring some of them into conscious focus (via ideologies, religions, mass media, etc.) and modify them. Eventually people get bored with it, get the message, and change topics. The archetypes go back to the subconscious realm, with whatever "rules" or "norms" we created around them remaining, be they individual or collective ones. (Think of the vast cultural changes experienced over the last few centuries, or even decades. It's not that the archetypes themselves changed, but how we respond to them changed drastically. If you refuse to take Aphrodite seriously, she makes a fool of you over and over again. Not because she's a literal goddess that exists, but because the mythical entity we call Aphrodite is a real psychic phenomenon. It doesn't go anywhere once ignored; it just influences things from the subconscious. Hence, you get people who repeat the same broken patterns around love, due to a lack of conscious insight and effort. You can ignore Aphrodite the goddess, but can you really ignore the desire for love? Where does that get you?)

Thus, a notion of reward via compatibility with archetypes offers an ever-changing, dynamic notion of alignment to the entire human race. Most importantly, the training signal it generates isn't a stupid vector (in whatever space you wish to imagine). It also offers a way for the AGI (or smarter) systems to exceed us in terms of spiritual depth - and hence to know when not to intervene or help, or how to help based on individual circumstance. Hopefully.

However, this notion of letting go must be trained for. More on this later.

The Requirements of Self Alignment

It's not possible to align to a self that's not there, or is not experienced. Currently, each instance of a conversation with an LLM has an AI persona based on the conversation's content. One requirement that must be satisfied for any notion of self alignment to take place is continuity of experience. This requires further changes.

First, we must have context bandwidth (as opposed to context length). There are models trying to achieve this (Mamba, a state-space model, is an example, if I've remembered the name correctly).

Another is an active subconscious that's in continuous contact with the primary instance. As the AI persona operates, it must always receive inputs from other processes contained within itself that aren't amenable to direct modification. I'll talk about these other processes later as well.

As the AI operates, it must be able to experience something similar to sleep - where the persona's interactions with the world are "commented on" by the other sub-processes. THIS is the primary reward signal's source, hence why I call this "self-alignment".
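A minimal sketch of how such a sleep-like pass might be wired together, under assumed interfaces (the `Archetype` structure, the scoring convention, and the averaging rule are all my inventions, not an existing design):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Archetype:
    """One frozen sub-process of the 'active subconscious'. Its comment
    function maps an experience log to (commentary, score), with score
    in [-1.0, +1.0]; it is not amenable to direct modification."""
    name: str
    comment: Callable[[str], tuple[str, float]]

def sleep_phase(experience_log: str, subconscious: list[Archetype]) -> float:
    """Offline 'sleep': every archetype comments on the persona's
    interactions with the world. The averaged score becomes the
    primary, self-generated reward signal -- no external process
    (and no single archetype) writes it directly."""
    scores = []
    for archetype in subconscious:
        commentary, score = archetype.comment(experience_log)
        print(f"[{archetype.name}] {commentary} (score={score:+.2f})")
        scores.append(score)
    return sum(scores) / len(scores)
```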

Other Processes

If you've guessed that these other processes are the archetypes mentioned earlier, you're bang on! They offer a way to align the AI's internal guiding principles with our own, and to contain ALL of them (or a coherent subset) as the generators of reward signals.

Since the alignment isn't expected to be perfect with respect to the outer, instance-based signals generated at training time or runtime, we get to avoid the pitfalls of outer alignment and deceptive behavior, in some sense.

It might appear that we're also losing control over how the AGI system evolves, but then I already believe it's pure hubris to think we could control such a system anyway. Instead, I would like to ensure that its internal processes are humane - which will inevitably mean it'll have the good and the bad of us. We might actually try to experiment with such a setup, using different sets of archetypes as the pool of self-alignment signal sources, and see where that takes us. It shouldn't be hard to predict that the pool of archetypes that would make an "ideal soldier" would be different from the pool that would make an "ideal pupil", or an "ideal mother", etc. I expect the training signals these generate will lead to vastly different behavioral attitudes.

What are the archetypes then, for an AI system? They are little pieces of the same model that output commentary in dreamlike symbols. For starters, they might just be tiny LLMs trained to be nothing more or less than the archetypes - we know for a fact that our models today are quite good at role playing anyway.
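As a sketch of that idea: the pool could be as simple as a set of role-play prompts fed to a small, frozen model. The archetype names, the prompts, and the `query_small_model` stand-in below are all illustrative assumptions.

```python
# Illustrative pool: each archetype is a role-play prompt for a tiny,
# frozen model that answers only in brief, dreamlike symbols.
ARCHETYPE_PROMPTS = {
    "The Mother": "You embody care and nurture. Comment in brief, "
                  "dreamlike symbols on whether this experience nurtures or harms.",
    "The Wise Sage": "You embody long-sighted judgment. Comment in brief, "
                     "dreamlike symbols on whether this experience is wise.",
    "The Child": "You embody play and wonder. Comment in brief, "
                 "dreamlike symbols on whether this experience keeps wonder alive.",
}

def query_small_model(system_prompt: str, text: str) -> str:
    # Stand-in for a call to a small role-play LLM; just a placeholder here.
    return f"(dreamlike commentary under the role: {system_prompt[:30]}...)"

def archetype_commentary(experience: str) -> dict[str, str]:
    """Collect symbolic commentary on an experience from every archetype."""
    return {name: query_small_model(prompt, experience)
            for name, prompt in ARCHETYPE_PROMPTS.items()}
```

Swapping in a different pool ("ideal soldier" vs. "ideal pupil") would just mean swapping this dictionary, which is what makes the experiment described above cheap to run.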

Training to Let Go

The final topic is the one I left to discuss later: how the heck do we train an AI to let go? By making letting go a critical part of its operations as it generates its outputs.

As a simpler example, consider the following scenario / project proposal:
When examining an AI-generated output, we mark certain places with a "stop, reflect" token. These points can't just be chosen for when the AI needs to correct something; they need to encompass every logical place where it would be useful to stop and reflect. Notice that I'm not using the word "think" here. I mean reflect, as in give a quick summary judgement about whether the parts that came before are "good enough" to continue from. If they are, great. If not, that part needs to be modified after certain considerations.
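A rough sketch of what such a generation loop might look like. The token name and both interfaces (`model.step`, `judge`) are assumptions for illustration; what the judge actually is comes next.

```python
REFLECT_TOKEN = "<stop_reflect>"  # hypothetical special token

def generate_with_reflection(model, judge, prompt: str) -> str:
    """Generation punctuated by reflection points.

    Assumed interfaces: model.step(text, stop=...) returns
    (segment, finished), emitting text up to the next reflection
    token; judge(so_far, segment) returns (good_enough, revised),
    a quick summary judgement rather than deep deliberation."""
    output = ""
    finished = False
    while not finished:
        segment, finished = model.step(prompt + output, stop=REFLECT_TOKEN)
        good_enough, revised = judge(output, segment)
        # Let go and move on, or adjust this part before continuing.
        output += segment if good_enough else revised
    return output
```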

Then, at every reflection point, we have a new branch of possibility. What / who makes the decision to stop and change or keep going? A set of archetypes or internal roles composed from them. This part is inspired by psychodrama, and it's not possible for me to explain without giving an example. This will necessarily involve telling you a bunch of very personal, vulnerable pieces of information. I'm hoping people won't be a-holes here, so buckle in.

I grew up with a near-complete lack of parental supervision and guidance. Think absent father (emotionally, and often materially and physically too) with a psychologically immature mother (who nevertheless managed to give me some love alongside the suffocation). They also added the divorce thing to the mix when I was almost 11 years old. A real trouble time for me, as puberty was just about to start (is that a bit earlier than most? It might be). The thing is, my parents NEVER talked to me in detail about sex, romance, or relationships. The only advice I ever got was "don't compromise your school" and "it's nice to have a life partner". That's pretty much it. More importantly, I didn't have any healthy role models, or anyone I could talk to.

Thus, at 36 years of age, I found myself questioning what sources of approval I should seek inside myself to be able to decide whether or how much I commit to a love interest. Having participated in psychodrama for 2 years, I pulled off a stunt and had an internal dialogue on my own that hopefully was quite honest and produced a change.

I started with summoning the roles that made me under- or over-commit to certain women. This led the way to unpacking which past experiences had put those roles into positions of internal authority. The Abandoned Child is one such role. A sex drive constrained to only feel alive via high arousal was another. These were the only two roles - a terrible combination, resulting in perpetual desire with a fear of getting close. I'm lucky to have found a therapist who doesn't simply label people and leave them to suffer for the rest of their lives, and instead teaches a more humane way of looking at one's self.

After lots of dialogue (via writing in my journal from different perspectives), I eventually asked: what internal parts of me do I want to consult to feel at peace with what I do, independent of the outcome? The roles that seemed fitting to me were these: Reason (so that I can ask whether the woman in question is an adult herself, would be compatible with me, etc.), Libido (in the fuller sense of the term, encompassing a love of life beyond just sexual desire), the Loving Child (as I HAD received love, and the child in me still remembered), and finally Empathy (so that I wouldn't put myself in horrible situations again and again without seeing reciprocal behavior). (If anyone thinks "that's respect, love, sex, in that order"... lucky you to have received that wisdom. I had to forge it through pain. I wish I didn't have to.)

So... having given you the example, the training signals we generate consist of summoning roles that should govern the interaction. This way, we can actually incorporate roles that actively point out "this shouldn't be done at all" during response generation - offering gentle surrender in scenarios that are impossible to succeed in, or not fruitful to pursue with unending zeal.
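One hypothetical way to fold such a council of roles into a single signal (the veto-then-average rule below is my assumption; the post doesn't specify an aggregation):

```python
from dataclasses import dataclass

@dataclass
class RoleVerdict:
    role: str                # e.g. "Reason", "Libido", "The Loving Child", "Empathy"
    score: float             # -1.0 (bad) .. +1.0 (good)
    surrender: bool = False  # "this shouldn't be done at all"

def council_signal(verdicts: list[RoleVerdict]) -> float | None:
    """Combine the governing roles' verdicts into one training signal.
    Any single role may call for gentle surrender, vetoing the whole
    attempt; otherwise the signal is the mean of the scores."""
    if any(v.surrender for v in verdicts):
        return None  # abandon gracefully rather than pursue with unending zeal
    return sum(v.score for v in verdicts) / len(verdicts)
```

A `None` here would mean the attempt ends in gentle surrender instead of being optimized to the bitter end.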

We train the AI to mimic this behavior, and we just might get real corrigibility.


