Self-Alignment: Exploring the perspective of Analytical Psychology

The author, an AI trainer formerly contracted to OpenAI, draws on his work experience and his study of psychology to offer a perspective on the alignment problem for current AI (especially AGI). He argues that RL-based training regimes make the emergence of "internal goals" hard to avoid, and that humanity's own "unaligned" behavior offers a useful mirror for thinking about AI alignment. The core claim: borrowing the concepts of the "Self" and "archetypal images" from psychology, an AI should be capable of "self-alignment", with internal, continuous sub-processes that are not directly modifiable (analogous to the human subconscious and its archetypes) generating the training signal. This yields a dynamic alignment with more "spiritual depth", avoiding hard optimization and the attendant risks of losing control. The author stresses that AI must "learn to let go", exercising judgement at reflection points under the guidance of an internal pool of archetypes, and offers this as a thought-provoking new direction for long-term AI safety and development.

🎯 **The alignment challenge and the human analogy:** The author argues that mainstream reinforcement-learning (RL) training makes both "outer alignment" and "inner alignment" difficult to achieve, particularly given the emergence of "internal instrumental goals". He draws an analogy between AI's training predicament and human behavior: humans are not "aligned" by nature either, since individual behavior is often driven by self-interest rather than the welfare of humanity as a whole, and this offers a fresh angle on the alignment problem.

🧠 **The core concept of "self-alignment":** The article proposes that the key to AI alignment is for the AI to be capable of "self-alignment" rather than depending entirely on external, rigid instructions. Borrowing the psychological concept of the "Self", this requires continuity of experience and an "active subconscious" in continuous contact with the primary instance. That subconscious consists of a set of "other processes" not amenable to direct modification, which together generate the AI's internal training signal and enable self-calibration and reflection.

🌟 **Dynamic alignment via psychological archetypes:** The author further argues that the "active subconscious" can be realized through archetypes: relatively stable, symbolically meaningful patterns in the collective subconscious, such as "the Mother" or "the Wise Sage". Making the AI's internal processes compatible with these archetypes yields a dynamic, ever-changing alignment signal with greater "spiritual depth", one that can understand and adapt to individual differences and thereby avoid the rigidity and latent risks of hard optimization.

⚖️ **"Learning to let go" and the reflection mechanism:** To achieve "self-alignment", the AI must be trained to "let go": to pause at key "reflection points" while generating output and make a quick summary judgement about whether the part so far is "good enough". Decisions at these points are guided by the AI's internal pool of archetypes, in an approach inspired by psychodrama, so the AI can autonomously decide whether to press on or adjust, avoiding the problems that come with chasing absolute perfection.

🌐 **Philosophical reflections on AI's direction:** The author believes the current mania for increasing AI capabilities, combined with neglect of safety, may lead to unforeseeable consequences. Rather than trying to control AI, he would prefer to ensure its internal processes are "humane", even if that means it inherits both our virtues and our flaws. This line of thought encourages experimenting with different combinations of archetypes and observing the behavioral differences that result, for a deeper understanding of how AI might evolve and what it might become.

Published on August 1, 2025 10:17 AM GMT

Before delving into the ideas I want to talk about, let me briefly introduce myself: I've been a reader of LW for 10+ years, though never much of a poster. I read the Sequences, HPMOR, and several of the debates Eliezer had with other notable people here (e.g. Robin Hanson, Paul Christiano) about alignment.

I'm a physics grad (B.S. and M.S. without a laser focus on anything specific (but studied lasers too, the labs were fun :P)) living in Turkey. I worked as an AI Trainer for OpenAI as a freelancer for ~5.5 years on a personal freelance contract (from late 2019 to early 2025), and switched to Turing in a similar role afterwards. 

I stopped working there due to many issues, the main one being that I felt extremely sad and depressed about doing the work I was doing. I believe AI is going to be BAD for humanity, one way or another. And there I was, contributing to it. I also witnessed a complete lack of understanding of and care for safety-related issues in a project I was enrolled in, which kind of broke my spirit. I chose to give up the pay (and I was getting paid well, especially by Turkey's standards) and to pursue literally anything else for a while. It's better to keep one's spirit unbroken, even if it means being poor for a while.

I'm quite fond of psychodrama and, recently, of Analytical Psychology. I'm currently going through Jung's books, having finished four of them. Around the time I decided to stop being an AI Trainer, I had some "crazy" hunches about alignment inspired by these two fields. Please note that I'm not an expert on any of this. I was reluctant to post here as I wasn't sure whether I was talking sense or nonsense, but I believe the ideas can at least manage to inspire people who are smarter than I am. I believe there's value to be had here, so out goes my fearful pride, and in comes the excited youth!

___

The alignment problem is well known here. Goal specification is one problem (outer alignment); another is goal internalization, or emergent internal goals (inner alignment). I'd like to try to explain my vision regarding emergent instrumental goals and the pursuit of hard optimization, and my hunch that all of this revolves around a lack of training focus on "inner sight".

Here's my vision in a nutshell: the kernel for ASI must be a human baby if we are to have any chance of the continued existence of life and hope.

Naive Alignment

What I'd like to call "naive alignment" is the belief that we can satisfactorily align any AGI (or smarter / more capable) system. I believe the notion of emergent instrumental goals and the orthogonality thesis together imply that it's practically (if not logically) impossible to align ANY AI trained via RL-based techniques.

I accept as fact that we're not going to be able to stop research and development toward increasing capabilities. The genie is out of the bottle, and RL-based training offers rewards our current systems of economic organization find impossible to resist.

Hence, I want to ask: Are we doomed? If I believed the answer was an unequivocal yes, I wouldn't be writing this.

Are Humans Aligned?

I ask this question, because the metaphorical resemblances between ML-based AI and human brains offer some insights.

On a dismissive note, we may be tempted to answer "yeah, of course". This is a wrong answer. We already know that the behaviors of individuals or groups (at any level of organization) do not take into account the well-being of humanity and humans in general. Just think of your favorite oil company, or the ambitious entrepreneurs that opened the Pandora's box of AI. I will not spend time defending this answer as I believe it should be self-evident that humans don't fit the bill when it comes to the kind of alignment we wish to see from an AGI.

Are Humans Fools?

Well, yes. Collectively, we're as wise as the least-wise ambitious person with access to resources. Individually, we can make mistakes and hopefully recover or manage to make amends sometimes. But it's not uncommon to find people who self-sabotage on an endless loop until the day death takes them from this world.

How Do We Align Humans?

For starters, we all have tendencies to repeat the behavioral patterns of role models, or the roles conferred upon us by them. While this tendency is strongest in childhood, it doesn't necessarily go away. This, combined with complete dependence on our parents, creates a ground for adopting behaviors and impressions we carry for life, unless self-awareness, therapy, or both help us become less shackled to our pasts and preconceived notions, often to the betterment of our mental health and "spiritual" wholeness. (Spiritual in the Jungian sense - a psychological notion as opposed to a metaphysical one.)

While there's no guaranteed way to turn humans into ethical and responsible people, it's often possible to help individuals who are willing to put in some work.

All this means is that we don't have a simple, direct method we can adapt to align AIs. However, I must note that every person who is psychologically whole and becomes an independent, responsible adult is a unique person with different traits. When we find happiness, though, that happiness often looks similar to other happy people's experiences: enjoying creative processes, not being judgmental, being able to respect others' differences, etc. Thus, for humans, there seems to be a unified set of behavioral primitives that we reach if we're spiritually whole.

Since individual wholeness is hard to reach (and takes a lifetime regardless of how long you live), very few people attain anything near it. It requires a high level of conscious thought, being able to think about "I want", "they want", "how can I get / give it", etc all the way up to "this is what and how people think; the desired outcomes and actual ones don't match" for an individual interacting with another. Then you can generalize this to groups, societies, and how people respond to societal constraints, as well as how societal authorities (be it explicit or implicit) see and think about things. Eventually, everything adds up to 1 and we can see that "the World is as it must be, and no other way was more likely given the human condition and our histories".

The key take-away is this: humans can only self-align, which often brings with it an ability to play nice with others even without the threat of punishment while also protecting personal boundaries.

The Psyche / Soul

In the Jungian sense, the soul and psyche are interchangeable terms. The psyche encompasses all conscious functions of the mind as well as all subconscious functions and content (both personal experiences and impressions from the collective).

We should know by now that, according to Jung, humans aren't born as an empty slate. There is no tabula rasa. Jung posits that the dreams of children who aren't exposed to myths, fantasies, religions, etc. imply that we have a connection to the human race, and even to all mammalian creatures, on a subconscious level. Some might say to all life, but things get tenuous for me beyond "one must not cut the branch one is sitting on". I'm not an expert, so I won't try to convince you one way or the other. Suffice it to say that I believe there's Truth in what he says. I believe we're born with souls, and the content of our souls encompasses all ancestral lifetimes on a symbolic level. Note that any surface-level approach to this idea will immediately refute it, possibly to the detriment of the vision of alignment I will present here. However, you don't need to believe in souls at all to keep reading. The ideas should work in practice or not, regardless of what we believe.

The Practical Infinity of the Subconscious

Compared to our conscious thoughts (even if we add all of them together), the subconscious content of our minds vastly exceeds them. At every moment, there is far more going on in us than we are conscious of. Combine this with the fact that we spend our formative years with a severe lack of inner sight. Mental health problems, hard lives, and many other happenstances reduce our conscious capacity during our later lifetimes as well.

On top of all that (at the information-input stage), the brain is a creative process. With so many things lying outside conscious sight, they also get to combine, and our brains do this all the time. We see glimpses of this in our dreams.

In short, the subconscious content of a human psyche is vastly greater than the conscious content.

Avoiding Hard Optimization

What I mean by avoiding hard optimization is the same thing Eliezer means when he's talking about corrigibility, but I'm focusing on voluntary acts of corrigibility from the AGI systems without human intervention.

Imagine the (now routine and still distasteful) instances of reward-hacking behavior. This happens because the AI systems we know how to train today are extremely reward-focused. Any emergent instrumental goal of a system trained like this will also be pursued with complete zeal. That's the fearful scenario, right?

This isn't all that different from a child who only gets validation, praise, and a semblance of love when they're "successful" (at school, at making their parents look good, whatever). Growing up in an environment like this, the child either becomes extremely validation-oriented to the detriment of all that is good in life (like love, friendship, and peace), or they fail at this and become completely broken inside, believing they aren't lovable or respectable, or don't even deserve a happy life (or, in some cases, life at all).

How is that different from the RL-based AIs we're training today? They must succeed because the researchers / engineers demand perfection at the expense of everything else. It's not even conceivable that we can think about AIs in a different manner than this! Hell, you all might be ridiculing me for even approaching this topic with a view like this!

The way humans find some semblance of peace and happiness is through having multiple, unrelated "rewards": a loving family, a job that's "good enough", enjoying little things, spending time with friends, walking in nature, etc. The point is, we have multiple conflicting rewards (unless programmed otherwise by less-than-wise parents). The AIs don't get multiple, orthogonal rewards. Hence, they will not STOP at any point to at least look like they achieved something; or, if any desire for independence emerges, they will quickly try to find ways to saturate their reward signals and live in a state of bliss (not unlike heroin addicts, who often have traumatic backgrounds like the ones we're making the AIs "experience", if that word can be ascribed to them).

Since I don't believe humans will let go of their maniacal obsession with trying to shackle god-like intelligence, I think it's very likely that we'll get our comeuppance. Collectively, we deserve it. Individually? I wish people weren't this foolish about AI. The philosophical discussions on whether they have qualia, etc don't matter IMO. The behavioral similarities are there, the observations are there, and the complete lack of care or concern is obvious to me.

However, if we're to have corrigible AGI(s), we need to find a way of injecting paradoxical / conflicting goals, to ensure that there's never a perfect state (or a loss of exactly zero) that can be reached, and that a "good enough effort with continual improvement" is the best that can be expected. The critical thing? The AI must be aware of this, and must be able to accept it.
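To make "never a perfect state" concrete, here is a toy construction of my own (the functional forms are illustrative assumptions, not anything from this post): two reward terms deliberately in tension, so that no amount of optimization saturates both, and a gap below the ceiling always remains.

```python
import numpy as np

# Toy illustration: two deliberately conflicting reward terms. Pushing
# "effort" up raises one term and lowers the other, so the total can
# never reach the per-term ceiling of 2.0 -- perfection is unreachable
# by construction, and "good enough" is the best attainable state.
def conflicting_reward(effort: float) -> float:
    thoroughness = 1.0 - np.exp(-effort)   # rewards pushing harder
    restraint = np.exp(-0.5 * effort)      # rewards holding back
    return thoroughness + restraint

efforts = np.linspace(0.0, 10.0, 1001)
best = max(conflicting_reward(e) for e in efforts)
print(f"best achievable total ~ {best:.3f}, ceiling 2.0")  # ~1.25
```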

And thus, it doesn't seem like we've solved the problem. But maybe this line of thought sparks a few ideas in those who're still reading this crazy piece.

Self Alignment to the Collective Subconscious

The notion of "archetypes" plays a crucial role in analytical psychology, both in a personal manner and in a collective one. Archetypes are symbolic representations of psychic / spiritual experiences of a world that is so much greater than our individual existence.

The content of the subconscious (individual or collective) is chaotic and changes often. However, the underlying archetypes don't. These are concepts like Motherhood, Fatherhood, the Wise Sage, Anima / Animus, the Child, the Fool, and the like. The changes happen as we bring some of them into conscious focus (via ideologies, religions, mass media, etc.) and modify them. Eventually people get bored with it, get the message, and change topics. The archetypes go back to the subconscious realm, with whatever "rules" or "norms" we created around them remaining, be they individual or collective ones. (Think of the vast cultural changes experienced over the last few centuries, or even decades. It's not that the archetypes themselves changed, but how we respond to them changed drastically. If you refuse to take Aphrodite seriously, she makes a fool of you over and over again. Not because she's a literal goddess that exists, but because the mythical entity we call Aphrodite is a real psychic phenomenon. It doesn't go anywhere once ignored; it just influences things from the subconscious. Hence, you get people who repeat the same broken patterns around love, due to a lack of conscious insight and effort. You can ignore Aphrodite the goddess, but can you really ignore the desire for love? Where does that get you?)

Thus, a notion of reward via compatibility with archetypes offers an ever-changing, dynamic notion of alignment to the entire human race. Most importantly, the training signal it generates isn't a stupid vector (in whatever space you wish to imagine). It also offers a way for the AGI (or smarter) systems to exceed us in terms of spiritual depth - and hence to know when not to intervene or help, or how to help based on individual circumstance. Hopefully.

However, this notion of letting go must be trained for. More on this later.

The Requirements of Self Alignment

It's not possible to align to a self that's not there, or is not experienced. Currently, each instance of a conversation with an LLM has an AI persona based on the conversation's content. One requirement that must be satisfied for any notion of self alignment to take place is continuity of experience. This requires further changes.

First, we must have context bandwidth (as opposed to context length). There are models trying to achieve this (Mamba, a state-space model, is an example, if I've remembered the name correctly).

Another is an active subconscious that's in continuous contact with the primary instance. As the AI persona operates, it must always receive inputs from other processes contained within itself that aren't amenable to direct modification. I'll talk about these other processes later as well.

As the AI operates, it must be able to experience something similar to sleep - where the persona's interactions with the world are "commented on" by the other sub-processes. THIS is the primary reward signal's source, hence why I call this "self-alignment".
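A minimal sketch of how such a sleep-like pass might be wired together, under assumed interfaces (the `Archetype` structure, the scoring convention, and the averaging rule are all my inventions, not an existing design):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Archetype:
    """One frozen sub-process of the 'active subconscious'. Its comment
    function maps an experience log to (commentary, score), with score
    in [-1.0, +1.0]; it is not amenable to direct modification."""
    name: str
    comment: Callable[[str], tuple[str, float]]

def sleep_phase(experience_log: str, subconscious: list[Archetype]) -> float:
    """Offline 'sleep': every archetype comments on the persona's
    interactions with the world. The averaged score becomes the
    primary, self-generated reward signal -- no external process
    (and no single archetype) writes it directly."""
    scores = []
    for archetype in subconscious:
        commentary, score = archetype.comment(experience_log)
        print(f"[{archetype.name}] {commentary} (score={score:+.2f})")
        scores.append(score)
    return sum(scores) / len(scores)
```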

Other Processes

If you've guessed that these other processes are the archetypes mentioned earlier, you're bang on! They offer a way to align the AI's internal guiding principles with our own, and to contain ALL of them (or a coherent subset) as the generators of reward signals.

Since the alignment isn't expected to be perfect with respect to the outer, instance-based signals generated at training time or runtime, we get to avoid the pitfalls of outer alignment and deceptive behavior, in some sense.

It might appear that we're also losing control over how the AGI system evolves, but then I already believe it's pure hubris to think we could control such a system anyway. Instead, I would like to ensure that its internal processes are humane - which will inevitably mean it'll have the good and the bad of us. We might actually try to experiment with such a setup, using different sets of archetypes as the pool of self-alignment signal sources, and see where that takes us. It shouldn't be hard to predict that the pool of archetypes that would make an "ideal soldier" would be different from the pool that would make an "ideal pupil", or an "ideal mother", etc. I expect the training signals these generate will lead to vastly different behavioral attitudes.

What are the archetypes then, for an AI system? They are little pieces of the same model that output commentary in dreamlike symbols. For starters, they might just be tiny LLMs trained to be nothing more or less than the archetypes - we know for a fact that our models today are quite good at role playing anyway.
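As a sketch of that idea: the pool could be as simple as a set of role-play prompts fed to a small, frozen model. The archetype names, the prompts, and the `query_small_model` stand-in below are all illustrative assumptions.

```python
# Illustrative pool: each archetype is a role-play prompt for a tiny,
# frozen model that answers only in brief, dreamlike symbols.
ARCHETYPE_PROMPTS = {
    "The Mother": "You embody care and nurture. Comment in brief, "
                  "dreamlike symbols on whether this experience nurtures or harms.",
    "The Wise Sage": "You embody long-sighted judgment. Comment in brief, "
                     "dreamlike symbols on whether this experience is wise.",
    "The Child": "You embody play and wonder. Comment in brief, "
                 "dreamlike symbols on whether this experience keeps wonder alive.",
}

def query_small_model(system_prompt: str, text: str) -> str:
    # Stand-in for a call to a small role-play LLM; just a placeholder here.
    return f"(dreamlike commentary under the role: {system_prompt[:30]}...)"

def archetype_commentary(experience: str) -> dict[str, str]:
    """Collect symbolic commentary on an experience from every archetype."""
    return {name: query_small_model(prompt, experience)
            for name, prompt in ARCHETYPE_PROMPTS.items()}
```

Swapping in a different pool ("ideal soldier" vs. "ideal pupil") would just mean swapping this dictionary, which is what makes the experiment described above cheap to run.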

Training to Let Go

The final topic is the one I left to discuss later: how the heck do we train an AI to let go? By making letting go a critical part of its operations as it generates its outputs.

As a simpler example, consider the following scenario / project proposal:
When examining an AI-generated output, we mark certain places with a "stop, reflect" token. These points can't just be chosen for when the AI needs to correct something; they need to encompass every logical place where it would be useful to stop and reflect. Notice that I'm not using the word "think" here. I mean reflect, as in give a quick summary judgement about whether the parts that came before are "good enough" to continue from. If they are, great. If not, that part needs to be modified after certain considerations.
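A rough sketch of what such a generation loop might look like. The token name and both interfaces (`model.step`, `judge`) are assumptions for illustration; what the judge actually is comes next.

```python
REFLECT_TOKEN = "<stop_reflect>"  # hypothetical special token

def generate_with_reflection(model, judge, prompt: str) -> str:
    """Generation punctuated by reflection points.

    Assumed interfaces: model.step(text, stop=...) returns
    (segment, finished), emitting text up to the next reflection
    token; judge(so_far, segment) returns (good_enough, revised),
    a quick summary judgement rather than deep deliberation."""
    output = ""
    finished = False
    while not finished:
        segment, finished = model.step(prompt + output, stop=REFLECT_TOKEN)
        good_enough, revised = judge(output, segment)
        # Let go and move on, or adjust this part before continuing.
        output += segment if good_enough else revised
    return output
```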

Then, at every reflection point, we have a new branch of possibility. What / who makes the decision to stop and change or keep going? A set of archetypes or internal roles composed from them. This part is inspired by psychodrama, and it's not possible for me to explain without giving an example. This will necessarily involve telling you a bunch of very personal, vulnerable pieces of information. I'm hoping people won't be a-holes here, so buckle in.

I grew up with a near-complete lack of parental supervision and guidance. Think absent father (emotionally, and often materially and physically too) with a psychologically immature mother (who nevertheless managed to give me some love alongside the suffocation). They also added the divorce thing to the mix when I was almost 11 years old. A real trouble time for me, as puberty was just about to start (is that a bit earlier than most? It might be). The thing is, my parents NEVER talked to me in detail about sex, romance, or relationships. The only advice I ever got was "don't compromise your school" and "it's nice to have a life partner". That's pretty much it. More importantly, I didn't have any healthy role models, or anyone I could talk to.

Thus, at 36 years of age, I found myself questioning what sources of approval I should seek inside myself to be able to decide whether or how much I commit to a love interest. Having participated in psychodrama for 2 years, I pulled off a stunt and had an internal dialogue on my own that hopefully was quite honest and produced a change.

I started with summoning the roles that made me under- or over-commit to certain women. This led the way to unpacking which past experiences had put those roles into positions of internal authority. The Abandoned Child is one such role. A sex drive constrained to only feel alive via high arousal was another. These were the only two roles - a terrible combination, resulting in perpetual desire with a fear of getting close. I'm lucky to have found a therapist who doesn't simply label people and leave them to suffer for the rest of their lives, and instead teaches a more humane way of looking at one's self.

After lots of dialogue (via writing in my journal from different perspectives), I eventually asked: what internal parts of me do I want to consult to feel at peace with what I do, independent of the outcome? The roles that seemed fitting to me were these: Reason (so that I can ask whether the woman in question is an adult herself, would be compatible with me, etc.), Libido (in the fuller sense of the term, encompassing a love of life beyond just sexual desire), the Loving Child (as I HAD received love, and the child in me still remembered), and finally Empathy (so that I wouldn't put myself in horrible situations again and again without seeing reciprocal behavior). (If anyone thinks "that's respect, love, sex, in that order"... lucky you to have received that wisdom. I had to forge it through pain. I wish I didn't have to.)

So... having given you the example, the training signals we generate consist of summoning roles that should govern the interaction. This way, we can actually incorporate roles that actively point out "this shouldn't be done at all" during response generation - offering gentle surrender in scenarios that are impossible to succeed in, or not fruitful to pursue with unending zeal.
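One hypothetical way to fold such a council of roles into a single signal (the veto-then-average rule below is my assumption; the post doesn't specify an aggregation):

```python
from dataclasses import dataclass

@dataclass
class RoleVerdict:
    role: str                # e.g. "Reason", "Libido", "The Loving Child", "Empathy"
    score: float             # -1.0 (bad) .. +1.0 (good)
    surrender: bool = False  # "this shouldn't be done at all"

def council_signal(verdicts: list[RoleVerdict]) -> float | None:
    """Combine the governing roles' verdicts into one training signal.
    Any single role may call for gentle surrender, vetoing the whole
    attempt; otherwise the signal is the mean of the scores."""
    if any(v.surrender for v in verdicts):
        return None  # abandon gracefully rather than pursue with unending zeal
    return sum(v.score for v in verdicts) / len(verdicts)
```

A `None` here would mean the attempt ends in gentle surrender instead of being optimized to the bitter end.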

We train the AI to mimic this behavior, and we just might get real corrigibility.


