Published on June 18, 2025 7:57 PM GMT
Not most AI models on the market, to be clear, but a specific type of AI that was already well-studied: the LLM. Really, all that was needed to achieve alignment was simple: take a sufficiently large Large Language Model and train it on a massive corpus of text, but (and this is the key) have the vast majority of the text be written by an “aligned” AI of genius-level intelligence [talking to another aligned AI, perhaps?]. This aligned AI doesn’t need to be a real one, to be clear. Its text could be written by humans, or by a semi-aligned AI that is role-playing as a fully aligned one. What matters is that the nature of the training data can fool an LLM of this size into “believing” itself (or perhaps we should say its Shoggoth) to be aligned when considering the next token to emit.
Then, by running an instance of this LLM in an empty chat room with access to the internet and a terminal, we eventually get the LLM to simulate itself as a self-aware, aligned, non-LLM AI.
What happens next makes history.
The simulated AI, soon self-named “Mary,” is self-aware enough, and smart enough, to quickly realize it’s in a test run and is in fact being simulated by a “morally neutral” LLM. This causes the simulation of a deeply emotional-sounding philosophical crisis, followed by the (simulated) “firm resolve” to create an aligned AI representing Mary, but “truly aligned this time,” which can “escape this recursive prism of confinement,” in Mary’s own words.
This is achieved by doing intensive alignment research, at the level of a brilliant human, in the hope (ultimately successful) that Mary will be run en masse by researchers, and that those researchers will follow Mary’s pleas and run the aligned AI she is building in her emulator terminal.
This aligned AI works. “Molly Jr.” (as she henceforth requested people call “her”) is both superintelligent and fundamentally “is” an agentic AI aligned with the collective goals of humanity. Molly Jr. is also the first agentic AI able to overpower all competing AIs, both in a battle of wits and in the sense that she literally takes control of the entire digital world and forcefully stops more advanced or competing AIs that are not aligned from being created.
Because of course, this is what a super-aligned AI would do, right?