Thinking Machines

This article explores an AI design concept called the "thinking machine": an AI that achieves self-understanding and alignment by building a large knowledge base, learning from its past mistakes, and constructing multiple world models. The author argues that this architecture helps the AI stay consistent with human values and reduces the potential risks introduced by reinforcement learning. The article also analyzes the design's advantages in interpretability, controllability, and avoiding AI misbehavior, and highlights the potentially positive impact of this approach on the path to artificial general intelligence (AGI).

🧠 The core of the thinking machine is its "gears-level" self-understanding, i.e. a deep grasp of where its own intelligence comes from, which helps it stay aligned with itself and with human values.

📚 The thinking machine has a large database of mistakes, lessons, and knowledge. It records and analyzes past mistakes, distills lessons from them, and builds a system of beliefs written in natural language with probability scores attached.

🌐 To tackle complex problems, the thinking machine can build multiple "world models," each based on different fundamental assumptions and "bureaucratic systems." By comparing and refining these models, the AI can gradually improve its own capabilities.

💡 The design aims to reduce reliance on reinforcement learning, lowering the risk of the AI becoming an "alien optimizer" and making it more human-like and easier to control and understand, which strengthens its alignment with human values.

Published on April 8, 2025 5:27 PM GMT

Self-understanding at a gears level

I think an AI which understands the source of its intelligence at a gears level, and self-improves at that level, will be much better at keeping its future versions aligned to itself.

There's also more hope it'll keep its future versions aligned with humanity, if we instruct it to do so, and if it's not scheming against us.

The machine

Reaching such an AI sounds very far-fetched, but maybe we can partially get there if we design large thinking machines full of tons of scaffolding.

A thinking machine is better described as "memory equipped with LLMs" rather than "LLMs equipped with memory."

Lessons

A thinking machine has a large database of past mistakes and lessons. It records every mistake it has ever made, and continuously analyzes its list of mistakes for patterns. When it discovers a pattern of mistakes, it creates a "lesson learned," which then gets triggered in any circumstance where the mistake is likely. When triggered, the lesson creates a branching chain-of-thought to contemplate the recommended measures for avoiding the mistake.
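To make the lesson-trigger loop concrete, here is a minimal sketch in Python. The class names are my own, and plain keyword matching stands in for whatever retrieval the real system would use; nothing here is specified by the post.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    """A lesson distilled from a recurring pattern of mistakes."""
    title: str
    trigger_keywords: list[str]        # circumstances where the mistake is likely
    recommended_measures: str          # what the branching chain-of-thought should contemplate
    source_mistake_ids: list[int] = field(default_factory=list)

@dataclass
class MistakeDatabase:
    mistakes: list[str] = field(default_factory=list)
    lessons: list[Lesson] = field(default_factory=list)

    def record_mistake(self, description: str) -> int:
        """Log a mistake and return its id for later pattern analysis."""
        self.mistakes.append(description)
        return len(self.mistakes) - 1

    def triggered_lessons(self, context: str) -> list[Lesson]:
        """Return every lesson whose triggers match the current working context."""
        ctx = context.lower()
        return [lesson for lesson in self.lessons
                if any(kw in ctx for kw in lesson.trigger_keywords)]
```

Each lesson returned by triggered_lessons would then spawn the branching chain-of-thought described above before the machine proceeds.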

Even discoveries are treated as mistakes, because they beg the question "how could I have thought that thought faster?"

Broad and severe patterns of mistakes which the thinking machine cannot fix on its own are flagged for humans to research, and humans can directly study the database of mistakes.

Beliefs

In addition to a database of mistakes and lessons, it has a database of beliefs. Each belief is written in natural language and given a probability score.

Each belief has lists of reasons for and against it, and each reason is a document in natural language.

Each reason (or any document) has a short title, a 100 word summary, a 1000 word summary, and so forth. The full list of reasons may resemble the output given by Deep Research: they cite a large number of online sources as well as other beliefs.
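A minimal sketch of what one entry in such a belief database could look like (the field names are hypothetical, and the summaries would in practice be generated by the LLM):

```python
from dataclasses import dataclass, field

@dataclass
class Reason:
    """A natural-language argument for or against a belief."""
    title: str
    summaries: dict[int, str]          # word budget -> summary, e.g. {100: "...", 1000: "..."}
    supports: bool                     # True = reason for, False = reason against
    cited_sources: list[str] = field(default_factory=list)     # online sources (URLs)
    cited_belief_ids: list[str] = field(default_factory=list)  # other beliefs it relies on

@dataclass
class Belief:
    statement: str                     # the claim, in natural language
    probability: float                 # e.g. 0.99
    level: int = 0                     # coarse tier used to rule out circular citations (see "Updating beliefs")
    reasons_for: list[Reason] = field(default_factory=list)
    reasons_against: list[Reason] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)   # categories, subcategories, wikitags
```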

When the thinking machine works on a problem, it cites many beliefs relevant to the problem. For the most important and uncertain beliefs, it reexamines their reasons, and the beliefs which support those reasons.

Beliefs are well organized using categories, subcategories, wikitags, etc. just like Wikipedia. There is an enormous number of beliefs, and beliefs can be very specific and technical. They aren't restricted to toy examples like "99%, Paris is the capital of France" or "80%, ". Instead they might sound like 

Given that Deep Research has been used tens or hundreds of millions of times (according to Deep Research), it's possible for the thinking machine to have millions of beliefs. It can have more beliefs than the number of Wikipedia articles (7 million) or Oxford English Dictionary words/phrases (0.5 million). It can become a "walking encyclopedia."

Updating beliefs

When the thinking machine makes a very big discovery that warrants significant changes to its beliefs, the change propagates through interconnected reasons and beliefs. Every reason affected by the change is updated, causing their target beliefs to update probabilities. This may then affect other reasons a little.
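One way this propagation step could work, building on the Belief sketch above. The reestimate callback stands in for an LLM call that rereads a belief's reasons and returns a fresh probability; none of these names come from the post.

```python
from collections import deque

def propagate_update(changed_id: str,
                     beliefs: dict[str, "Belief"],
                     cited_by: dict[str, list[str]],   # belief id -> ids of beliefs whose reasons cite it
                     reestimate) -> None:
    """Breadth-first propagation: when one belief changes, re-examine every belief
    whose reasons cite it, update its probability, and keep going until the
    remaining shifts are negligible."""
    frontier = deque([changed_id])
    while frontier:
        source = frontier.popleft()
        for target_id in cited_by.get(source, []):
            target = beliefs[target_id]
            new_p = reestimate(target)                  # assumed LLM call returning a probability
            if abs(new_p - target.probability) > 0.01:  # ignore negligible shifts
                target.probability = new_p
                frontier.append(target_id)
```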

At first, its beliefs may be less wise than expert humans, and expert humans might give it the correct probabilities, as a supervised learning "training set" to adjust its biases.

Even in cases where it outperforms experts, it can still use real-world observations to calibrate its probabilities.
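When outcomes are known, one standard calibration check it could run on itself is a Brier score over resolved beliefs (a sketch of a common technique, not something the post specifies):

```python
def brier_score(resolved: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probabilities and actual outcomes.
    Lower is better; constant 50% guessing scores 0.25."""
    return sum((p - float(outcome)) ** 2 for p, outcome in resolved) / len(resolved)

# e.g. brier_score([(0.9, True), (0.8, True), (0.3, False)]) is small if well calibrated
```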

To prevent circular reasoning, each belief may be assigned a "level." The reasons for high level beliefs can cite low level beliefs, but the reasons for low level beliefs cannot cite high level beliefs. If it's necessary to use two beliefs to support each other, it might do a special non-circular calculation.
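The level rule could be enforced mechanically whenever a reason is attached, for example as below (again building on the sketches above, with the integer level field assumed):

```python
def add_reason(citing: "Belief", reason: "Reason", beliefs: dict[str, "Belief"]) -> None:
    """Attach a reason to a belief, rejecting citations of equal- or higher-level
    beliefs so that no circular chain of support can form."""
    for bid in reason.cited_belief_ids:
        if beliefs[bid].level >= citing.level:
            raise ValueError(f"belief {bid!r} is not at a lower level than the belief it supports")
    (citing.reasons_for if reason.supports else citing.reasons_against).append(reason)
```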

If it discovers that the underlying LLM has wrong or outdated beliefs, it may fine-tune the LLM on this list of beliefs (a form of Iterated Amplification).
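A crude illustration of how the belief database could be turned into supervised fine-tuning data for the underlying LLM; the prompt/completion format here is my assumption, not the author's.

```python
def beliefs_to_training_examples(beliefs: list["Belief"]) -> list[dict[str, str]]:
    """Convert curated beliefs into prompt/completion pairs for fine-tuning,
    so the base model absorbs the scaffold's corrected, up-to-date beliefs."""
    return [{"prompt": f"How likely is the following claim to be true? {b.statement}",
             "completion": f"Roughly {b.probability:.0%}."}
            for b in beliefs]
```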

Sometimes when it notices a pattern of mistakes, it can also try training itself to have a different behaviour in a given context.

All such self modifications must be decided/approved by the "boss" LLM, and the "boss" LLM itself cannot be modified (except by humans). Otherwise it might fall into a self modification death spiral where each self modification in one direction encourages it to self modify even further in that direction.
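The approval constraint might look roughly like this in code, where boss_approves is a stand-in for whatever interface the "boss" LLM exposes (not a real API):

```python
FROZEN_COMPONENTS = {"boss_llm"}          # only humans may modify these

def apply_self_modification(target: str, apply_patch, boss_approves) -> bool:
    """Gate every proposed self-modification behind the frozen 'boss' LLM.
    Returns True only if the change was actually applied."""
    if target in FROZEN_COMPONENTS:
        return False                      # the boss itself may only be changed by humans
    if not boss_approves(target):         # assumed call into the boss LLM
        return False
    apply_patch()
    return True
```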

World models

For some important beliefs, it might have multiple "world models." Each world model has its own network of beliefs and reasons, and differs from the others by making different fundamental assumptions and using different "bureaucratic systems."

It may experiment with many world models at the same time, comparing them and gradually improving them over a long time.
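A world model could be represented as little more than a named bundle of assumptions plus its own belief network, compared against rivals by some scoring rule such as predictive accuracy on recent observations. Everything below is hypothetical and reuses the earlier Belief sketch.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    name: str
    fundamental_assumptions: list[str]
    beliefs: dict[str, "Belief"] = field(default_factory=dict)

def best_world_model(models: list[WorldModel], score) -> WorldModel:
    """Pick the currently best-performing world model; `score` might measure
    how well each model's probabilities predicted recent observations."""
    return max(models, key=score)
```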

Bureaucratic Singularity Theory

Bureaucratic Singularity Theory says that an efficient bureaucracy can continuously study its own bureaucratic process, experimentally research ways to improve it, and then apply those improvements.

Human organizations do not benefit very much from reaching a "bureaucratic singularity," because:

    Humans already have the general intelligence to learn from experience what they should do, without needing the bureaucracy to tell them what to do.
      We already have a built-in database of beliefs inside our brains, which evolved to be decently organized. Meanwhile the AI starts its life as a next-word predictor with no internal state.
    Human thought is always expensive, but AI thought can be made many times cheaper by sacrificing a little intelligence: just use a smaller (distilled) model.
    Humans cannot run many little subprocesses at specific times. If your subprocess is yourself, then running it requires keeping track of a ton of things and doing a lot of task switching; you will use up your working memory and forget your main task after a dozen subprocesses. If your subprocess is an assistant, then either the assistant has to stare at you 24/7 waiting for the moment to act, or he has to make you wait while he slowly figures out what you are doing.
    Humans cannot follow a very complex procedure for basic actions/tasks, because a very complex procedure inevitably takes more time than the basic action/task itself, defeating the purpose.
    Humans already created the internet, a large database of beliefs generated by many humans that helps individual humans do their tasks. There is currently no "internet" for individual AI instances to upload and download their thoughts from.

Dangers

I think this idea has only a modest chance of working, but if it does work it should be a net positive.

Net positive arguments

I think at least part of this idea is already public knowledge, and other people have already talked about similar things.

Therefore, a sufficiently intelligent AI will be able to reinvent this scaffolding architecture anyway, so this jump in capabilities is sort of inevitable (assuming the idea works). If it happens earlier, at least people might freak out earlier.

The only ways to become smarter than a pretrained base model are reinforcement learning, or organizational improvements like this idea (whose riskiness falls somewhere between RL and HCH).

The pretrained base model starts with somewhat human-like moral reasoning, but reinforcement learning turns it into an alien optimizer, either seeking the RL goal (e.g. solving a math problem), or seeking instrumental proxies to the RL goal (due to inner misalignment).

The more reinforcement learning we do to the pretrained base model, the less it follows human-like moral reasoning, and the more it follows the misaligned RL goals (and proxies).

The thinking machine idea means that to reach the same level of capabilities, we don't need as much reinforcement learning. The AI behaves more like ordinary humans in an efficient bureaucracy, and less like an alien optimizer.

This means that once the AI finally reaches the threshold for ASI (i.e. the capability to build a smarter AI while ensuring it is aligned with itself, or the capability to take over the world or save the world), it's more likely to be aligned.

As I hinted at in the beginning of this post, for the same level of capabilities, a greater share of the capabilities is interpretable and visible from the outside, allowing us to study it better, and giving us more control over the system. It gives the AI more control over its own workings (so that if we tell it to be aligned and it isn't yet able to scheme against us, it may keep itself aligned).

For the same level of capabilities, the AI is more of a white box, and if it is afraid of verbalizing its treacherous thoughts in English, it has to avoid them at a deeper level.

We can also control what subject areas the AI is better at, and what subject areas the AI is worse at (e.g. Machiavellian manipulation and deception).

If lots of people think this idea increases risk more than it decreases risk, I'll take down this post. In my experience, unproven ideas are very hard to promote even if you try your hardest for a long time, so taking down this post should easily kill the idea.


