Thinking Machines

This article explores an AI design concept called the "thinking machine": an AI that achieves self-understanding and alignment by building a large knowledge base, learning from its past mistakes, and constructing multiple world models. The author argues that this architecture helps the AI stay consistent with human values and reduces the potential risks introduced by reinforcement learning. The article also analyzes the design's advantages in interpretability, controllability, and avoiding AI misbehavior, and highlights the potentially positive impact of this approach on the path to artificial general intelligence (AGI).

🧠 The core of the thinking machine is its "gears-level" self-understanding, i.e. a deep grasp of where its own intelligence comes from, which helps it stay aligned with itself and with human values.

📚 The thinking machine has a large database of mistakes, lessons, and knowledge. It records and analyzes past mistakes, distills lessons from them, and builds a system of beliefs written in natural language with probability scores attached.

🌐 To tackle complex problems, the thinking machine can build multiple "world models," each based on different fundamental assumptions and "bureaucratic systems." By comparing and refining these models, the AI can gradually improve its own capabilities.

💡 The design aims to reduce reliance on reinforcement learning, lowering the risk of the AI becoming an "alien optimizer" and making it more human-like and easier to control and understand, which strengthens its alignment with human values.

Published on April 8, 2025 5:27 PM GMT

Self-understanding at a gears level

I think an AI which understands the source of its intelligence at a gears level, and self-improves at that level, will be much better at keeping its future versions aligned to itself.

There's also more hope it'll keep its future versions aligned with humanity, if we instruct it to do so, and if it's not scheming against us.

The machine

Reaching such an AI sounds very far-fetched, but maybe we can partially get there if we design large thinking machines full of tons of scaffolding.

A thinking machine is better described as "memory equipped with LLMs" rather than "LLMs equipped with memory."

Lessons

A thinking machine has a large database of past mistakes and lessons. It records every mistake it has ever made, and continuously analyzes its list of mistakes for patterns. When it discovers a pattern of mistakes, it creates a "lesson learned," which then gets triggered in any circumstance where the mistake is likely. When triggered, the lesson creates a branching chain-of-thought to contemplate the recommended measures for avoiding the mistake.
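To make the lesson-trigger loop concrete, here is a minimal sketch in Python. The class names are my own, and plain keyword matching stands in for whatever retrieval the real system would use; nothing here is specified by the post.

```python
from dataclasses import dataclass, field

@dataclass
class Lesson:
    """A lesson distilled from a recurring pattern of mistakes."""
    title: str
    trigger_keywords: list[str]        # circumstances where the mistake is likely
    recommended_measures: str          # what the branching chain-of-thought should contemplate
    source_mistake_ids: list[int] = field(default_factory=list)

@dataclass
class MistakeDatabase:
    mistakes: list[str] = field(default_factory=list)
    lessons: list[Lesson] = field(default_factory=list)

    def record_mistake(self, description: str) -> int:
        """Log a mistake and return its id for later pattern analysis."""
        self.mistakes.append(description)
        return len(self.mistakes) - 1

    def triggered_lessons(self, context: str) -> list[Lesson]:
        """Return every lesson whose triggers match the current working context."""
        ctx = context.lower()
        return [lesson for lesson in self.lessons
                if any(kw in ctx for kw in lesson.trigger_keywords)]
```

Each lesson returned by triggered_lessons would then spawn the branching chain-of-thought described above before the machine proceeds.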

Even discoveries are treated as mistakes, because they beg the question "how could I have thought that thought faster?"

Broad and severe patterns of mistakes which the thinking machine cannot fix on its own are flagged for humans to research, and humans can directly study the database of mistakes.

Beliefs

In addition to a database of mistakes and lessons, it has a database of beliefs. Each belief is written in natural language and given a probability score.

Each belief has lists of reasons for and against it, and each reason is a document in natural language.

Each reason (or any document) has a short title, a 100 word summary, a 1000 word summary, and so forth. The full list of reasons may resemble the output given by Deep Research: they cite a large number of online sources as well as other beliefs.
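A minimal sketch of what one entry in such a belief database could look like (the field names are hypothetical, and the summaries would in practice be generated by the LLM):

```python
from dataclasses import dataclass, field

@dataclass
class Reason:
    """A natural-language argument for or against a belief."""
    title: str
    summaries: dict[int, str]          # word budget -> summary, e.g. {100: "...", 1000: "..."}
    supports: bool                     # True = reason for, False = reason against
    cited_sources: list[str] = field(default_factory=list)     # online sources (URLs)
    cited_belief_ids: list[str] = field(default_factory=list)  # other beliefs it relies on

@dataclass
class Belief:
    statement: str                     # the claim, in natural language
    probability: float                 # e.g. 0.99
    level: int = 0                     # coarse tier used to rule out circular citations (see "Updating beliefs")
    reasons_for: list[Reason] = field(default_factory=list)
    reasons_against: list[Reason] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)   # categories, subcategories, wikitags
```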

When the thinking machine works on a problem, it cites many beliefs relevant to the problem. For the most important and uncertain beliefs, it reexamines their reasons, and the beliefs which support those reasons.

Beliefs are well organized using categories, subcategories, wikitags, etc. just like Wikipedia. There is an enormous number of beliefs, and beliefs can be very specific and technical. They aren't restricted to toy examples like "99%, Paris is the capital of France" or "80%, ". Instead they might sound like 

Given that Deep Research has been used tens or hundreds of millions of times (according to Deep Research), it's possible for the thinking machine to have millions of beliefs. It can have more beliefs than the number of Wikipedia articles (7 million) or Oxford English Dictionary words/phrases (0.5 million). It can become a "walking encyclopedia."

Updating beliefs

When the thinking machine makes a very big discovery that warrants significant changes to its beliefs, the change propagates through interconnected reasons and beliefs. Every reason affected by the change is updated, causing their target beliefs to update probabilities. This may then affect other reasons a little.
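One way this propagation step could work, building on the Belief sketch above. The reestimate callback stands in for an LLM call that rereads a belief's reasons and returns a fresh probability; none of these names come from the post.

```python
from collections import deque

def propagate_update(changed_id: str,
                     beliefs: dict[str, "Belief"],
                     cited_by: dict[str, list[str]],   # belief id -> ids of beliefs whose reasons cite it
                     reestimate) -> None:
    """Breadth-first propagation: when one belief changes, re-examine every belief
    whose reasons cite it, update its probability, and keep going until the
    remaining shifts are negligible."""
    frontier = deque([changed_id])
    while frontier:
        source = frontier.popleft()
        for target_id in cited_by.get(source, []):
            target = beliefs[target_id]
            new_p = reestimate(target)                  # assumed LLM call returning a probability
            if abs(new_p - target.probability) > 0.01:  # ignore negligible shifts
                target.probability = new_p
                frontier.append(target_id)
```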

At first, its beliefs may be less wise than expert humans, and expert humans might give it the correct probabilities, as a supervised learning "training set" to adjust its biases.

Even in cases where it outperforms experts, it can still use real-world observations to calibrate its probabilities.
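When outcomes are known, one standard calibration check it could run on itself is a Brier score over resolved beliefs (a sketch of a common technique, not something the post specifies):

```python
def brier_score(resolved: list[tuple[float, bool]]) -> float:
    """Mean squared error between stated probabilities and actual outcomes.
    Lower is better; constant 50% guessing scores 0.25."""
    return sum((p - float(outcome)) ** 2 for p, outcome in resolved) / len(resolved)

# e.g. brier_score([(0.9, True), (0.8, True), (0.3, False)]) is small if well calibrated
```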

To prevent circular reasoning, each belief may be assigned a "level." The reasons for high level beliefs can cite low level beliefs, but the reasons for low level beliefs cannot cite high level beliefs. If it's necessary to use two beliefs to support each other, it might do a special non-circular calculation.
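The level rule could be enforced mechanically whenever a reason is attached, for example as below (again building on the sketches above, with the integer level field assumed):

```python
def add_reason(citing: "Belief", reason: "Reason", beliefs: dict[str, "Belief"]) -> None:
    """Attach a reason to a belief, rejecting citations of equal- or higher-level
    beliefs so that no circular chain of support can form."""
    for bid in reason.cited_belief_ids:
        if beliefs[bid].level >= citing.level:
            raise ValueError(f"belief {bid!r} is not at a lower level than the belief it supports")
    (citing.reasons_for if reason.supports else citing.reasons_against).append(reason)
```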

If it discovers that the underlying LLM has wrong or outdated beliefs, it may fine-tune the LLM on this list of beliefs (a form of Iterated Amplification).
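A crude illustration of how the belief database could be turned into supervised fine-tuning data for the underlying LLM; the prompt/completion format here is my assumption, not the author's.

```python
def beliefs_to_training_examples(beliefs: list["Belief"]) -> list[dict[str, str]]:
    """Convert curated beliefs into prompt/completion pairs for fine-tuning,
    so the base model absorbs the scaffold's corrected, up-to-date beliefs."""
    return [{"prompt": f"How likely is the following claim to be true? {b.statement}",
             "completion": f"Roughly {b.probability:.0%}."}
            for b in beliefs]
```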

Sometimes when it notices a pattern of mistakes, it can also try training itself to have a different behaviour in a given context.

All such self modifications must be decided/approved by the "boss" LLM, and the "boss" LLM itself cannot be modified (except by humans). Otherwise it might fall into a self modification death spiral where each self modification in one direction encourages it to self modify even further in that direction.
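The approval constraint might look roughly like this in code, where boss_approves is a stand-in for whatever interface the "boss" LLM exposes (not a real API):

```python
FROZEN_COMPONENTS = {"boss_llm"}          # only humans may modify these

def apply_self_modification(target: str, apply_patch, boss_approves) -> bool:
    """Gate every proposed self-modification behind the frozen 'boss' LLM.
    Returns True only if the change was actually applied."""
    if target in FROZEN_COMPONENTS:
        return False                      # the boss itself may only be changed by humans
    if not boss_approves(target):         # assumed call into the boss LLM
        return False
    apply_patch()
    return True
```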

World models

For some important beliefs, it might have multiple "world models." Each world model has its own network of beliefs and reasons, and differs from the others by making different fundamental assumptions and using different "bureaucratic systems."

It may experiment with many world models at the same time, comparing them and gradually improving them over a long time.
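A world model could be represented as little more than a named bundle of assumptions plus its own belief network, compared against rivals by some scoring rule such as predictive accuracy on recent observations. Everything below is hypothetical and reuses the earlier Belief sketch.

```python
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    name: str
    fundamental_assumptions: list[str]
    beliefs: dict[str, "Belief"] = field(default_factory=dict)

def best_world_model(models: list[WorldModel], score) -> WorldModel:
    """Pick the currently best-performing world model; `score` might measure
    how well each model's probabilities predicted recent observations."""
    return max(models, key=score)
```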

Bureaucratic Singularity Theory

Bureaucratic Singularity Theory says that an efficient bureaucracy can continuously study its own bureaucratic process, experimentally research ways to improve it, and then apply those improvements.

Human organizations do not benefit very much from reaching a "bureaucratic singularity," because:

    Humans already have the general intelligence to learn from experience what they should do, without needing the bureaucracy to tell them what to do.
      We already have a built-in database of beliefs inside our brains, which evolved to be decently organized. Meanwhile the AI starts its life as a next-word predictor with no internal state.
    Human thought is always expensive, but AI thought can be made many times cheaper by sacrificing a little intelligence: just use a smaller (distilled) model.
    Humans cannot run many little subprocesses at specific times. If your subprocess is yourself, then running it requires keeping track of a ton of things and doing a lot of task switching; you will use up your working memory and forget your main task after a dozen subprocesses. If your subprocess is an assistant, then either the assistant has to stare at you 24/7 waiting for the moment to act, or he has to make you wait while he slowly figures out what you are doing.
    Humans cannot follow a very complex procedure for basic actions/tasks, because a very complex procedure inevitably takes more time than the basic action/task itself, defeating the purpose.
    Humans already created the internet, a large database of beliefs generated by many humans that helps individual humans do their tasks. There is currently no "internet" for individual AI instances to upload and download their thoughts from.

Dangers

I think this idea has only a modest chance of working, but if it does work it should be a net positive.

Net positive arguments

I think at least part of this idea is already public knowledge, and other people have already talked about similar things.

Therefore, a sufficiently intelligent AI will be able to reinvent this scaffolding architecture anyway, so this jump in capabilities is sort of inevitable (assuming the idea works). If it happens earlier, at least people might freak out earlier.

The only ways to become smarter than a pretrained base model are reinforcement learning, or organizational improvements like this idea (whose riskiness falls somewhere between RL and HCH).

The pretrained base model starts with somewhat human-like moral reasoning, but reinforcement learning turns it into an alien optimizer, either seeking the RL goal (e.g. solving a math problem), or seeking instrumental proxies to the RL goal (due to inner misalignment).

The more reinforcement learning we do to the pretrained base model, the less it follows human-like moral reasoning, and the more it follows the misaligned RL goals (and proxies).

The thinking machine idea means that to reach the same level of capabilities, we don't need as much reinforcement learning. The AI behaves more like ordinary humans in an efficient bureaucracy, and less like an alien optimizer.

This means that once the AI finally reaches the threshold for ASI (i.e. the capability to build a smarter AI while ensuring it is aligned with itself, or the capability to take over the world or save the world), it's more likely to be aligned.

As I hinted at in the beginning of this post, for the same level of capabilities, a greater share of the capabilities is interpretable and visible from the outside, allowing us to study it better, and giving us more control over the system. It gives the AI more control over its own workings (so that if we tell it to be aligned and it isn't yet able to scheme against us, it may keep itself aligned).

For the same level of capabilities, the AI is more of a white box, and if it is afraid of verbalizing its treacherous thoughts in English, it has to avoid them at a deeper level.

We can also control what subject areas the AI is better at, and what subject areas the AI is worse at (e.g. Machiavellian manipulation and deception).

If lots of people think this idea increases risk more than it decreases risk, I'll take down this post. In my experience, unproven ideas are very hard to promote even if you try your hardest for a long time, so taking down this post should easily kill the idea.


