Dream, Truth, & Good

This post examines how current large language model (LLM) training merges three layers: the "dream machine", the "truth machine", and the "good machine". The author argues that today's training conflates these layers, making it hard for the model to keep creativity, factuality, and morality apart when generating content. Decomposing the LLM into separate dream, truth, and good layers would allow its capabilities to be controlled and used more deliberately: the dream layer generates, the truth layer verifies, and the good layer handles ethics, improving both performance and safety. This layered approach would not only improve the LLM on specific tasks but also let it serve people better.

🧠 **Dream machine layer**: Pre-training on large amounts of internet data gives the LLM a powerful "prior" and strong text-generation ability. The author suggests a more structured approach that infers metadata such as author vectors and date vectors, giving better handles on and control over what is generated.

✅ **Truth machine layer**: Through fine-tuning, the LLM is trained to hallucinate less and output more truthful information. The author proposes finding a maximally "knowledgeable" author vector and optimizing it to improve the LLM's truth-telling, while ensuring this does not damage the original generative ability.

😇 **Good machine layer**: This layer is trained to produce "helpful, honest, harmless" output, reasoning via the truth layer so that the LLM behaves well ethically. The author notes it could be tuned to different value systems and belief systems to serve different users.

Published on February 24, 2025 4:59 PM GMT

One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers":

- a "dream machine" layer, responsible for open-ended generation;
- a "truth machine" layer, responsible for factual accuracy;
- a "good machine" layer, responsible for helpful, honest, harmless behavior.

I've quoted Andrej Karpathy before, but I'll do it again: 

I always struggle a bit when I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
[...]
I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest feature. The LLM Assistant has a hallucination problem, and we should fix it.

- Andrej Karpathy

Failing to properly distinguish the "dream machine" capabilities from other (truth-oriented or good-oriented) capabilities hobbles today's LLMs by mixing these things together. If you ask Claude to write fiction, there's a strong tendency for the "Claude voice" to get mixed into the fiction being generated. More generally, the base model (IE, only the generative pre-training) is great at extrapolating text; the subsequent training hobbles this capability, because care is not taken to preserve it.

Habryka mentions this with respect to experiments with LLM-augmented text editing:

Using base models has at least so far been essential for getting any useful writing work out of LLMs, with the instruction-tuned models reliably producing obtuse corpo-speak when asked to engage in writing tasks. 

I expect that mixing truth-orientation with good-orientation has similar problematic consequences. 

A Modest Proposal

Dream Machine Layer

My basic idea here is not new: instead of pre-training on lots and lots of text from the internet in an unstructured way, I think it would be better to take a more structured approach which accounts for metadata via inferred author vectors, date vectors, etc. I don't have a specific proposed architecture, but the result should still be able to do raw (unlabeled) text prediction well, by inferring the author vectors and other latents as it goes. I have in mind a semi-supervised approach; some real metadata labels are provided when authorship is known, but labels are inferred where they are absent.[1] 
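
To make the idea concrete, here is a minimal sketch of a language model that conditions on an author vector: looked up from a table when a metadata label is present, inferred from the text itself when it is absent. This is my own illustration, not an architecture from the post, and all class and variable names are hypothetical.

```python
# Minimal sketch (not the post's architecture): a toy decoder LM conditioned on an
# author vector that is looked up when metadata exists and inferred when it doesn't.
import torch
import torch.nn as nn

class AuthorConditionedLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, n_authors=500):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.author_table = nn.Embedding(n_authors, d_model)              # known authors
        self.author_encoder = nn.GRU(d_model, d_model, batch_first=True)  # infers a vector from text
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, vocab_size)

    def author_vector(self, tokens, author_id=None):
        if author_id is not None:                          # supervised case: real metadata label
            return self.author_table(author_id)
        _, h = self.author_encoder(self.tok_emb(tokens))   # semi-supervised case: infer from text
        return h[-1]

    def forward(self, tokens, author_id=None):
        a = self.author_vector(tokens, author_id)          # (batch, d_model)
        x = self.tok_emb(tokens) + a.unsqueeze(1)          # condition every step on the vector
        out, _ = self.decoder(x)
        return self.head(out)                              # next-token logits

# Training mixes both cases: next-token cross-entropy everywhere, plus (when labels
# exist) a term tying the inferred vector to the table entry for that author.
model = AuthorConditionedLM()
tokens = torch.randint(0, 1000, (2, 16))
logits_labeled = model(tokens, author_id=torch.tensor([3, 42]))
logits_unlabeled = model(tokens)                           # author vector inferred on the fly
```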

This gives us much better "handles" into what the generative hallucination is doing; instead of trying to cleverly prompt it, we can set the latent metadata vectors to whatever we want. We can, for example, interpolate the vectors of several authors to see what that looks like. We can also mix-and-match vectors in a more semantic way, by looking for meaningful dimensions in the vectors (EG doing PCA across authors and trying to interpret the resulting dimensions).
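
Continuing the toy sketch above (again, purely illustrative), the "handles" become ordinary vector arithmetic: interpolating two author vectors, finding principal directions across the author space, and generating from a constructed vector that corresponds to no real author.

```python
# Hedged usage sketch, assuming the AuthorConditionedLM toy model defined earlier.
import torch

with torch.no_grad():
    A = model.author_table.weight                # (n_authors, d_model) learned author vectors
    v_mix = 0.5 * A[3] + 0.5 * A[42]             # interpolate two authors

    # Look for interpretable directions across the whole author space (PCA).
    centered = A - A.mean(dim=0)
    U, S, V = torch.pca_lowrank(centered, q=10)  # top-10 principal directions
    v_shifted = A[3] + 2.0 * V[:, 0]             # push author 3 along the first component

    # Generate by conditioning on the constructed vector instead of a real author.
    tokens = torch.randint(0, 1000, (1, 16))
    x = model.tok_emb(tokens) + v_mix.unsqueeze(0).unsqueeze(1)
    out, _ = model.decoder(x)
    logits = model.head(out)                     # next-token logits under the blended "author"
```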

This is a scientifically interesting project as well; in line with microscope AI, we get to learn things about the world. The inferred author vectors and date vectors give us interesting information, and the structure of the vector space also gives us interesting information. This is similar to the recently-announced project to deliberately attempt to model the entire world via AI. We can query location and date vectors which are realistic, but which never existed, to see what the AI model has inferred about that part of the world -- what could have been written at that time and location, if someone had written it down. (This is a weak ancestor simulation; we can try to construct author vectors for historical figures who didn't write anything down.)

Multimodal capabilities could of course dramatically expand this, producing artificial photos or video etc from different times and locations.

Truth Machine Layer

To build a truth-machine layer on top of this, we fine-tune the system in a truth-oriented way. Conceptually, we are looking for an author vector that knows as much as possible; if there turns out to be a "knowledgeable" dimension in the author-vector space, we'd be turning that up to its maximum (or, if there are multiple dimensions for knowledge in various fields, we're maximizing all of them). More realistically, we might need to fine-tune the whole network to support the existence of a maximally knowledgeable author-vector.

This should be done in such a way as to only increase the capabilities of the network; IE, it should still be good at "dreaming" via other author-vectors, even as it gets better at telling the truth via the truth-oriented author-vector. After all, the truth-oriented author-vector is a real author-vector in the real world: it's the author corresponding to this AI we're trying to train (or more specifically, its truth-oriented layer). So, in some sense, this stage of training is just providing evidence about one more real-world author.
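
One conservative way to get this "only increase capabilities" property, sketched below under the same toy model (my assumption, not the post's prescription), is to freeze the pre-trained network and optimize only the new truth-oriented author vector on truth-oriented data; dreaming through every other author vector is then untouched by construction.

```python
# Minimal sketch, assuming the AuthorConditionedLM above: learn one new truth-oriented
# author vector on factual QA data while every other parameter stays frozen.
import torch
import torch.nn.functional as F

truth_vec = torch.zeros(128, requires_grad=True)   # the new truth-oriented author vector
opt = torch.optim.Adam([truth_vec], lr=1e-3)

for p in model.parameters():                       # base network frozen
    p.requires_grad_(False)

def truth_loss(question_tokens, answer_tokens):
    """Next-token cross-entropy over question+answer, conditioned on the truth vector."""
    tokens = torch.cat([question_tokens, answer_tokens], dim=1)
    x = model.tok_emb(tokens) + truth_vec.unsqueeze(0).unsqueeze(1)
    out, _ = model.decoder(x)
    logits = model.head(out)[:, :-1]
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

# One toy step on random data, just to show the shapes line up.
q = torch.randint(0, 1000, (2, 8))
a = torch.randint(0, 1000, (2, 4))
loss = truth_loss(q, a)
loss.backward()
opt.step()
```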

This special truth-oriented author-vector should also be capable of directly reproducing the capabilities of the whole network; IE, one of many question-answer tasks it is trained on is "act like author X" for all of the author-vectors in the system. This type of training attempts to import all of the implicit world-knowledge of the rest of the system into the truth-oriented author-vector. You can think of it as a sort of introspective capability; this specific author-vector accurately reflects the whole rest of the system.
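
A rough sketch of this introspective objective, still under the toy model above: condition on the truth vector plus a hypothetical "act like author i" control signal, and distill toward the distribution the network produces when conditioned directly on author i. The control-embedding mechanism is my own stand-in for whatever prompting scheme would be used in practice.

```python
# Hedged sketch of the "act like author X" introspection objective, assuming the
# AuthorConditionedLM and truth_vec defined in the earlier sketches.
import torch
import torch.nn.functional as F

imitate_emb = torch.nn.Embedding(500, 128)   # hypothetical "act like author i" control tokens

def introspection_loss(tokens, author_id):
    with torch.no_grad():                     # teacher: condition directly on author i
        x_t = model.tok_emb(tokens) + model.author_table(author_id).unsqueeze(1)
        teacher_logits = model.head(model.decoder(x_t)[0])

    # Student: condition on the truth vector plus the "imitate author i" control signal.
    ctrl = imitate_emb(author_id).unsqueeze(1)
    x_s = model.tok_emb(tokens) + truth_vec.unsqueeze(0).unsqueeze(1) + ctrl
    student_logits = model.head(model.decoder(x_s)[0])

    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_logits, dim=-1),
                    reduction="batchmean")

loss = introspection_loss(torch.randint(0, 1000, (2, 16)), torch.tensor([3, 42]))
```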

The author-vector also allows us to explore multiple different notions of truth, perhaps customized to individual users who have different beliefs about what truth-standards should apply.

My proposal for the detailed workings of the truth-oriented layer would be inspired by logical induction, but one could imagine many different forms of truth-oriented training, closer to or further from the currently-dominant paradigm.

Good Machine Layer

Finally, the Good Machine. This can be thought of as yet another author-vector, which is trained on the full "helpful, honest, harmless" type objective. We leverage the truth layer to reason about what is good. This would be the layer that most users get to talk to; it should avoid doing dangerous things like helping the user create weapons of mass destruction.

Again, this could be tuned to multiple different notions of good, representing different value-systems and belief-systems. There could be overarching principles which apply to all such author-vectors, so that users can tweak the vectors driving the system for them personally to represent their concept of good and truth, without being able to jailbreak the system. (Or, more realistically, without being able to do it very easily... this architecture alone will not completely eradicate jailbreaking.)
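
As a toy illustration of "personal tuning under overarching principles" (entirely my own construction, not a mechanism from the post), one could constrain each user's customized good-vector to stay within a fixed radius of a centrally trained one, so personal adjustments can shift emphasis but cannot move the system arbitrarily far.

```python
# Hedged sketch: per-user value customization under an overarching constraint.
# All names and the radius constant are hypothetical stand-ins.
import torch

good_vec = torch.randn(128)          # centrally trained "good machine" vector (stand-in)
MAX_RADIUS = 1.0                     # overarching principle: stay close to the central vector

def constrained_user_vector(user_delta: torch.Tensor) -> torch.Tensor:
    """Apply a user's personal adjustment, clipped to the allowed radius."""
    norm = user_delta.norm()
    if norm > MAX_RADIUS:
        user_delta = user_delta * (MAX_RADIUS / norm)
    return good_vec + user_delta

user_vec = constrained_user_vector(torch.randn(128) * 5.0)   # large edits get clipped
```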

  1. ^

    More specifically, there's a distinction between author vectors (which are entirely inferred) and text labels of attribution (which give author information as a string). There needs to be a learned model which transforms between the two.



