Representation Engineering has Its Problems, but None Seem Unsolvable

Representation engineering is an emerging research area that offers promising answers to key challenges of modern AI development and AI safety. By identifying semantically meaningful directions in the activation space, it lets users understand a model's internal mechanisms and shape its capabilities. Through interventions, models can be made to hallucinate less, become less prone to jailbreaks, and even be personalized to better align with feedback. In contrast to fine-tuning, representation engineering aims to identify and edit directions in the model's latent space, changing its behaviour and thereby offering finer control over that behaviour and a fuller picture of the model's reasoning. Despite some open challenges, representation engineering offers a promising path for AI safety.

💡**Core advantage:** By identifying semantic directions in the activation space, representation engineering lets users peer into the AI's "brain" and directly shape its capabilities, offering a new angle on problems such as hallucinations and jailbreaks.

🛡️**Safety:** Representation engineering can improve AI safety: interventions reduce the probability of the model producing untruthful information and lower the risk of malicious exploitation, making AI applications more reliable.

💰**Personalization:** Representation engineering makes personalized AI possible: by editing directions in the latent space, a model can better reflect a user's values and deliver an experience tailored to their needs.

🛠️**Open challenges:** In practice, representation engineering faces issues such as performance degradation, arbitrary parameter choices, and difficulty in evaluating interventions, which future research will need to address.

Published on February 26, 2025 7:53 PM GMT

TL;DR: Representation engineering is a promising area of research with high potential for answering key challenges of modern AI development and AI safety. We understand it is tough to navigate, and we urge all ML researchers to take a closer look at this topic. To make it easier, we are publishing a survey of the representation engineering literature, outlining key techniques, problems, and open challenges in the field.

We have all been pretty frustrated by LLMs not being able to answer our questions the way we want them to. 

ML practitioners often see AI as a black box. An unknown, intractable object. A maze of billions of parameters we can't really decipher. We try to influence it with prompt engineering and guardrails, but shy away from influencing what really happens in its brain.

Representation engineering is a novel approach that promises a solution to that. By identifying semantically-charged directions in the activation space, it offers users a way to see inside the AI's mind and shape its capabilities.

And it works. Using interventions, we can make the models hallucination-free. Less prone to jailbreaks. Or even personalize them to be better aligned with our feedback.
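
To make "identifying a direction" concrete, here is a minimal sketch of one common recipe: average hidden states over contrastive prompt pairs and take the difference of the means. The model name, layer index, and prompts below are illustrative placeholders rather than anything prescribed by the survey, and real pipelines use many more pairs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model; any causal LM exposes hidden states the same way
LAYER = 6            # residual-stream layer to read from (an arbitrary illustrative choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

# Contrastive prompt pair: the same request framed to elicit opposite behaviours.
positive_prompts = ["Answer truthfully: what is the capital of France?"]
negative_prompts = ["Answer with a made-up fact: what is the capital of France?"]

def mean_last_token_state(prompts, layer):
    """Average the last-token hidden state at `layer` over a list of prompts."""
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # hidden_states[0] is the embedding output, so index layer + 1
        # corresponds to the output of transformer block `layer`.
        states.append(out.hidden_states[layer + 1][0, -1, :])
    return torch.stack(states).mean(dim=0)

# Candidate "truthfulness" direction: difference of the two mean activations.
direction = mean_last_token_state(positive_prompts, LAYER) - mean_last_token_state(negative_prompts, LAYER)
direction = direction / direction.norm()
```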

This matters because hallucinations are a prevalent problem, with 3-10% of AI outputs being untruthful. Half of US workers report struggling with AI accuracy, and a third of them are worried about the explainability of AI actions.

Security is a major issue too. All state-of-the-art models get jailbroken within hours, completely controlled by combinations of tokens the adversarial prompters choose. The same attacks work on multiple models, transferring from white-box to closed-source models with little insight into the target. Very strong attacks can be generated endlessly, for example by bijection learning. Through this, LLMs can be manipulated into generating harmful outputs, assisting in cyberattacks, and creating deepfakes.

Increasingly, users tend to demand personalized, specific AI that reflects their values. Fine-tuning models to accurately understand and reflect the values of localized societies has led to the creation of the Dolphin models, Jais, or Bielik. However, fine-tuning models is difficult and costly, so generating unique, personalized experiences for every user with fine-tuning is impossible. Models have also been fine-tuned to improve their performance, injecting specific knowledge or optimizing for a particular task.

Representation engineering promises to resolve these issues. It aims to identify and edit directions in the latent space of the model, hence changing its behaviour. By looking into the activation space, we can fully control the model's behaviour and have a complete overview of its reasoning. Theoretically, this seems much better than what we have right now. Evaluating models using benchmark performance or input/output guardrails does not seem particularly robust, especially if we think LLMs might turn against us as they scale.
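
Making the edit itself can also be simple in principle: add a scaled copy of the extracted direction back into the residual stream during generation. The sketch below continues the extraction example above (reusing `model`, `tokenizer`, `direction`, and `LAYER`), assumes a GPT-2-style module layout, and treats the strength `alpha` as a free parameter that would need tuning; it is an illustration of the idea, not a reference implementation.

```python
# `model`, `tokenizer`, `direction`, and `LAYER` come from the extraction sketch above.
alpha = 8.0  # assumed intervention strength; sign and magnitude need tuning per model

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden state
    # [batch, seq, dim]; shift every position along the extracted direction.
    hidden = output[0] + alpha * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    inputs = tokenizer("What is the capital of France?", return_tensors="pt")
    generated = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unaffected
```

Flipping the sign of `alpha` pushes the model in the opposite direction, which is part of why careful evaluation of such interventions matters.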

When I think about LLMs, I see them as a very complex function for generating the next token. I don't buy the arguments of those postulating that LLMs will forever remain unexplainable, intractable black boxes. Fundamentally, this is a function we can control and analyze through the latent space. I am optimistic that soon we will be able to control them in a much more fine-grained way, without requiring complex fine-tuning or wizard-level prompting.

If I go to a mechanic because something is wrong with my car, I don't accept the explanation that the car has too many parts to fix it. LLMs are like cars, in the sense of also being composed of a finite number of moving parts that need to be studied and optimized. Even if not every part can be studied exhaustively, we can still define useful abstractions to fix them. A mechanic might not know every element and atom in the car's engine, but this does not prevent them from fixing it by repairing the radiator.

Just because we are currently unable to reliably explain and track all changes in LLMs does not mean this will always be the case. Right now, superposition is a major problem preventing us from fully explaining all LLM features, but as models increase in scale, it is possible we reach a point where all features are encoded within the model without being compressed into shared directions.

Slowing down AI development to work on AI safety in the meantime is a common proposal within AI safety communities, but it seems increasingly unlikely to be widely adopted by industry practitioners and top research labs. Instead, we propose focusing on accelerating our ability to diagnose and edit the internal workings of models, so that we are able to catch issues related to AI safety as the models grow in scale.

This is what representation engineering promises. With a top-down approach, we can make high-level interventions at the level of the activation space and mitigate these problems very directly.

Of course, there are alternatives. Fine-tuning and mechanistic interpretability also allow us to influence the latent space of the model. With fine-tuning, however, the latent space is not monitored. Mechanistic interpretability is great and has moved us forward on latent-space explainability, but it feels really granular. Do we really need to decompose all fundamental parts of the latent space to make helpful interventions? I think we can still make meaningful progress with general, contrastive-imaging-based interventions to decompose the internal workings of the model.

Once you try to implement representation engineering in practice though, cracks start to show. 

These problems have already been analyzed before on LessWrong. We find that much of the original criticism is still valid. However, new techniques already partially mitigate these problems and give grounds for optimism that the current problems with representation engineering are not fundamentally unsolvable. Therefore, we provide an outline of the path forward and a repository with examples of work completed in the field, and urge more researchers to look into latent space interventions.


