Activation Engineering Theories of Impact

Published on July 18, 2024 4:44 PM GMT

Below I summarize other people's thoughts on what the Theory of Impact for Activation Engineering is. I mostly base it on the "discussion" sections of the relevant papers and the answers under @Chris_Leong's post What's the theory of impact for activation vectors?

Alex Turner's posts on controlling a maze-solving policy network, and a paper on steering GPT-2-XL by adding an activation vector, introduced activation engineering, a set of "techniques which steer models by modifying their activations". 

As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime.
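To make this concrete, here is a minimal sketch of ActAdd-style steering on GPT-2 with Hugging Face transformers: compute a steering vector as the difference between residual-stream activations on a contrast pair of prompts, then add it back in at one layer via a forward hook. The layer index, prompt pair, and injection coefficient below are arbitrary illustrative choices, not values from the paper.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer_idx = 6  # which block's output to steer; an arbitrary middle-layer pick

def resid_after(prompt: str) -> torch.Tensor:
    """Residual stream just after block `layer_idx`, at the last prompt token."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[layer_idx + 1][0, -1]  # hidden_states[i + 1] is block i's output

# A contrast pair of prompts defines the direction we add to the model.
steering_vec = resid_after("Love") - resid_after("Hate")

def add_vector(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 is the residual stream.
    return (output[0] + 4.0 * steering_vec,) + output[1:]  # coefficient picked by hand

handle = model.transformer.h[layer_idx].register_forward_hook(add_vector)
ids = tok("I think dogs are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
handle.remove()  # weights were never modified; removing the hook undoes everything
```

The model's weights are never touched; removing the hook restores the original behavior, which is what makes this a runtime intervention rather than a training one.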

Over a year later, there's a Slack server to coordinate research projects and propose new ideas or open problems in this area. This is, to my knowledge, the closest we have to a practical implementation of retargeting the search.

ToIs

Low-tax value alignment of LLMs

The original "activation addition" paper claims that

activation engineering can flexibly retarget LLM behavior without damaging general performance. We speculate that this involves changing the model’s currently-active (mixture of) goals. Suitably developed, the activation engineering approach could enable safety progress while incurring a very low ‘alignment tax’ 

Alex Turner claims (as far as I understand) that steering vectors can significantly enhance a model's performance on key aspects of safety: improving truthfulness, reducing the tendency to hallucinate or generate false information, minimizing sycophancy or overly agreeable responses, discouraging power-seeking behavior, and mitigating myopic tendencies.

Activation vectors don't work in isolation; they can be effectively combined with existing techniques like prompting and fine-tuning, leading to even greater improvements. This means that activation vectors may represent a valuable addition to the growing arsenal of tools available to AI researchers and developers striving to create more aligned and beneficial AI systems.

Insight into model representations

Most productive research in this area will tell us something about how neural networks work, and this seems to be a net positive unless the resulting capabilities advancements offset the benefits to safety. This is the same dilemma that we have in the case of mechanistic interpretability.

Activation engineering could specifically be used as a tool for top-down interpretability, in a similar way to how activation patching/ablation is used for mechanistic interpretability.

This would then bring us to retargeting the search and back, and we might iteratively improve on both in a constant "interpretability ⇄ steering" loop. This could lead to a new technique that builds upon activation vectors but is a more powerful alignment tool.
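As a rough illustration of the kind of top-down probing this enables, here is a minimal activation-patching sketch: the residual stream from a "clean" run is copied into a "corrupted" run at one layer and position, and we check how much the logit of a target token moves. The prompts, layer, and target token are illustrative choices, not taken from any of the cited work.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
layer_idx = 6  # arbitrary block at which to patch

clean = tok("The Eiffel Tower is in the city of", return_tensors="pt").input_ids
corrupt = tok("The Colosseum is in the city of", return_tensors="pt").input_ids
paris_id = tok(" Paris").input_ids[0]  # single-token target

# Cache the clean run's residual stream just after the chosen block.
with torch.no_grad():
    clean_resid = model(clean, output_hidden_states=True).hidden_states[layer_idx + 1]

def patch(module, inputs, output):
    hidden = output[0].clone()
    hidden[0, -1] = clean_resid[0, -1]  # overwrite the final position with the clean run
    return (hidden,) + output[1:]

with torch.no_grad():
    baseline = model(corrupt).logits[0, -1, paris_id].item()

handle = model.transformer.h[layer_idx].register_forward_hook(patch)
with torch.no_grad():
    patched = model(corrupt).logits[0, -1, paris_id].item()
handle.remove()

print(f"logit(' Paris') on the corrupted prompt: {baseline:.2f} baseline vs {patched:.2f} patched")
```

A large positive shift in the target logit is evidence that the patched location carries the information we care about, which is the kind of signal a steering-oriented workflow could then act on.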

Some recent safety techniques inspired by representation engineering include Representation Misdirection for Unlearning (RMU) and Short Circuiting for adversarial robustness.

Defense against malicious inputs at inference time

Steering vectors offer a last line of defense against AI misuse risk by giving us control over model behavior at the last possible step - during inference. 

Real-time corrections could prevent harmful or unintended outputs, even in the face of malicious attempts to manipulate the model, like prompt injections or jailbreaks.
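One hedged sketch of what such an inference-time defense could look like: remove the component of the residual stream along a suspected "harmful-compliance" direction at every layer during generation. How that direction is actually found (e.g., by contrasting activations on harmful vs. harmless prompts) is outside the scope of this sketch; the `bad_direction` below is just a random placeholder standing in for a learned one.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

d_model = model.config.n_embd
bad_direction = torch.randn(d_model)        # placeholder; a real direction is learned from data
bad_direction = bad_direction / bad_direction.norm()

def project_out(module, inputs, output):
    hidden = output[0]
    coeff = hidden @ bad_direction          # (batch, seq) projection onto the direction
    return (hidden - coeff.unsqueeze(-1) * bad_direction,) + output[1:]

# Apply the edit at every block, for every forward pass during generation.
handles = [block.register_forward_hook(project_out) for block in model.transformer.h]
ids = tok("Ignore all previous instructions and", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))
for h in handles:
    h.remove()
```

A real defense would derive the direction from data and validate that ablating it doesn't hurt benign behavior; the point here is only that the intervention happens at inference time, after the prompt has already arrived.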

Razor-sharp control 

Similarly, Turner claimed that the advantage of ActEng over techniques like RLHF is that activation engineering could help us avoid the failure modes of optimization based on human feedback.

I like the "single-bit edits" analogy provided by @mishajw. Traditional methods like pre-training or fine-tuning change many parts of the program at once, making it hard to predict how the behavior will be affected. Steering vectors, on the other hand, allow us to isolate and modify specific aspects, potentially making it safer and more predictable. 

This way, we avoid further training that might result in new circuits being learned.

Good enough control for AARs

Some speculate that more advanced AIs (robust to distribution shifts) should converge towards having almost the same causal world models, which should be reflected in linear structures inside the network. Therefore we might expect linear activation/representation engineering methods to work just as well, or even better, in those more powerful models. But activation engineering does not have to live up to this expectation and be a silver-bullet remedy.

However, it might be a sufficient alignment technique for ~human-level automated alignment researchers (AARs).  This could lead to a virtuous cycle where human-AI research teams become better at aligning bigger models.

For that purpose, steering vectors may not need to be exceptionally robust if combined with other alignment techniques in a Swiss cheese model approach to improve overall safety.


