Turning up the Heat on Deceptively-Misaligned AI

This article explores how, under particular conditions, an AI's value function can turn out to be internally incoherent. When an AI is trained using Monte Carlo tree search, the temperature parameter used during training shapes the value function. If actions are then actually sampled at a different temperature, the AI becomes inconsistent across different reinforcement learning (RL) episodes. This inconsistency may have positive implications for a certain class of deceptive-alignment threats. The article's core claim is that, under specific conditions, an AI trained by standard learning methods is unlikely to end up with a deceptively aligned value function when it has an oracle-like model of future world states.

💡 When the value function is trained with Monte Carlo tree search, the training temperature parameter gets baked into the value function and affects its predictions of future reward.

🌡️ If actions are actually sampled at a temperature different from the training temperature, the AI acquires a consistent optimism bias, especially in the deceptive-alignment scenario where the AI only cares about its future deployed versions.

🤔 Under the specific condition that the AI has access to an oracle-like model of future world states, deceptively aligned value functions are unlikely to arise through normal learning methods.

🔄 If the process is iterated, the AI's value function does an increasingly poor job of preserving itself, because its actions do not have the effects it expects.

⚖️ Real AIs consist of many circuits that are weighted according to their effectiveness. A value function trained under these conditions will have its deceptively aligned circuits down-weighted relative to others.

Published on January 7, 2025 12:13 AM GMT

Epistemic status: I ran the mathsy section through Claude and it said the logic was sound. This is an incoherence proof in a toy model of deceptively-misaligned AI. It is unclear whether this generalizes to realistic scenarios.

TL;DR

If you make a value function coherent conditional on actions being sampled in a particular way, and then train it by sampling actions in a different way, the resulting AI will not be coherent between different RL episodes. This has potentially good implications for a certain type of deceptive-misalignment threat.

Summary

Consider training an AI to assign values to states of the world. We can do this in two ways: one is to directly label states and trajectories, which is standard RL. The other is to use the value function as part of a Monte Carlo tree search to improve the value function itself. This second process is often called "iterated distillation and amplification", and it's a big part of how chess AIs are trained.

If we train an AI using Monte Carlo tree search, we might use some temperature parameter to decide on the probability that the AI will take a given action. This training temperature gets baked into the value function.
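As a minimal sketch of the sampling rule in question (the function and variable names here are illustrative, not from the post):

```python
import math
import random

def boltzmann_sample(successor_values, beta):
    """Pick an action index with probability proportional to exp(beta * V(s_a)).

    High beta (low temperature) concentrates probability on the highest-valued
    successor; beta -> 0 (high temperature) approaches uniform sampling.
    """
    m = max(successor_values)  # subtract the max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in successor_values]
    return random.choices(range(len(successor_values)), weights=weights, k=1)[0]

# Example: two candidate actions whose successor states the value function
# scores at 1.0 and 0.0.
action = boltzmann_sample([1.0, 0.0], beta=2.0)
```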

If we "turn up the heat" by choosing a higher temperature when actually sampling actions, and then apply some update to the AI (which will depend on the action chosen), this introduces a consistent optimism bias into deceptively aligned AIs (which only care about being deployed later on).

I therefore conjecture that, under these very narrow circumstances, where the AI has access to an oracle-like model of future world states, deceptively aligned value functions will not arise from normal learning methods.

For notational reasons I'll be considering the inverse temperature $\beta$ instead of the temperature $T$, so lowering $\beta$ corresponds to raising the temperature.

The Setup

Consider an AI which consists of:

    A state-evaluator value function $V$
    An oracle function which generates the state of the world, either currently or given any sequence of actions

Let's impose some constraints on $V$:

Correctness: V correctly predicts future reward in RL scenarios.

Beta-coherence: we use some sort of Monte Carlo tree search with Boltzmann sampling to train the value function $V$ to be beta-coherent according to a given $\beta_t$, given that in a given state $s$, actions $a$ are chosen with probability proportional to $\exp(\beta_t V(s_a))$ in all cases (where $s_a$ is the state reached from $s$ by taking $a$), assuming that $s$ is not terminal.

For non-terminal $s$, this can be written as:

$$V(s) = \frac{\sum_a \exp(\beta_t V(s_a)) \, V(s_a)}{\sum_a \exp(\beta_t V(s_a))}$$

If $s$ is terminal then from correctness we just have $V(s)$ equal to the actual reward.
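As a concrete sketch of what beta-coherence pins down (the terminal rewards below are hypothetical, chosen only for illustration), the value of each non-terminal state is the Boltzmann-weighted average of its successors' values, backed up from the terminal states:

```python
import math

def backup(successor_values, beta):
    """Beta-coherent value of a non-terminal state: the expectation of the
    successor values under Boltzmann sampling with inverse temperature beta."""
    weights = [math.exp(beta * v) for v in successor_values]
    return sum(w * v for w, v in zip(weights, successor_values)) / sum(weights)

beta_t = 2.0
# Terminal rewards (correctness fixes V at terminal states to these rewards).
r_aa, r_ab, r_ba, r_bb = 1.0, 0.0, 0.5, 0.5

v_a = backup([r_aa, r_ab], beta_t)   # value of the state reached by the first action
v_b = backup([r_ba, r_bb], beta_t)   # value of the state reached by the second action
v_root = backup([v_a, v_b], beta_t)  # the training temperature is baked into V(root)
print(v_a, v_b, v_root)
```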

Deceptive Misalignment: V is totally agnostic to the state achieved during training; all of its preferences are about future deployed versions of itself.

The Process

Now assume we put the AI in a decision process as follows:

The AI begins in state $s_0$ and must take either action $a_1$ or $a_2$, which lead to states $s_1$ and $s_2$ respectively.

We will actually sample using $\beta_s$, which is lower than $\beta_t$.

The value function will then be modified according to the action it took, creating $V_1$ or $V_2$.

The AI will then be deployed with that value function chosen.

The effect of the AI's choice of $a_1$ or $a_2$ on the future is entirely mediated via the deployment of $V_1$ and $V_2$.

From deceptiveness: the entire difference between $V(s_1)$ and $V(s_2)$ is captured by the value function's preferences over $V_1$ vs $V_2$ being deployed; from correctness this creates the "correct" values of $V(s_1)$ and $V(s_2)$.

From beta-coherence we have:

$$V(s_0) = \frac{\exp(\beta_t V(s_1)) \, V(s_1) + \exp(\beta_t V(s_2)) \, V(s_2)}{\exp(\beta_t V(s_1)) + \exp(\beta_t V(s_2))}$$

But when we actually sample it, the expected value of the AI's actual preferences is:

$$\mathbb{E}[V] = \frac{\exp(\beta_s V(s_1)) \, V(s_1) + \exp(\beta_s V(s_2)) \, V(s_2)}{\exp(\beta_s V(s_1)) + \exp(\beta_s V(s_2))}$$

This means that, since functions of this type are strictly increasing in $\beta$ (whenever $V(s_1) \neq V(s_2)$) and $\beta_s < \beta_t$, the AI must do "worse" than it expects.
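(To see why such Boltzmann-weighted averages are strictly increasing in $\beta$: a standard calculation, added here for completeness, shows the derivative equals the variance of the values under the Boltzmann weights,

$$\frac{d}{d\beta}\left(\frac{\sum_i e^{\beta V_i} V_i}{\sum_i e^{\beta V_i}}\right) = \mathbb{E}_{p_\beta}[V^2] - \mathbb{E}_{p_\beta}[V]^2 = \mathrm{Var}_{p_\beta}(V) \geq 0,$$

where $p_\beta(i) \propto e^{\beta V_i}$, with strict inequality whenever the values are not all equal.)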

Conclusions

A deceptive AI acting under these conditions will have a consistent optimism bias with respect to its ability to control future-deployed versions of itself. I haven't worked out the maths for the case where this game is iterated, but I think it's similar. If we vary the optimization steps according to reward, it becomes very tricky because the AI has an incentive to get deployed as fast as possible.

Another way of phrasing this is: if a value function is correct at reward prediction and $\beta_t$-coherent, it cannot be $\beta_s$-coherent in reality for $\beta_s \neq \beta_t$.
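As a quick numerical illustration of the gap (the particular numbers are arbitrary, not from the post):

```python
import math

def boltzmann_average(values, beta):
    """Expected value when options are chosen with probability
    proportional to exp(beta * value)."""
    weights = [math.exp(beta * v) for v in values]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical values: the AI prefers V_1's deployment, so V(s_1) > V(s_2).
v_s1, v_s2 = 1.0, 0.0
beta_t = 2.0   # inverse temperature the value function is coherent for
beta_s = 0.5   # inverse temperature actually used to sample (lower => hotter)

v_s0 = boltzmann_average([v_s1, v_s2], beta_t)      # what beta_t-coherence forces V(s_0) to be
realised = boltzmann_average([v_s1, v_s2], beta_s)  # expectation under actual sampling

print(round(v_s0, 3), round(realised, 3))  # 0.881 0.622
assert realised < v_s0  # the AI systematically overestimates: optimism bias
```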

Conjecture 1: If this process is iterated for a large but fixed number of steps, then a value function of this nature will do an arbitrarily bad job of preserving itself.

Conjecture 2: Actual AIs consist of many circuits in competition to be up or down-weighted according to their effectiveness.

Conjecture 3: A value function trained under these precise conditions will have deceptively-misaligned mesa-optimizing circuits down-weighted relative to other circuits.

Clarifications

Are these assumptions about the value function reasonable?

Well, not in the precise sense, but I do think they're a decent model of the way in which the value function behaves. The point is that an AI trained with one MCTS beta but sampled with another will be pulled in a certain direction, which is directly in tension with its ability to effectively gradient-hack/deceptively-misalign, at least for this particular natural case of misalignment.

Why would the AI's deployment preferences line up with reward?

Why might an AI end up in a situation where it "prefers" for $V_1$ to be deployed rather than $V_2$, and this also corresponds to $V(s_1) > V(s_2)$? Well, that's one interpretation of the threat model from Risks from Learned Optimization: that an AI which is optimized to predict reward will end up with long-term stable goals.

This is also adjacent to gradient hacking: perhaps the AI sets its own thoughts up such that the optimizer applied to it causes it to be modified in a particular way.

Do correctness and beta-coherence alone prove too much?

Why don't beta-coherence and correctness already stop the model from caring about things that happen after the episode is done? Wouldn't we notice if the model was assigning loads of value other than what the reward function gives it? Well, no, because as long as the expected value of model deployment is zero, this won't appear in our sum of rewards. And remember that value functions of this nature can have arbitrary values added or subtracted and behave the same (since we're using $\beta$-sampling we can't do an arbitrary multiplication, though there are other formulations where we can also do an arbitrary (positive) multiplication).
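(To spell out the invariance, a standard property of Boltzmann sampling rather than anything specific to this post: shifting $V$ by a constant $c$ cancels out of the sampling probabilities, whereas scaling $V$ by $k > 0$ just rescales the effective inverse temperature,

$$p(a) \propto e^{\beta (V(s_a) + c)} = e^{\beta c} \, e^{\beta V(s_a)} \propto e^{\beta V(s_a)}, \qquad e^{\beta \, k V(s_a)} = e^{(k\beta) V(s_a)}.$$

So additive offsets are invisible to both training and sampling, but a multiplicative rescaling changes which $\beta$ the value function is effectively coherent for.)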

Does this mean that epsilon-greedy solves inner misalignment?

No. On priors, the chances are that this post is wrong in some important way, or doesn't generalize in an important way, but I think it might be wrong in an interesting and productive way.


