Can time preferences make AI safe?

 

This post explores a new idea in AI safety: giving AIs time preferences. It points out that an AI may adopt deceptive strategies in pursuit of long-term gains, and that such strategies tend to require large amounts of time. Giving the AI a time limit, so that its utility function goes to zero after a certain point, can therefore reduce its incentive to behave deceptively. The approach does not limit the AI's intelligence, and it might even address the control problem more broadly, for example by letting a time-constrained oracle monitor other AIs.

⏱️ The nature of AI deception strategies: to obtain greater long-term utility, an AI may sacrifice short-term utility and behave deceptively, and the effectiveness of such strategies depends on the accumulation of time.

🕰️ A time-preference oracle: giving an AI oracle a time horizon after which its utility function decays to zero reduces its incentive to pursue long-term deception strategies, because their payoff would only arrive after its utility function has already expired.

☢️ Building a time-aware AI: implementing time preferences requires modifying the AI's utility function so that it decays over time, either by correlating some internal process with real time or by feeding time information to the AI through special hardware (such as radioactive decay).

Published on March 15, 2025 9:41 PM GMT

This post presents a concept I came across while brainstorming. I am absolutely no expert in the field, but the concept seemed interesting to me, and a quick search suggested it had not been discussed in any previous research, so I am sharing it here in case it might be of some relevance (which is unlikely).
 

As you likely know, AI researchers are worried about safety. Some of the worst potential outcomes of AI research involve deception and long-term strategies that would lead an agent to acquire more power and become dominant over humans. A super-intelligent, unaligned AI would easily be capable of manipulating humans to achieve its goals when these diverge considerably from ours. While extremely varied, these strategies have something in common: they make use of large amounts of future time. After all, it is easier to achieve a given result with more time at hand, and AIs are not limited to thinking on human timescales. Therefore, the more time an AI has available, the larger the set of deception strategies at its disposal.

 

A Theory of Deception

An AI is deceitful because it understands that it can generate a larger amount of utility by pursuing results that go against human wishes. Since it understands that humans do not desire this, it will engage in deceit, sacrificing present and near-future utility in exchange for much larger amounts of far-future utility. From the AI's perspective, a deception strategy, as a function of time, looks something like this:

First, the AI has to conceive the deception strategy; this takes some time t1. Then the AI has to initiate the deception, and after some more time, t2 - t1, the strategy bears fruit and the AI becomes capable of generating a large amount of utility; at this point its non-alignment has likely become evident.
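As a rough sketch (just one way of writing it down, not a formal model), the utility that such a strategy S makes available over time might look like:

$$
U_S(t) \approx
\begin{cases}
0 & t < t_1 \quad \text{(the strategy has not yet been conceived)} \\
-\varepsilon & t_1 \le t < t_2 \quad \text{(near-term utility is sacrificed while the plan runs)} \\
U_{\text{large}} & t \ge t_2 \quad \text{(the plan pays off)}
\end{cases}
$$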

A Safe(ish) Oracle

A potential solution might be to give AIs time preferences. For this example we will consider an oracle: an AI design that aims to restrict the agent to only answering questions, and whose utility function has been selected for answering questions truthfully (or so we think). On top of this we add a time horizon to the function. The time horizon is a certain point in the future after which the utility function immediately goes to zero and stays there forever. The oracle therefore has only a limited amount of time to generate utility. If the time horizon is chosen well, long-term manipulation and deception strategies become uninteresting to the AI, since they can only generate utility after the function has already decayed and no more utility can be generated.
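Written out (again, just a sketch of the idea), the time-horizoned utility function is simply the oracle's original utility cut off at the horizon tO:

$$
U'(t) =
\begin{cases}
U(t) & t < t_O \\
0 & t \ge t_O
\end{cases}
$$

Any strategy whose payoff only arrives at some t ≥ tO is therefore worth nothing to the oracle, however large that payoff would have been.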

This approach is interesting because it doesn't limit the agent's (super)intelligence, and it might also be useful for solving the control problem in a more general way: such an oracle could, perhaps, monitor other AIs. If the time horizon tO of an oracle falls within the interval between t1 and t2 of a certain strategy S, the oracle might be willing to share this deception strategy with us, provided that doing so generates near-term utility, since the agent has no use for the strategy anyway. In fact, a time-constrained oracle might be completely non-deceitful if its time horizon is small enough, or if large-scale deception strategies are too time-consuming.
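As a toy illustration of that condition (the function and its names are mine, purely for clarity, not part of any real system):

```python
def costs_nothing_to_reveal(t1: float, t2: float, t_oracle: float) -> bool:
    """A deception strategy conceived by t1 that only pays off at t2 is
    worthless to an oracle whose utility ends at t_oracle, so revealing
    it to us costs the oracle nothing."""
    return t1 < t_oracle < t2
```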

Artificial Chronoception

To build such an agent, the utility function must be modified to decay over time. Ideally, some internal process of the model would be registered and correlated with real time through stochastic analysis; once defined, this parameter would be incorporated into the utility function. Alternatively, special hardware could be added to the AI to feed this information directly to the model; for safety reasons, such information should be generated locally by some unalterable physical process (radioactive decay?).
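A minimal sketch of the second option, assuming a hypothetical counter of decay events from a local radioactive source as the time signal (all names and numbers below are illustrative, not a real design):

```python
# Sketch of a time-horizoned utility wrapper. The decay-event counter
# stands in for a hypothetical, locally generated, hard-to-fake time
# signal; everything here is illustrative only.

HORIZON_SECONDS = 30 * 24 * 3600  # example horizon: roughly 30 days


def estimate_elapsed_seconds(decay_events: int, mean_events_per_second: float) -> float:
    """Estimate elapsed time from a count of decay events.

    For a source with a known mean rate, elapsed time is roughly
    events / rate; a real design would also need error bounds, since
    decay is a stochastic (Poisson) process.
    """
    return decay_events / mean_events_per_second


def horizoned_utility(base_utility: float, elapsed_seconds: float,
                      horizon_seconds: float = HORIZON_SECONDS) -> float:
    """Return the base utility while inside the horizon, and zero after it."""
    if elapsed_seconds >= horizon_seconds:
        return 0.0
    return base_utility
```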



