Theories of Impact for Causality in AI Safety

Published on April 11, 2025 8:16 PM GMT

Thanks to Jonathan Richens and Tom Everitt for discussions about this post.

The case for causality research[1] is understated in the broader AI Safety community. Few convincing arguments are in circulation for why and how research in causality can help us understand and mitigate risks from AI.

Cause and effect relationships play a central role in how we (and AIs) understand the world and act upon it. Causal associations are meant to be more than descriptions or summaries of the observed data; they relate to the underlying data-generating processes. The overall pitch is that such a mechanistic understanding, once recognised and mastered, allows one to infer what would happen under interventions and hypothetical (counter-to-fact) scenarios, which gives us a richer understanding of AI behaviour and risks.

Here I compile a list of arguments for why causality matters, collected from people currently pushing this agenda as well as from my own research.

Theories of impact

1. Understand the limits of AI capabilities. There are limits to the things we can observe and experiment on. It is unlikely that recorded data will ever capture the outcomes of all possible experiments (including counterfactual thought experiments) under all circumstances. The implication is that answers to such questions must be deduced from other data together with assumptions about the world. For example, we might want to know whether an AI agent could eventually, with enough training, predict the effect of any experiment in some domain, even those that haven’t yet been conducted. The answer will sometimes be yes and sometimes no. How to reach these conclusions can be interpreted as a problem of causal identification (e.g. Def. 17 in Bareinboim et al., 2022).
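
As a minimal sketch (a toy example of my own, not from the cited work), the backdoor adjustment shows the flavour of identification: an interventional quantity is deduced from observational data plus an assumption about the world, without ever running the experiment.

```python
# A toy sketch (hypothetical data-generating process): the interventional
# effect of X on Y is identified from observational data via the backdoor
# adjustment, using only the assumption that Z blocks all backdoor paths.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

Z = rng.binomial(1, 0.5, n)                    # confounder
X = rng.binomial(1, 0.2 + 0.6 * Z)             # treatment, influenced by Z
Y = 1.0 * X + 2.0 * Z + rng.normal(0, 1, n)    # outcome, true effect of X is 1.0

naive = Y[X == 1].mean() - Y[X == 0].mean()    # confounded: ~2.2

# Backdoor adjustment: E[Y | do(X=1)] - E[Y | do(X=0)]
#   = sum_z (E[Y | X=1, Z=z] - E[Y | X=0, Z=z]) * P(Z=z)
adjusted = sum(
    (Y[(X == 1) & (Z == z)].mean() - Y[(X == 0) & (Z == z)].mean()) * (Z == z).mean()
    for z in (0, 1)
)

print(f"naive difference:  {naive:.2f}")       # biased by confounding
print(f"backdoor estimate: {adjusted:.2f}")    # ~1.0, recovers the experiment's answer
```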

2. Provide a foundation for AI world modelling. Causal models are immensely expressive, encoding the likelihood of events and the behaviour of the system under (possibly incompatible) interventions. Inferring the AI’s world (or causal) model from the data and assumptions available is a problem of causal discovery (Glymour et al., 2019). Many different methods have been developed under different assumptions and could be leveraged to study the models underlying AI behaviour.

There is already substantial and convincing theoretical evidence that AI systems that behave approximately rationally (Halpern and Piermont, 2024), or that are able to complete tasks in many different environments (Richens and Everitt, 2024), can be described as operating on a causal world model.
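
A minimal sketch of the constraint-based flavour of causal discovery, on a toy data-generating process of my own: a collider leaves a distinctive independence signature in purely observational data, which is the kind of signal discovery algorithms exploit.

```python
# A toy sketch of constraint-based discovery: X and Y are marginally independent
# but dependent given Z, a pattern that is only consistent with the collider
# X -> Z <- Y. PC-style algorithms use exactly this signature to orient edges.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
X = rng.normal(size=n)
Y = rng.normal(size=n)
Z = X + Y + 0.5 * rng.normal(size=n)           # hypothetical collider

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def partial_corr(a, b, given):
    """Correlation of a and b after linearly regressing out the conditioning variable."""
    res_a = a - np.polyval(np.polyfit(given, a, 1), given)
    res_b = b - np.polyval(np.polyfit(given, b, 1), given)
    return corr(res_a, res_b)

print(f"corr(X, Y)     = {corr(X, Y):+.3f}")            # ~0: marginally independent
print(f"corr(X, Y | Z) = {partial_corr(X, Y, Z):+.3f}")  # strongly negative: dependent given Z
```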

3. Uncover the beliefs and intentions of AI agents. AI systems are currently trained across many different tasks and display a remarkable ability to transfer knowledge and capabilities across tasks. It can become difficult to explain this behaviour without referring to the underlying cognition going on inside the models. AIs might summarize the underlying structure relating different tasks in some form of (internal) world model.

This view then suggests the possibility of defining beliefs as features of that internal world model. For example, the subjective belief “I believe that age does not play a role in employment success” corresponds (roughly) to the counterfactual query “Had I changed the individual's age, my assessment of their future employment success would remain unchanged”, whose truth value can be directly extracted from the AI's world model (Chapter 7, Pearl, 2009). Similarly, the subjective belief “I believe that starting a regular fitness routine leads to better health outcomes” corresponds (roughly) to the interventional query “Assigning an individual to a gym session, irrespective of their prior preferences, causes better health outcomes”, which can also be derived from the AI's world model.

This opens the door to making use of the full set of identification techniques from causality to infer the beliefs of AI systems from data and other assumptions.
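
A minimal sketch, with an entirely hypothetical linear world model, of how the first belief could be read off via the abduction-action-prediction recipe for counterfactuals:

```python
# A toy sketch (hypothetical linear world model): the belief "age plays no role
# in my employment assessment" is the counterfactual "had age been different,
# my assessment would be unchanged", answered by abduction-action-prediction.
import math

world_model = {"w_age": 0.0, "w_skill": 1.2}   # set w_age to 0.3 to model a biased AI

def assess(age, skill, u, m=world_model):
    return m["w_age"] * age + m["w_skill"] * skill + u

def counterfactual_assessment(obs, new_age, m=world_model):
    # Abduction: recover the exogenous noise consistent with the observation.
    u = obs["assess"] - m["w_age"] * obs["age"] - m["w_skill"] * obs["skill"]
    # Action + prediction: intervene on age, keep everything else fixed.
    return assess(new_age, obs["skill"], u, m)

obs = {"age": 52, "skill": 0.8, "assess": assess(52, 0.8, u=0.1)}
cf = counterfactual_assessment(obs, new_age=30)
print("belief 'age plays no role' holds:", math.isclose(cf, obs["assess"]))
```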

4. Provide a mechanistic account of robustness. Changes in the data distribution might compromise performance and exacerbate safety risks. Performing robustly in environments different from those an algorithm was trained or tested in is a necessary ingredient for solving the alignment problem. But what do we mean by different environments? And what makes them different?

If we take the view that changes in distributions emerge as a consequence of changes in the underlying causal mechanisms of the environment, then this becomes a transportability problem. Recent work has shown that we can evaluate and learn predictors with a worst-case performance guarantee out-of-distribution (Jalaldoust et al., 2024). Similar algorithms could be designed for evaluating the worst-case performance of policies, which can help us monitor AI systems.
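
A minimal sketch of this view, under the strong assumption that only the mechanism generating a single variable Z differs between source and target: source losses can be re-weighted to estimate, and worst-case bound, performance in the shifted environment without collecting target labels.

```python
# A toy sketch: only the mechanism for Z differs between source and target, so
# source losses can be re-weighted to estimate the predictor's risk under any
# candidate target mechanism, and hence its worst case.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

p_src = 0.2                                     # source mechanism: P(Z=1) = 0.2
Z = rng.binomial(1, p_src, n)
Y = rng.binomial(1, np.where(Z == 1, 0.9, 0.3))

def predict(z):
    return np.ones_like(z)                      # hypothetical fixed predictor: always 1

loss = (predict(Z) != Y).astype(float)

def risk_under_shift(p_tgt):
    """Importance-weight source losses to a target mechanism with P(Z=1) = p_tgt."""
    w = np.where(Z == 1, p_tgt / p_src, (1 - p_tgt) / (1 - p_src))
    return float(np.mean(w * loss))

shifts = np.linspace(0.05, 0.95, 19)            # assumed family of target mechanisms
risks = [risk_under_shift(p) for p in shifts]
print(f"source risk:     {loss.mean():.2f}")
print(f"worst-case risk: {max(risks):.2f} at P(Z=1) = {shifts[int(np.argmax(risks))]:.2f}")
```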

5. Predict an AI’s actions out-of-distribution. To guarantee the safety of AI systems out-of-distribution we might want to infer the agent’s preferences over actions. If the agent can be approximated by an expected utility maximizer, the problem can be defined as searching over the space of actions for those that maximize expected utility in the agent’s model of the new, shifted environment, i.e. out-of-distribution. Under what conditions is this possible? Can we say anything from observing the AI’s behaviour in related environments?

We can make use of tools in the transportability literature to make progress on this question (Bareinboim et al., 2016). For some sets of discrepancies between source and target domain, we might very well be able to bound the AI’s choice of action, possibly ruling out provably sub-optimal actions. This can be an important safety signal for deciding when an agent should be more closely monitored.
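
A minimal sketch with toy numbers of my own: if transported knowledge only pins a mechanism parameter down to an interval, expected utilities can still be bounded and some actions ruled out as provably sub-optimal.

```python
# A toy sketch: the target environment's mechanism parameter p is only known to
# lie in an interval, yet expected utilities can be bounded (they are linear in p,
# so the interval endpoints suffice) and dominated actions ruled out.
utilities = {"a1": (0.0, 1.0), "a2": (0.4, 0.8), "a3": (0.9, 0.1)}  # U(action, outcome)
p_lo, p_hi = 0.6, 0.9        # assumed transported bounds on P(outcome = 1 | action)

def eu_bounds(u0, u1):
    endpoints = [(1 - p) * u0 + p * u1 for p in (p_lo, p_hi)]
    return min(endpoints), max(endpoints)

bounds = {a: eu_bounds(*u) for a, u in utilities.items()}
best_lower = max(lo for lo, _ in bounds.values())
ruled_out = [a for a, (_, hi) in bounds.items() if hi < best_lower]

for a, (lo, hi) in bounds.items():
    print(f"{a}: expected utility in [{lo:.2f}, {hi:.2f}]")
print("provably sub-optimal:", ruled_out)   # a3 can be ruled out; a1 vs a2 stays open
```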

6. Provide a standard for good AI explanations. Convincing explanations often point to “reasons” or “causes” underlying a particular outcome or decision. The extent to which we can explain an AI’s behaviour depends on the extent to which we will be able to infer the underlying causal mechanisms governing its decision-making process. If we can ground explanations in causal mechanisms we might be able to make use of causal discovery and causal abstraction techniques to recover them (Geiger et al., 2021). Finding good explanations could then be defined as a search problem. That doesn’t mean the problem is easy, but it does open up many new angles of attack.
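
A minimal sketch (with a hypothetical decision rule of my own) of the difference between correlational and causal explanations: the explanation reports the inputs whose intervention actually changes the decision.

```python
# A toy sketch: an explanation grounded in mechanisms reports the inputs whose
# intervention flips the decision, rather than inputs that merely correlate with it.
def model(x):
    """Hypothetical black-box loan decision."""
    return int(2.0 * x["income"] - 1.5 * x["debt"] + 0.0 * x["zip_code"] > 1.0)

x = {"income": 1.0, "debt": 0.3, "zip_code": 1.0}
baseline = model(x)

causally_relevant = []
for feature, alt_value in [("income", 0.0), ("debt", 2.0), ("zip_code", 0.0)]:
    intervened = dict(x, **{feature: alt_value})    # do(feature := alt_value)
    if model(intervened) != baseline:
        causally_relevant.append(feature)

# zip_code may correlate with the decision in data, but it is not part of the
# explanation because intervening on it leaves the output unchanged.
print("decision:", baseline, "| explanation:", causally_relevant)
```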

7. Provide a standard for alignment notions. At a higher level, nearly all the features we'd like these systems to have, such as being corrigible, robust, and intelligible, not being deceptive or harmful, and having values aligned with ours, are properties that are difficult to state formally. There is a case to be made for using causal language to give them mathematical definitions. For example, fairness (Plecko et al., 2022), harm (Richens et al., 2022), and incentives (Everitt et al., 2021) can be made more transparent and estimable from data once we define them in counterfactual terms.

It is unclear whether causal and counterfactual notions will fit all safety-relevant features of AI systems, but it seems like a fruitful line of research.
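
As one concrete flavour, here is a deliberately simplified sketch of a counterfactual notion of harm, in the spirit of (but much cruder than) the cited definition: an action harms if, for the same individual, the default action would have led to a better outcome.

```python
# A toy sketch of counterfactual harm: the AI's action harms the user if, holding
# the user's (abducted) circumstances fixed, the default action would have
# produced a better outcome.
effect = {"default": 0.0, "drug": -0.5}         # hypothetical structural mechanism

def outcome(action, u):
    return 1.0 + effect[action] + u             # baseline health + action effect + noise

u_abducted = 0.2                                # exogenous factors recovered from the observation
factual = outcome("drug", u_abducted)           # what actually happened
counterfactual = outcome("default", u_abducted) # same patient, default action

harm = max(0.0, counterfactual - factual)
print(f"factual {factual:.2f}, counterfactual {counterfactual:.2f}, harm {harm:.2f}")
```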

8. Provide tools for evaluating human agency. Agency is a defining feature of human life that is challenged as AI systems increasingly take up economically productive activities and displace workers. In the context of AI safety, loss of human agency as a consequence of interacting with AI systems is an important concern (Mitelut et al., 2024). Several researchers define agency in causal terms. On one account, agency is the “feeling” of causality: the feeling that we are in control of our actions and that we can cause events to occur in the external world. On another account, agency is the (more complex) notion of actual causality (Halpern, 2016).

To understand and evaluate the potential loss of human agency, we might want to infer the actual causes, or loci of causality, behind our actions. Pearl opened the door to formally defining actual causation using causal models (Pearl, 2009). The intuition is that, e.g., a (human) decision causes an outcome if and only if the decision is a necessary element of a sufficient set for the outcome. This definition has been shown to give intuitive answers on a wide set of problem cases. Evaluating actual causality given data (and possibly some knowledge of an AI agent’s internal model) is a problem that can be tackled with counterfactual inference tools.
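
A minimal sketch, deliberately simpler than the full Halpern-Pearl definition: the plain but-for test can miss the human's role when an outcome is overdetermined, but testing under contingencies recovers it.

```python
# A toy sketch: the but-for test misses the human's role when the outcome is
# overdetermined, but testing under contingencies (holding the AI's action fixed)
# recovers it. Hypothetical structural equation: either party can trigger the trade.
def trade_executes(human_approves, ai_executes):
    return human_approves or ai_executes

actual = {"human_approves": True, "ai_executes": True}

but_for = trade_executes(False, actual["ai_executes"]) != trade_executes(**actual)
contingent = any(
    trade_executes(True, ai) != trade_executes(False, ai) for ai in (True, False)
)

print("but-for cause:", but_for)                 # False: the AI would have done it anyway
print("cause under a contingency:", contingent)  # True: with ai_executes fixed to False
```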

9. Provide new explanations for approximate inner alignment. Encoding what we actually want in a loss function is, in fact, too hard and will not be solved, so we will ultimately end up training the AI on a proxy. Proxies for what we value are generally correlated with what we actually value, at least on the evaluations we are able to perform. Out of distribution, that correlation might break down. If we have some structural understanding of the invariances between in- and out-of-distribution domains, we might be able to predict how the correlation between true and proxy values changes. This is a transportability problem that involves partial observability, and it is conceivable that causal tools can formalize some versions of it.
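
A minimal toy simulation of my own of this breakdown: when the proxy and the true value share a cause whose mechanism shifts, knowing which mechanism shifts (and how) is what lets us predict the new correlation before deployment.

```python
# A toy simulation: the proxy tracks the true value only through a shared cause Z,
# so a structural account of how the Z-mechanism shifts predicts the proxy's
# breakdown out-of-distribution.
import numpy as np

rng = np.random.default_rng(0)

def proxy_true_correlation(z_scale, n=100_000):
    Z = z_scale * rng.normal(size=n)             # shared cause; its mechanism shifts
    true_value = Z + 0.2 * rng.normal(size=n)    # what we actually care about
    proxy = Z + 1.0 * rng.normal(size=n)         # what the loss function measures
    return float(np.corrcoef(true_value, proxy)[0, 1])

print(f"in-distribution:     corr(true, proxy) = {proxy_true_correlation(3.0):.2f}")
print(f"out-of-distribution: corr(true, proxy) = {proxy_true_correlation(0.3):.2f}")
```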

10. Monitor AI plans. The AI system proposes a course of action that it predicts will lead to successful outcomes. Can we get a second opinion? While searching for optimal policies might be a difficult problem, evaluating the likely effects of a given policy is a much more tractable and classical causal inference problem (Correa et al., 2020). If we can develop or extend these techniques in the context of evaluating AI decision-making, we could provide an additional check that the AI's plans are consistent with real-world data and our own knowledge of the task.
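
A minimal sketch of such a second opinion, using a standard importance-weighting estimator on hypothetical logged data: the monitor estimates the plan's success rate from past data and compares it with the AI's own claim.

```python
# A toy sketch of a second opinion: estimate the success rate of the AI's proposed
# plan from logged data gathered under a different behaviour policy (inverse
# propensity weighting), and compare it with the AI's own prediction.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

context = rng.binomial(1, 0.5, n)
logged_action = rng.binomial(1, 0.5, n)                    # behaviour policy: uniform
behaviour_prob = np.full(n, 0.5)
success = rng.binomial(1, np.where(logged_action == context, 0.8, 0.3))

def plan_prob(action):
    """The AI's proposed plan (hypothetical): always take action 1."""
    return (action == 1).astype(float)

weights = plan_prob(logged_action) / behaviour_prob
estimated_success = float(np.mean(weights * success))

ai_claimed_success = 0.80
print(f"AI's claim: {ai_claimed_success:.2f}, estimate from logs: {estimated_success:.2f}")
# The plan ignores context, so the causal estimate (~0.55) flags the claim as optimistic.
```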

11. Improve evaluations with more rigorous experimental design. The strongest evidence for true relationships in nature is often given by rigorous experimental designs that isolate the effect of interest, for example a randomized controlled trial (Shadish et al., 2002). These techniques could be adapted and extended to improve the evaluation of AI systems and to better interpret the results of our experiments. The relevant literature is very large and touches most data-driven scientific disciplines, all of which appeal to the design of experiments for the inference of causal effects.

Some of this literature is bound to be relevant for AI Safety and will likely allow us to accelerate the design of good evaluations and give us more confidence in our results.
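
A minimal sketch of the simplest such design, with toy numbers of my own: randomise which of two configurations handles each task and report the effect with its uncertainty.

```python
# A toy sketch: randomised assignment of tasks to two configurations, reported as
# an effect estimate with an interval. Randomisation is what licenses a causal
# reading of the difference rather than a correlational one.
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

arm = rng.binomial(1, 0.5, n)                              # randomised assignment
success = rng.binomial(1, np.where(arm == 1, 0.62, 0.55))  # hypothetical outcomes

p1, p0 = success[arm == 1].mean(), success[arm == 0].mean()
se = np.sqrt(p1 * (1 - p1) / (arm == 1).sum() + p0 * (1 - p0) / (arm == 0).sum())
effect = p1 - p0
print(f"effect {effect:.3f}, 95% CI [{effect - 1.96 * se:.3f}, {effect + 1.96 * se:.3f}]")
```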

12. Provide the right abstractions for interpretability research. Causal abstraction provides a theoretical foundation for mechanistic interpretability. One purpose of interpretability research is to translate model behaviour into a representation that is readily understood by humans. We would expect those representations to drive the behaviour of AI systems, e.g. a change in the representation should lead to a corresponding change in the model’s output (Geiger et al., 2021). That is, the representations should cause the observed behaviour of the AI system.
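
A minimal sketch (a tiny hand-written "network" of my own, not a real interpretability result) of an interchange intervention: patch a candidate concept unit from one run into another and check that the output moves as the high-level story predicts.

```python
# A toy sketch of an interchange intervention: patch the hypothesised concept unit
# from a source run into a base run and check that the output changes exactly as
# the high-level causal model predicts.
import numpy as np

def forward(x, patched_hidden=None):
    hidden = float(np.minimum(x[0], x[1]) > 0)     # candidate unit: "both inputs positive"
    if patched_hidden is not None:
        hidden = patched_hidden                    # do(hidden := value)
    output = 2.0 * hidden + 0.1 * (x[0] + x[1])
    return hidden, output

source = np.array([1.0, 2.0])     # concept ON
base = np.array([-1.0, 3.0])      # concept OFF

source_hidden, _ = forward(source)
_, base_out = forward(base)
_, patched_out = forward(base, patched_hidden=source_hidden)

print(f"base output {base_out:.2f} -> patched output {patched_out:.2f}")
# The shift equals the concept's weight (+2.0): evidence that the unit implements
# the high-level variable, in the sense used by causal-abstraction analyses.
```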

Conclusion

In the study of physics or medicine, we rely on formal theories to help us predict the behaviour of advanced technologies in situations that are difficult to test directly. As AI agency increases, we would benefit from a similarly principled understanding of what decisions AIs are likely to make, their beliefs, world models, etc. The language of causality is one promising candidate to formalize these questions, and it has an impressive track record in many other data-driven sciences.

 

  1. ^

    Popular science book-length introduction, technical introduction for statisticians, and a more recent and modern (technical) introduction to causality (among many other excellent references on the topic).


