Coaching AI: A Relational Approach to AI Safety

This article explores a shift in AI safety away from the traditional approach of detecting and correcting unsafe behaviour after the fact, towards intervening in an AI's reasoning in real time. Through the example of the author's daughter getting lost in a park, it illustrates how a system that fails to update its world model on new information can make confidently wrong decisions. The article proposes a "coach" model that runs alongside the main task model and steers its reasoning by asking questions at critical moments, emphasising dynamic, context-sensitive feedback as a way to improve safety and reliability.

🧐 Traditional AI safety approaches such as interpretability and alignment focus on shaping or designing systems from the outside, and may not be enough to address problems that arise during real-time reasoning.

🤔 The author proposes a "coach" model: a lightweight companion system that guides the AI's reasoning by asking questions at critical junctures rather than intervening directly or taking control.

💡 Compared with existing safety and alignment research, the coaching approach foregrounds a relational framing, emphasising dynamic, context-sensitive interaction similar to human guidance.

🛠️ A coach model could be trained in several ways, including reinforcement learning over historical failure patterns and lightweight supervision that detects mismatches between confidence and competence.

🚀 The approach may call for a dual-agent architecture: a task model focused on solving the primary problem, and a coach model focused on relational meta-awareness and lightweight intervention.

Published on June 16, 2025 3:33 PM GMT

Epistemic status: Design sketch. This post continues a broader enquiry into trust, affordances and distributed architectures. While those ideas are still in development, this post explores how real-time, relational interventions might offer a complementary path to AI safety. Specifically, it asks what safety could look like if we shifted from detecting unsafe behaviour after it happens to relationally intervening in reasoning as it unfolds.

--

A couple of weekends ago, my family was at Lalbagh Botanical Garden in Bangalore. After walking through a crowded mango exhibition, my 8-year-old offered to fetch her grandparents, who were walking slowly behind us. We waited outside the exhibition hall.

Five minutes passed. Then ten. Then fifteen. The grandparents emerged from the hall, but my daughter had vanished. After thirty anxious minutes, we found her perched calmly on a nearby hilltop, scanning the garden below like a baby hawk.

Her reasoning was logical. She remembered where her grandparents had last stopped (a street vendor) and went to look for them there. When she didn’t find them, she climbed up a hillock for a bird’s-eye view. Perfectly reasonable, except she had completely missed them entering the hall with her.

Her model of the world hadn’t updated with new context, so she pursued the wrong goal with increasing confidence. From her perspective, she was being helpful and clever. From ours, she was very much lost.

The Confident Pursuit of the Wrong Objective

This is a familiar pattern in AI: systems escalate confidently along a flawed trajectory. My daughter's problem wasn't a lack of reasoning; it was good reasoning on a bad foundation.

Large models exhibit this all the time. An LLM misinterprets a prompt and confidently generates pages of on-topic-but-wrong text. A recommendation engine over-indexes on ironic engagement. These systems demonstrate creativity, optimisation, and persistence, but in the service of goals that no longer reflect the world.

I later learnt that AI research frames this in terms of distributional shift or exposure bias: training on narrow or static contexts leads to brittleness in deployment. When feedback loops fail to re-anchor a system's assumptions, it just keeps going, confidently and wrongly.

Why Interpretability and Alignment May Not Be Enough

Afterward, I tried to understand where my daughter's reasoning went wrong. But I also realised that even perfect transparency into her thoughts wouldn't have helped in the moment. I could interpret her reasoning afterward, but I couldn't intervene in it as it unfolded. What she needed wasn't analysis. She needed a tap on the shoulder, and just a question (not a correction, mind you): "Where are you going, and why?"

This reflects a limitation in many current safety paradigms. Interpretability, formal alignment, and corrigibility all aim to shape systems from the outside, or through design-time constraints. But intelligent reasoning in a live context may still go off-track. 

This is like road trips with my husband. When Google Maps gets confused, I prefer to ask a local. He prefers to wait for the GPS to “figure it out.” Our current AI safety approaches often resemble the latter, trusting that the system will self-correct, even when it’s clearly drifting.

A Relational Approach to Intervention: Coaching Systems

What if intelligence, especially in open-ended environments, is inherently relational? Instead of aiming for fully self-aligned, monolithic systems, what if we designed AI architectures that are good at being coached?

We could introduce a lightweight companion model, a “coach”, designed not to supervise or override, but to intervene gently at critical reasoning junctures. This model wouldn’t need full interpretability or full control. Its job would be to monitor for known failure patterns (like confidence outpacing competence) and intervene with well-timed, well-phrased questions.

Why might this work? The coach retains perspective precisely because it isn't buried in the same optimisation loop; it sees the system from the outside, not from within. It may also be computationally cheaper to run a separate coach than to embed all this meta-cognition directly inside the primary system.
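
To make the intuition concrete, here is a minimal sketch of such a check, assuming the coach can see a summary of each reasoning step along with a self-reported confidence and some measure of how well recent observations support it. Every name and threshold below is illustrative, not a proposal for the actual mechanism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReasoningStep:
    content: str              # the task model's current working conclusion
    confidence: float         # self-reported confidence, in [0, 1]
    evidence_strength: float  # how well recent observations support it, in [0, 1]

def coach_review(step: ReasoningStep, gap_threshold: float = 0.3) -> Optional[str]:
    """Return a question when confidence outpaces competence, else None.

    The coach never overrides the task model; it only asks.
    """
    if step.confidence - step.evidence_strength > gap_threshold:
        return (f"You seem quite sure that '{step.content}'. "
                "What recent observation supports that, and what would change your mind?")
    return None

# Example: confidence has drifted ahead of the evidence, so the coach asks.
step = ReasoningStep("the grandparents are still at the street vendor",
                     confidence=0.9, evidence_strength=0.4)
print(coach_review(step))
```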

Comparison to Existing Paradigms

This idea overlaps with several existing safety and alignment research threads but offers a distinct relational frame:

In short, coaching aims to foreground situated, lightweight, real-time feedback, less through recursion, adversarial setups, or predefined rules, and more through the kind of dynamic, context-sensitive interactions that resemble guidance in human reasoning. I don’t claim this framing is sufficient or complete, but I believe it opens up a promising line of inquiry worth exploring.

Implementation Considerations

A coaching system might be trained via reinforcement learning over historical failure patterns, or via lightweight supervision that detects when stated confidence outpaces demonstrated competence.
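
As a toy illustration of the second idea, labels for that supervision signal might be mined from historical traces in which high confidence preceded a known failure. The Trace structure and the 0.8 threshold are assumptions made for the sketch, not parts of any existing pipeline.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trace:
    step_confidences: List[float]  # self-reported confidence at each reasoning step
    failed: bool                   # did the episode end in a known failure mode?

def label_intervention_points(trace: Trace, high_conf: float = 0.8) -> List[int]:
    """Label 1 where the coach arguably should have asked a question, 0 elsewhere."""
    if not trace.failed:
        return [0] * len(trace.step_confidences)
    # In failed episodes, treat confidently held steps as missed intervention points.
    return [1 if c >= high_conf else 0 for c in trace.step_confidences]

# These labels could supervise a small classifier that decides *when* to ask,
# without needing access to the task model's internal weights.
example = Trace(step_confidences=[0.4, 0.7, 0.9, 0.95], failed=True)
print(label_intervention_points(example))  # -> [0, 0, 1, 1]
```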

To function effectively, a coaching model would need to:

Sample interventions:
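
Purely as an illustration, such interventions might take the form of question templates keyed to failure classes. The "Where are you going, and why?" phrasing comes from the story above; the rest are my own assumed wordings rather than a canonical list.

```python
# Illustrative question templates keyed to failure classes mentioned in this post.
# The exact wordings are assumptions; the one design constraint carried over from
# the text is that interventions are questions, never corrections.
SAMPLE_INTERVENTIONS = {
    "overconfidence": "Where are you going, and why?",
    "tunnel_vision": "If your first assumption were wrong, what would you do differently?",
    "hallucination": "Which part of that answer traces back to something you actually observed?",
}
```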

Architectural Implications

This approach suggests a dual-agent architecture: a task model focused on solving the primary problem, and a coach model focused on relational meta-awareness and lightweight intervention.

The coach doesn’t need deep insight into every internal weight or hidden state. It simply needs to learn interaction patterns that correlate with drift, overconfidence, or tunnel vision. This can also scale well. We could have modular coaching units trained on classes of failures (hallucination, overfitting, tunnel vision) and paired dynamically with different systems.
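
Here is a minimal sketch of that dual-agent arrangement, under the assumption that coach modules only see the visible interaction history and return an optional question. The names and the toy tunnel-vision heuristic are illustrative, not a reference design.

```python
from typing import Callable, List, Optional

# A coach module maps the visible interaction history to an optional question.
CoachModule = Callable[[List[str]], Optional[str]]

def tunnel_vision_coach(history: List[str]) -> Optional[str]:
    # Toy heuristic: three identical steps in a row suggests tunnel vision.
    if len(history) >= 3 and len(set(history[-3:])) == 1:
        return "You've taken the same step three times. Is the original goal still the right one?"
    return None

def run_with_coaches(task_step: Callable[[Optional[str]], str],
                     coaches: List[CoachModule],
                     max_steps: int = 10) -> List[str]:
    history: List[str] = []
    question: Optional[str] = None
    for _ in range(max_steps):
        history.append(task_step(question))  # the task model acts, seeing any pending question
        # The first coach with something to ask gets to ask; most steps pass silently.
        question = next((q for coach in coaches if (q := coach(history)) is not None), None)
    return history
```

In practice, task_step would wrap the primary model's next-step call; it is left abstract here so the interaction pattern, rather than any particular model, stays in view.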

Of course, implementing such a setup raises significant technical questions: how do task and coach models communicate reliably? What information is shared, and how is it interpreted? Solving for communication protocols, representational formats, and trust calibration is nontrivial. I plan to explore some of these questions more concretely in a follow-up post on Distributed Neural Architecture (DNA).
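
To make the communication question slightly more concrete, one could imagine a small typed message schema crossing the task/coach boundary. The fields below are assumptions, meant only to show the kind of information that might need to be shared, not a proposed protocol.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskUpdate:
    """Sent from the task model to the coach after each reasoning step."""
    step_summary: str                 # natural-language summary of the step
    stated_confidence: float          # the task model's own confidence estimate
    recent_observations: List[str] = field(default_factory=list)

@dataclass
class CoachMessage:
    """Sent from the coach back to the task model, if it chooses to speak."""
    question: str                     # always a question, never an instruction
    trigger: str                      # which failure pattern prompted it, e.g. "overconfidence"
```

The constraint that a CoachMessage always carries a question, never an instruction, is carried over from the coaching framing above; trust calibration and representational formats would sit on top of something like this and remain the genuinely hard part.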

Why This Matters

The future of AI safety likely involves many layers, including interpretability, adversarial robustness, and human feedback. But these will not always suffice, especially in long-horizon or high-stakes domains where systems must reason through novel or ambiguous contexts. 

The core insight here is that complex reasoning systems will inevitably get stuck. The key is not to eliminate error entirely, but to recognise when we might be wrong and to build the infrastructure that makes course correction possible. My daughter didn't need to be smarter. She needed a real-time nudge to correct course.

In a world of increasingly autonomous systems, perhaps safety won’t come from more constraints or better rewards, but from designing architectures that allow systems to be interrupted, questioned, and redirected at just the right moment.

--

Open Questions

--

If coaching offers a micro-level approach to safety through localised, relational intervention, DNA begins to sketch what a system-level architecture might look like, one where such interactions can be compositional, plural, and emergent. I don’t yet know whether this framework is tractable or sufficient, but I believe it’s worth exploring further. In a follow-up post, I will attempt to flesh out the idea of Distributed Neural Architecture (DNA), a modular, decentralised approach to building systems that reason not alone, but in interaction.



