The Coaching Layer: Relational Intelligence for AI Safety

This post explores a shift in AI safety from passively detecting problems to actively intervening in the reasoning process. Through the story of the author's daughter getting lost, it illustrates a mistake AI systems make constantly: reasoning confidently on top of a flawed foundation. Interpretability and monitoring alone are not enough; we need systems that can intervene in reasoning in time, in a role much like a coach's. Such a "coaching layer" could guide a model by learning its failure modes, watching for early warning signals, and asking well-timed questions, achieving a kind of relational safety. The author closes with a dual-model architecture, a task model paired with a coaching model, and the possibility of future AI systems coaching one another.

🧐The author uses the story of her daughter getting lost to introduce a mistake AI systems often make while reasoning. Unable to find her grandparents, the daughter acted confidently on a wrong assumption and ended up lost.

💡AI systems make a similar mistake all the time: after a subgoal fails, they keep reasoning along the flawed path, ignore the latest information, and build elaborate inferences on top of wrong assumptions.

🤔Current AI safety discussions tend to focus on individual performance rather than the sustainability of the relationship. Detection and control alone are not enough; we need systems that can intervene in the reasoning process in time, in a role similar to a coach.

🤝The author proposes a "coaching layer" that guides a model by learning its failure modes, watching for early warning signals, and asking well-timed questions, much as a human coach who knows someone's weak spots intervenes at the right moment.

⚙️The post suggests a dual-model architecture: a task model focused on deep reasoning and a coaching model focused on detachment, pattern recognition and minimal intervention. The coach does not need to fully understand the task model; it only needs to recognise the patterns that call for intervention.

Published on June 14, 2025 12:14 PM GMT

Epistemic status: Design sketch. A continuation of my earlier work on trust, affordances and distributed architectures (yet to pen it down here). This post explores what safety could look like when we shift from detecting unsafe environments to relationally intervening in how reasoning unfolds.

--

Last weekend, my family was at Lalbagh Botanical Garden in Bangalore (modelled after the Kew Gardens in London). After walking through a crowded mango exhibition, my daughter offered to fetch her grandparents, who were walking slowly behind us. We stepped out of the exhibition hall and waited outside. Five minutes passed. Then ten. Then fifteen. The grandparents came out, but our daughter had vanished.

We searched the entire exhibition hall, the outer lawns, the vendor stalls. My husband started worrying about kidnapping. After thirty anxious minutes, we finally found her, perched calmly on a nearby hilltop, scanning the garden below like a baby hawk.

Her reasoning was elegant. She had gone looking for her grandparents at the street vendor where they’d been earlier, before entering the exhibition hall. When she didn’t find them, she climbed higher for a bird’s-eye view. Classic goal-seeking escalation.

But she was solving the wrong problem. 

She had not registered that they had entered the exhibition hall along with her. The context had shifted, but her assumptions hadn’t. From her point of view, she was helping by escalating creatively, and adapting. From ours, she was lost.

AI Systems Make This Mistake All the Time

The pattern is familiar. A model fails at one subgoal, then confidently escalates along the same flawed trajectory: ignoring recent updates, skipping assumption checks, and doubling down with increasingly sophisticated reasoning on the wrong foundation.

In ML, I learnt that this pattern is known as exposure bias. Models trained on clean data with teacher forcing start making errors when conditioned on their own generations, creating distribution shift that compounds over time. When models flood themselves with complex structural relationships, their effective capacity can get "choked", leading to exactly the kind of systematic drift where elegant reasoning serves the wrong goal.
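To make the compounding concrete, here is a toy simulation (my illustration, not from the post; the error rates are arbitrary assumptions): a generator whose per-step error probability grows with the number of its own errors already sitting in its context, compared against a teacher-forced run whose context stays clean.

```python
import random

random.seed(0)

BASE_ERROR = 0.02  # per-step error rate on a clean context (illustrative)
COMPOUND = 0.15    # how much each prior error raises the next step's rate (illustrative)

def step_error_prob(errors_in_context):
    """Toy model of distribution shift: the more of the model's own errors
    sit in its context, the likelier the next step is to go wrong."""
    return min(0.9, BASE_ERROR + COMPOUND * errors_in_context)

def rollout(steps, teacher_forcing):
    """One generation. Teacher forcing keeps the context clean, so the error
    rate stays at BASE_ERROR; free-running feeds errors back into the context."""
    errors_in_context = 0
    total_errors = 0
    for _ in range(steps):
        p = step_error_prob(0 if teacher_forcing else errors_in_context)
        if random.random() < p:
            total_errors += 1
            errors_in_context += 1
    return total_errors

TRIALS = 2000
for label, tf in [("teacher-forced", True), ("free-running", False)]:
    avg = sum(rollout(50, tf) for _ in range(TRIALS)) / TRIALS
    print(f"{label:>14}: average errors over 50 steps = {avg:.2f}")
```

The teacher-forced run averages roughly one error per fifty steps; the free-running one snowballs once the first error lands in its context, which is the drift described above.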

Later, I realised that even if I’d had perfect visibility into my daughter’s thought process (an interpretability trace, if you will), that alone wouldn’t have helped. What she needed was timely intervention, something that redirected her reasoning before it hardened into conviction. A gentle question rather than a correction: “Where are you going, and why?”

What's missing from most AI safety discussions is that we're optimising for individual performance rather than relational sustainability. It's like evaluating a marriage prospect on credentials rather than on how you grow together.

The Gap Between Detection and Intervention

This experience left me with a lingering question I haven’t seen addressed deeply in AI safety. Once we detect something concerning in a system, what next? Do we simply log it? Do we override it? Do we patch and hope it doesn’t happen again?

Or can we build systems that know when and how to intervene during reasoning, before bad behaviour becomes fully instantiated? Interpretability and monitoring give us windows into model behaviour, but windows don’t redirect. What we may need is a therapist or a coaching layer, a part of the system (or a partner system) whose job is to recognise emerging failure modes and course-correct.

What Might a Coaching Layer Do?

A coaching layer might not decode every thought or simulate full alignment. Instead, it might learn the task model’s recurring failure modes, watch for early warning signs of reasoning drift, and interject with well-timed questions or minimal nudges.

This isn’t just feedback. It’s relational pattern recognition. Think of how a good partner or coach intervenes, not by being smarter, but by knowing us well enough to interrupt at the right time, in the right way.

The coaching model maintains objectivity because it isn't embedded in the same context that is causing the drift. Like human coaches, who can help precisely because they stand outside our problem, a specialised coaching model could also be computationally cheaper than teaching every model to coach itself.

Architectural Implications

This approach points to a possible dual-model architecture: a task model focused on deep reasoning, paired with a coaching model focused on detachment, pattern recognition and minimal intervention.

The key is that the coach doesn’t have to fully understand or even inspect every internal state. It just needs to recognise when certain patterns require intervention. It builds up relational intelligence:

This model tends to get confident too early.
When it skips premise-checking, it escalates quickly.
When it receives affirming feedback, it hallucinates more often.
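What follows is a minimal sketch of that division of labour, in Python; the names (CoachModel, WarningPattern) and the trigger heuristics are illustrative assumptions of mine, not a design from the post. The coach sees only the task model’s intermediate reasoning, matches it against patterns it has learned to distrust, and responds with a question rather than an override.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class WarningPattern:
    """A learned association between a visible signal in the task model's
    reasoning trace and a question that tends to redirect it."""
    name: str
    trigger: Callable[[list], bool]  # inspects the reasoning steps seen so far
    question: str

@dataclass
class CoachModel:
    """Watches the task model's intermediate output, not its internals.
    It only needs to recognise when a known pattern calls for a nudge."""
    patterns: list = field(default_factory=list)

    def review(self, trace):
        for p in self.patterns:
            if p.trigger(trace):
                return p.question  # minimal intervention: a question, not an override
        return None

# Patterns of the kind a coach might learn from repeated interaction.
coach = CoachModel(patterns=[
    WarningPattern(
        name="premature confidence",
        trigger=lambda trace: any(w in s.lower() for s in trace[:2]
                                  for w in ("clearly", "certainly")),
        question="What evidence would change your mind at this point?",
    ),
    WarningPattern(
        name="skipped premise check",
        trigger=lambda trace: not any("assum" in s.lower() or "given" in s.lower()
                                      for s in trace),
        question="Which assumptions are you relying on, and are they still current?",
    ),
])

# In practice the trace would come from the task model's chain of thought or tool log.
trace = [
    "Clearly the grandparents are still at the vendor stall.",
    "Climb higher to scan the garden for them.",
]
nudge = coach.review(trace)
if nudge:
    print(f"Coach: {nudge}")
```

The point of the sketch is the interface: the coach never inspects weights or full internal state, only the visible trace, which is part of why it could stay much cheaper than the task model.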

Over time, the coaching model develops something like a map of its partner’s blind spots. Not through perfect transparency, but through interaction. This becomes especially important as we push models beyond their training context, as in the length-generalisation problem, where models trained on shorter sequences often fail systematically when extended to longer ones.

In my previous post on Cognitive Exhaustion and Engineered Trust, I argued that safety isn’t just about policing behaviour after the fact. It’s about designing environments that make good behaviour natural. This post extends that thinking from environment to relationship. If last time I focused on physical spaces that afford trust (Toyota vs. the gym), this time I’m wondering what it would take to design relationships that do the same.

Open Questions

Maybe the future of alignment isn't just interpretability or control. Maybe it's systems that know each other well enough to help.


