Epistemic status: Design sketch. A continuation of my earlier work on trust, affordances, and distributed architectures (not yet written up here). This post explores what safety could look like when we shift from detecting unsafe environments to intervening relationally in how reasoning unfolds.
---
Last weekend, my family was at Lalbagh Botanical Garden in Bangalore (modelled after Kew Gardens in London). After walking through a crowded mango exhibition, my daughter offered to fetch her grandparents, who were walking slowly behind us. We stepped out of the exhibition hall and waited outside. Five minutes passed. Then ten. Then fifteen. The grandparents came out, but our daughter had vanished.
We searched the entire exhibition hall, the outer lawns, the vendor stalls. My husband started worrying about kidnapping. After thirty anxious minutes, we finally found her, perched calmly on a nearby hilltop, scanning the garden below like a baby hawk.
Her reasoning was elegant. She had gone looking for her grandparents at the street vendor where they’d been earlier, before entering the exhibition hall. When she didn’t find them, she climbed higher for a bird’s-eye view. Classic goal-seeking escalation.
But she was solving the wrong problem.
She had not registered that they had entered the exhibition hall along with her. The context had shifted, but her assumptions hadn't. From her point of view, she was helping: escalating creatively, adapting. From ours, she was lost.
AI Systems Make This Mistake All the Time
The pattern is familiar. A model fails at one subgoal, then confidently escalates along the same flawed trajectory: ignoring recent updates, skipping assumption checks, and doubling down with increasingly sophisticated reasoning on the wrong foundation.
In ML, I learnt that this pattern is known as exposure bias. Models trained on clean data with teacher forcing start making errors once they are conditioned on their own generations, creating a distribution shift that compounds over time. When models flood themselves with complex structural relationships, their effective capacity can get "choked", leading to exactly the kind of systematic drift where elegant reasoning serves the wrong goal.
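To make the compounding concrete, here is a toy numerical sketch (my own illustration, not a real training setup): a "model" that has learnt a sequence's dynamics with a tiny systematic bias stays roughly accurate when conditioned on ground truth at every step, but drifts steadily once it is conditioned on its own outputs.

```python
# Toy illustration of exposure bias. The "model" has learnt the dynamics
# x_{t+1} = x_t + 1 with a small systematic bias. Under teacher forcing it
# conditions on the true previous value, so its error stays constant;
# free-running, it conditions on its own last prediction, so the bias compounds.

TRUE_STEP = 1.0
LEARNED_STEP = 1.05   # the model's slightly-wrong estimate of the dynamics

def model_next(prev):
    return prev + LEARNED_STEP

x_true, x_free = 0.0, 0.0
for t in range(1, 11):
    x_prev_true = x_true
    x_true += TRUE_STEP

    teacher_err = abs(model_next(x_prev_true) - x_true)  # stays at 0.05
    x_free = model_next(x_free)                          # feeds on its own output
    free_err = abs(x_free - x_true)                      # grows to 0.50 by t=10

    print(f"t={t:2d}  teacher-forced error={teacher_err:.2f}  "
          f"free-running error={free_err:.2f}")
```

That is the qualitative picture behind "distribution shift that compounds": each small deviation puts the model in a state it handles slightly less well, which produces the next, slightly larger deviation.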
Later, I realised that even if I'd had perfect visibility into my daughter's thought process (an interpretability trace, if you will), that alone wouldn't have helped. What she needed was timely intervention, something that redirected her reasoning before it hardened into conviction. A gentle question rather than a correction: “Where are you going, and why?”
What's missing from most AI safety discussions is that we're optimising for individual performance rather than relational sustainability. Like evaluating marriage prospects based on credentials rather than on how you grow together.
The Gap Between Detection and Intervention
This experience left me with a lingering question I haven’t seen addressed deeply in AI safety. Once we detect something concerning in a system, what next? Do we simply log it? Do we override it? Do we patch and hope it doesn’t happen again?
Or can we build systems that know when and how to intervene during reasoning, before bad behaviour becomes fully instantiated? Interpretability and monitoring give us windows into model behaviour, but windows don’t redirect. What we may need is a therapist or a coaching layer, a part of the system (or a partner system) whose job is to recognise emerging failure modes and course-correct.
What Might a Coaching Layer Do?
A coaching layer might not decode every thought or simulate full alignment. Instead, it might:
- Learn failure patterns of a model (e.g., confidence + escalation = tunnel vision)
- Watch for early warning signs of context drift or unchecked assumptions
- Intervene with simple, well-timed questions that nudge the model back on track
This isn’t just feedback. It’s relational pattern recognition. Think of how a good partner or coach intervenes, not by being smarter, but by knowing us well enough to interrupt at the right time, in the right way.
The coaching model maintains objectivity precisely because it's not embedded in the context that is causing the drift. Like a human coach, it helps because it stands outside the problem; and a specialised coaching model could be computationally cheaper than teaching every model to coach itself.
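As a minimal sketch of what that could look like (every signal name below is an assumption of mine, not an implemented system): the coach watches only coarse features of the reasoning trace, and its output is a question to inject, not an override.

```python
from dataclasses import dataclass
from typing import List, Optional

# Minimal sketch of relational pattern recognition. The coach never inspects
# internal activations; it watches coarse, trace-level signals and, when a
# learned warning pattern appears, responds with a question rather than a fix.

@dataclass
class ReasoningStep:
    text: str
    confidence: float            # model's self-reported confidence, 0..1
    revisits_assumptions: bool   # did this step re-check any earlier premise?

def coach_check(trace: List[ReasoningStep]) -> Optional[str]:
    """Return a nudge question if an early-warning pattern appears, else None."""
    if len(trace) < 3:
        return None
    recent = trace[-3:]
    rising_confidence = all(
        later.confidence >= earlier.confidence
        for earlier, later in zip(recent, recent[1:])
    )
    no_premise_checks = not any(step.revisits_assumptions for step in recent)
    # "Confidence + escalation = tunnel vision": intervene, gently, with a question.
    if rising_confidence and no_premise_checks:
        return "Where are you going, and why? Has anything changed since you set this goal?"
    return None
```

The particular heuristic doesn't matter; what matters is that the intervention is timed by a pattern the coach has learnt to associate with drift in this specific partner.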
Architectural Implications
This approach points to a possible dual-model architecture:
- Task Model: Focused on deep, goal-oriented reasoning.
- Coach Model: Optimised for detachment, pattern recognition, and minimal viable interventions.
The key is that the coach doesn’t have to fully understand or even inspect every internal state. It just needs to recognise when certain patterns require intervention. It builds up relational intelligence:
- This model tends to get confident too early.
- When it skips premise-checking, it escalates quickly.
- When it receives affirming feedback, it hallucinates more often.
Over time, the coaching model develops something like a map of its partner's blind spots, not through perfect transparency but through interaction. This becomes especially important as we push models beyond their training context, as with the length generalisation problem, where models trained on shorter sequences often fail systematically when extended to longer ones.
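A rough sketch of the loop under these assumptions (the `task_model` / `coach_model` interfaces below are hypothetical stand-ins for two model calls, not a real API): the coach reviews drafts, records which failure patterns keep recurring, and feeds its accumulated profile of this particular partner back into each review.

```python
from collections import Counter

# Sketch of the dual-model loop. `task_model.generate` and `coach_model.review`
# are hypothetical interfaces; `review` is assumed to return an object with
# .ok, .pattern, and .question fields. The coach builds a rough map of this
# partner's blind spots purely from interaction, not from introspection.

class CoachingLoop:
    def __init__(self, task_model, coach_model):
        self.task_model = task_model
        self.coach_model = coach_model
        self.blind_spots = Counter()   # e.g. {"early_confidence": 7, ...}

    def run(self, prompt: str, max_rounds: int = 4) -> str:
        context = prompt
        draft = ""
        for _ in range(max_rounds):
            draft = self.task_model.generate(context)
            verdict = self.coach_model.review(
                prompt=prompt,
                draft=draft,
                known_blind_spots=dict(self.blind_spots),
            )
            if verdict.ok:
                break
            # Record the recurring pattern and feed the nudge back as a
            # question, not a rewrite: the task model keeps ownership of
            # the reasoning.
            self.blind_spots[verdict.pattern] += 1
            context = f"{context}\n\nCoach question: {verdict.question}"
        return draft
```

The design choice worth noticing is that the coach's "knowledge" of its partner lives in that running counter, built up across interactions, rather than in any privileged access to the task model's internals.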
In my previous post on Cognitive Exhaustion and Engineered Trust, I argued that safety isn’t just about policing behaviour after the fact. It’s about designing environments that make good behaviour natural. This post extends that thinking from environment to relationship. If last time I focused on physical spaces that afford trust (Toyota vs. the gym), this time I’m wondering:
- What kinds of relational architectures afford correction?
- What kinds of interactive patterns make early intervention likely?
- And how do we design systems that can coach each other, not just interpret, not just flag, but truly guide by enabling reflection?
Open Questions
- What failure modes have you observed in LLMs or agents that seem ripe for coaching rather than control?
- How might two systems develop mutual awareness of each other's blind spots?
- What would it take for models to learn to coach one another, relationally, adaptively, imperfectly, but effectively?
Maybe the future of alignment isn't just interpretability or control. Maybe it's systems that know each other well enough to help.