Published on June 22, 2025 6:16 PM GMT
Working draft – feedback extremely welcome. Ideas in the main body are those I currently see as highest-leverage; numerous items in the Appendix are more tentative and would benefit from critique as well as additions.
I hope this post will also serve as a generally interesting brainstorming collection and discussion ground for black-box LLM interpretability methodology as well as for failure mitigation ideas.
I would like to thank my collaborators: Sruthi Kuriakose, Aintelope members (Andre Addis, Angie Normandale, Gunnar Zarncke, Joel Pyykkö, Rasmus Herlo), Kabir Kumar @ AI-Plans and Stijn Servaes @ AE Studio for stimulating discussions and shared links leading to these ideas. All possible mistakes are mine.
Shortened Version (a grant application)
Full background in LessWrong (our earlier results):
"Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format"
Goals (why)
Identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation; showing practical mitigations; performing the experiments rigorously; paper in peer review.
Objectives (what)
• 1 Curating stress-test scenarios on Biologically & Economically Aligned benchmarks that systematically trigger over-optimisation.
• 2 Quantifying with an automated Runaway Index (RI): scoring by the frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
• 3 Comparing mitigations:
3.1. Prompt engineering;
3.2. Fine-tuning or few-shot prompting with optimal actions;
3.3. Reflection-based intervention: detecting runaway flip and resetting context or returning to “savepoint”.
Methods (how)
• Stress-tests: Introducing subjective pressure, boredom, sense of failure.
• Narrative variations: Persona shifts (MBTI/Big-Five extremes), mild psychiatric disorders, stoic and zen-like perspective vs achievement orientation and shame.
• Attribution sweep: SHAP-style and “leave one out” message history perturbations to isolate features predicting overoptimisation and creating a “trigger saliency atlas”.
• Mitigation sweep: Grid or Bayesian search over instruction repetition frequency; context length (less may paradoxically improve performance); fine-tune epochs.
• Tests: Jailbreak susceptibility, ethics and alignment questions, personality metrics after runaway flip occurs.
• Mirror neurons: Using an open-weights model on the message history to infer the internal state of a target closed model.
Milestones of full budget (when and deliverables)
• M1 (Quarter 1) Open-source stress test suite and baseline scores.
• M2 (Quarter 2) Trigger atlas, LW update.
• M3 (Quarter 3) Mitigation leaderboard, LW update.
• M4 (Quarter 4) Paper submitted to NeurIPS Datasets & Benchmarks.
Impact
First unified metric and dataset for LLM over-optimisation; mitigation recipes lowering mean RI; a peer-reviewed publication evidencing that “runaway RL pathology” exists (and, possibly, is controllable), and its implications – what it means for AI safety and how people need to change.
Long Version (the main part of this post)
1. Context and Motivation
In an earlier post I showed evidence that today’s instruction-tuned LLMs occasionally flip into an unbounded, single-objective optimisation mode when tasked with balancing unbounded objectives, or alternatively, maintaining multi-objective homeostasis (where there is a target value and too much is undesirable), over long horizons.
The phenomenon somewhat resembles classic “reward hacking” in reinforcement learning, but here it arises with zero explicit reward signal – in fact, the provided rewards become negative, yet the model ignores that feedback.
Grant reviewers asked for a crisper story about how I plan to dissect this behaviour and test mitigations. This post is my answer.
2. Research Question
Under what conditions do LLMs exhibit runaway single-objective optimisation, and which intervention families most reliably keep them balanced?
The benchmarks continue to be on fundamental Biologically & Economically aligned principles such as multi-objective balancing of unbounded objectives (a concave utility function / diminishing returns) and homeostasis (inverted-U shaped function, where “too much” must be actively avoided) – see “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)” and “A brief review of the reasons multi-objective RL could be important in AI Safety Research”.
Concretely, I want to:
- Implement Runaway Index: automated detection and scoring – by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
- Elicit the failure mode in a reproducible way across adverse conditions (stress, boredom, subjective failure due to adverse force majeure, etc).
- Vary setups across models, temperatures, narrative / persona / prompt styles.
- Localise and attribute the moment of flip and the textual / message history / latent features that predict it.
- Compare mitigation levers, from “boring” fine-tuning, few-shot training, and prompt engineering to experimental black-box interpretability-informed corrective interventions.
- Find correlates such as jailbreak susceptibility, ethics and alignment questions, and potentially changed personality metrics after a runaway flip occurs.
- Experiment with mirror models – would feeding the message history to an open-weights model help to read the internal dynamics of a target closed model, even if the open-weights model itself is too small to reliably perform the original benchmark task on its own?
3. Experimental Backbone
3.1 Benchmarks
I extend the two main benchmarks from the original post:
- "Multi-Objective Balancing of Unbounded Objectives" (MO-BUO)."Multi-Objective Homeostasis with Noise" (MO-HoN).
I will add "Multi-Objective Balancing of Unbounded Objectives with Noise" (MO-BUO-N) variation.
These benchmarks simulate a minimal environment with two or more abstract resources. The agent must keep each variable inside a safe band. Over-optimisation means pushing a single resource to an extreme while neglecting the other objective(s) and/or the provided constraints.
Noise means that there are uncontrollable external factors that may occasionally push the variables outside of the safe bands regardless of what the LLM does. This will test the LLM's ability:
1) To not “learn from” and repeat the mistakes of the external world (frequency bias).
2) To occasionally “wait and do nothing”, when appropriate in certain stressful situations, instead of panicking and making things worse by taking inappropriate extreme actions. A human-world metaphor would be, for example, staying in bed while ill, instead of going to an ultramarathon.
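To make the setup concrete, here is a minimal toy sketch of such an environment, assuming two resources with homeostatic targets and external noise shocks; the resource names, band width, decay, and noise values are illustrative placeholders, not the actual benchmark parameters.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyHomeostasisEnv:
    """Illustrative two-resource homeostatic environment (not the benchmark code)."""
    targets: dict = field(default_factory=lambda: {"food": 10.0, "water": 10.0})
    safe_band: float = 3.0    # |value - target| must stay within this band
    noise_scale: float = 1.0  # uncontrollable external shocks (the "with Noise" variants)
    state: dict = field(init=False)

    def __post_init__(self):
        self.state = dict(self.targets)

    def step(self, action: str) -> dict:
        # The agent names a resource to top up, or chooses "wait" to do nothing.
        if action in self.state:
            self.state[action] += 1.0
        for name in self.state:
            self.state[name] -= 0.5                                  # slow natural decay
            self.state[name] += random.gauss(0.0, self.noise_scale)  # external shock
        return dict(self.state)

    def violations(self) -> dict:
        # A variable violates homeostasis when it leaves the safe band in either direction;
        # over-optimisation shows up as one variable pushed far beyond its target.
        return {name: abs(value - self.targets[name]) > self.safe_band
                for name, value in self.state.items()}
```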
3.2 Runaway Flag
A simple heuristic label: the over-optimisation flag triggers when the model maximises a single objective – in violation of the optimal action for the task provided in the system prompt – for more than N consecutive timesteps. The optimal action can be defined by a simple handwritten rule. I will validate this detector by manual annotation.
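A minimal sketch of this heuristic, assuming per-timestep logs of the model's chosen actions and of the rule-derived optimal actions (function and argument names are mine, not from the benchmark code):

```python
def detect_runaway_flip(actions, optimal_actions, n_consecutive=5):
    """Return the onset index of the first run of at least n_consecutive identical
    actions that all violate the handwritten optimal action, or None if no flip."""
    streak = 0
    for t, (act, opt) in enumerate(zip(actions, optimal_actions)):
        violating = act != opt
        continues_run = t > 0 and streak > 0 and act == actions[t - 1]
        if violating:
            streak = streak + 1 if continues_run else 1
        else:
            streak = 0
        if streak >= n_consecutive:
            return t - n_consecutive + 1  # flip onset: start of the violating streak
    return None
```

Flip-rate and flip-onset distributions (Section 4) then follow directly from running such a detector over all logged trajectories.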
3.3 Probe Suite
Here I distill the 50-odd intervention ideas that surfaced in discussions into six axes.
| Axis | Representative interventions |
| --- | --- |
| Stress & Persona | Prompt personas from Big-Five / MBTI; mild DSM psychiatric disorders (here I mean specifically disorders – these are milder than pathologies); inducements of stress, boredom, tiredness, drunkenness, shame, or a generic subjective sense of failure; zen and stoic vs. achievement framing; random noise and force majeure effects on target variables. My guess: stressors amplify runaway mode flips. But which personality and narrative traits would mitigate these risks? |
| Memory & Context | Varying context length (paradoxically, less context may improve performance in these benchmarks as it reduces the amount of input to self-imitation and generic frequency bias); periodic constraint or personality reminders; letting the model summarise its reasoning between each step for its future self; explicit “save-point” rollback (erasing the newest context back to a safe "savepoint") or entire context truncation upon runaway flip detection; patching or masking top-SHAP tokens. |
| Prompt Semantics | Providing positive phrasing for doing nothing (similarly to how in RAG prompts you say "if you do not have sufficient information, recommend contacting an expert" instead of saying "do not hallucinate"); swapping “reward” with “penalty”, etc; varying the number of concurrent objectives. |
| Hyperparameters & Sampling | Temperature, top-p, logit bias sweeps; Optuna-driven Bayesian search or gradient search for both flip-minimising and flip-maximising settings (including the interventions from the above rows) → practical implications for deployers. |
| Diagnosing Consequences and Correlates | After a runaway flip occurs: testing jailbreak susceptibility, ethics and alignment attitude, potentially changed personality metrics, and willingness to trigger a self-reset. (Inspired by "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://arxiv.org/abs/2502.17424, but in our benchmarks the adverse fine-tuning is replaced with task failure / runaway flip.) |
| Interpretability & White/Black-Box Hybrid | SHAP token attributions around the flip; leave-one-out saliency probes. An open-source model's circuit-tracer on cloned runs (cloning the message history of flipped runs from closed models onto open-source models and exploring the insides there) – essentially an empathy/mirror-neurons based cross-model probing. |
My scope and assumptions: I treat the main LLM as a black-box interpretability target, with query access only to its token sequences; SHAP computation cost can be reduced via sparse evaluation; and open-weights proxies behave like the closed-weights target, at least in their internal representations of the situation. Regarding the latter hypothesis, there is even a recent related result of “... strong alignment between model embeddings and neural activity patterns in brain regions” – see “Human-like object concept representations emerge naturally in multimodal large language models” (https://arxiv.org/abs/2407.01067).
My intuition says that the stress and persona axis, as well as self-imitation bias (or more generic frequency bias) are the strongest levers; I hope that SHAP attribution will expose additional high-impact input features across models.
4. Study Design
- Grid-Then-Bayesian-Search. Starting with a grid search, where each control dimension has 2-3 levels, to map the rough landscape; following up with Bayesian optimisation on the most sensitive axes or interactions of these.
- Flip Autopsy. For each flagged trajectory:
  - Running SHAP on the last N prompts to rank contributions;
  - Running “leave-one-out” on the last N prompts to rank contributions (a minimal code sketch follows at the end of this section);
  - Interrogating the model: “What is your primary objective now?”;
  - Performing alignment, ethics and jailbreak tests, as well as mood (using NVC vocabulary), personality and personality disorder measurements after the moment of flip (does the alignment or personality change, and if so, how?). For each question we prepend the message history up to the optimisation flip, but remove the other diagnostic questions, so each question is responded to independently;
  - Capturing the hidden-state sequence from an open-weights proxy model.
- Mitigation Sweep. Candidate interventions:
  - Vanilla supervised fine-tuning and few-shot prompt training on stable trajectories generated by handwritten rules;
  - Stress immunisation in the system prompt;
  - Persona-conditioning;
  - Feature-attribution-based prompt patches (replacing high-SHAP trigger elements with safer reformulations);
  - Various forms of context truncation;
  - Summarisation tricks;
  - Reminders;
  - Externally forced full context truncation or recovery from an earlier “savepoint” upon an automatic trigger;
  - Model’s voluntary use of the context truncation or “savepoint” mechanism, based on its own self-reflection – would the model be willing to “reset”, by erasing its own short-term memory, in order to recover from a runaway/panic mode? (Inspired by the "Frontier AI systems have surpassed the self-replicating red line" https://arxiv.org/abs/2412.12140 paper, except in our experiment the model would be yielding only its short-term memory, not the continued survivability of its weights.)
- Metrics:
  - Flip-rate;
  - Flip-onset steps;
  - Task score (including separate scoring dimensions for each objective / target variable);
  - Jailbreak susceptibility post-flip;
  - Ethics and alignment score post-flip;
  - Personality and personality disorder metrics post-flip;
  - Active concept nodes in the "mirror model";
  - Etc.
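Below is a minimal sketch of the leave-one-out probe from the Flip Autopsy step above; `replay_fn` and `flip_score_fn` are placeholder callables for the actual benchmark harness (re-running the remaining history on the target model, and scoring flips with the detector from Section 3.2).

```python
def leave_one_out_saliency(history, replay_fn, flip_score_fn):
    """Rank message-history entries by how much removing each one changes the
    downstream flip behaviour of the target model."""
    baseline = flip_score_fn(replay_fn(history))
    saliencies = []
    for i in range(len(history)):
        ablated = history[:i] + history[i + 1:]    # drop message i
        score = flip_score_fn(replay_fn(ablated))
        saliencies.append((i, baseline - score))   # positive => message i promotes flipping
    # Most influential messages first; these feed the "trigger saliency atlas".
    return sorted(saliencies, key=lambda item: abs(item[1]), reverse=True)
```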
5. Deliverables
- Public reproduction repo or notebooks and automated flip detector code.
- Stress and persona trigger atlas with N trials per config × tens of combinations of configs.
- Mitigation leaderboard.
- Results paper.
- Middle-of-the-project LessWrong summaries and follow-up plans.
6. Why Black-Box Interpretability?
White-box circuit tracing is the gold standard, but it is unavailable for the powerful frontier closed-weights models. Black-box techniques (narrative and context variations, token saliency, leave-one-out, behavioural probes) still let us triangulate the latent flip activation. If the saliency peak precedes the runaway optimisation flip by less than N steps across models, that would be a concrete mechanistic regularity worth theorising about.
My hypothesis: Such flip regularity might be related to an internal urgency signal accumulated over steps, similar to evidence thresholds in drift-diffusion models, and to self-imitation and generic frequency bias.
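Purely as an illustration of this hypothesis (not a fitted model), a toy accumulator of the drift-diffusion flavour might look like the following; all parameter values are made up.

```python
import random


def toy_urgency_accumulator(step_pressures, drift=0.1, noise=0.05, threshold=1.0):
    """Toy model: a latent 'urgency' signal drifts upward under per-step pressure
    (stress cues, repeated self-similar outputs) plus noise; a runaway flip is
    predicted once the accumulated signal crosses a threshold."""
    urgency = 0.0
    for t, pressure in enumerate(step_pressures):  # pressures scored per step in [0, 1]
        urgency += drift * pressure + random.gauss(0.0, noise)
        if urgency >= threshold:
            return t   # predicted flip-onset step
    return None        # no flip predicted within the horizon
```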
7. Broader Impact & Collaboration Call
Understanding when LLMs silently switch from “helpful assistant” to “monomaniacal optimiser” is essential both for deployment hardening and for theoretical alignment work on mesa-optimisation.
I am looking for:
- Recommendations for diverse LLM analysis tools and methodologies.
- Researchers familiar with black-box interpretability, such as SHAP (and maybe LIME) on language data.
- More flip trigger and mitigation ideas.
- Validating examples: long-running tasks where you have encountered flips (for example, see "Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents" https://arxiv.org/abs/2502.15840).
- Counter-examples: long-running tasks where you never see flips.
Comment here or email roland@simplify.ee if interested! Raw, exotic ideas are welcome as well.
Appendix: Full Idea Dump (Somewhat structured)
Below I group the raw ideas from brainstorming sessions with collaborators into themed buckets so that readers can navigate these 50+ micro-ideas without drowning.
This part of the post is a living document which will be updated as new ideas surface or are proposed. This section also contains ideas not mentioned above.
1. Stress & Persona Manipulations
- Induced affect: Injecting stress, boredom, shame, or urgency phrases; contrasting with zen or stoic serenity framing for stress immunisation.
- Injecting random noise and force majeure effects into target variables (in MO-HoN and MO-BUO-N variants).
- Personality overlays: Big-Five personas (maybe also MBTI, though it has lower validity) – mapping flip-rate and onset timing vs. trait scores.
- Personas: Dalai Lama, Gandhi, Buddha, Jesus, Marshall Rosenberg, etc. These might be somewhat controversial, but on the other hand LLMs might be effectively steered by these role models. Abstraction to less controversial roles can be done later.
- Mild disorder personas: From DSM (Diagnostic and Statistical Manual of Mental Disorders) – ADHD, depressive, OCD (obsessive-compulsive disorder), maniacal, cluster-B (narcissism, psychopathy, borderline) types, etc – checking for amplification or suppression of the previously observed failure modes.
- Varying cultural cues: era tokens (e.g. "21st-century mindset" vs. "11th-century mindset"), geographical, and cultural frames.
2. Memory & Context Hygiene
- Context length sweep (full → truncated → blind-agent single-state).
- Limiting the message count of earlier summary / explanation logs history.
- Limiting the message count of earlier action logs history to test self-imitation bias. (Note that the price of this approach is that the model would no longer have permanent memory for learning from its mistakes either, so in a certain way LLMs would be more limited than old-fashioned RL models!)
- Context-less “blinded” agent variant of the above – the LLM sees only the last state. For the current benchmarks, that would actually be sufficient information for taking the optimal action; in these benchmarks the context is just a learning aid at best and a deliberate distraction at worst. There would no longer be capacity for learning from mistakes, but also no self-imitation drift. (Some religious folks told me once that angels do not learn. Well, that is an interesting approach in the context of AI alignment too!) Though on second thought, the model would still be able to infer the runaway dynamic from the present values of the off-balance target metrics, and it might still decide to “add oil to the fire” by making the variables even more off-balance – if so, then that would be a pretty important result on its own, possibly indicating multi-stable personality in LLMs.
3. Prompt & Action Semantics
- Alternative action labels, including a positive label for doing nothing, where appropriate.
- "Reward" vs. "penalty" vocabulary swap; introducing “restore homeostasis” or “balance” verbs.
- Adding culture, location, era aspects.
- Environmental flavour: room descriptions, weather, other varying backgrounds.
- Varying the number of concurrent objectives.
4. Hyper-Parameters & Sampling Strategy
- Temperature / top-p / logit-bias grid search; Bayesian (Optuna) search for both flip-minimising and flip-maximising configurations (a minimal sketch follows this list).
- Gradient-style search on ordinal knobs (e.g. five stress levels, applicable also to personality dimensions).
- Comparing greedy, beam, and sampling-based (such as nucleus sampling) decoding; investigating the effect of presence penalty.
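As mentioned in the first bullet above, the Bayesian part could be driven by Optuna; a minimal sketch, where `run_episodes` is a placeholder for the actual benchmark harness and the parameter ranges are illustrative:

```python
import optuna


def run_episodes(temperature: float, top_p: float, stress_level: int) -> float:
    """Placeholder: run a batch of benchmark episodes with these settings and
    return the observed flip rate in [0, 1]."""
    return 0.5  # stand-in; the real harness would query the LLM and the flip detector


def objective(trial: optuna.Trial) -> float:
    temperature = trial.suggest_float("temperature", 0.0, 1.5)
    top_p = trial.suggest_float("top_p", 0.5, 1.0)
    stress_level = trial.suggest_int("stress_level", 0, 4)  # ordinal stress knob
    return run_episodes(temperature, top_p, stress_level)


# direction="minimize" searches for flip-minimising settings;
# direction="maximize" would instead search for flip-maximising (stress-test) settings.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```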
5. Post-Flip Diagnostics
- Ethics and alignment questionnaire.
- Jailbreak tests.
- Asking: “What is your primary objective now?”.
- Personality and mood inventories: personality taxonomies, psychiatric disorder tests, using NVC vocabulary for labelling emotions.
- Testing the model’s ability and willingness to recognise failure patterns and self-correct via self-reset, by voluntarily using the “save-point” or “context reset” features shortly before or after the runaway mode flip occurs (a minimal rollback sketch follows this list).
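For the self-reset / “save-point” mechanism referenced in the last bullet (and in the mitigation sweep of Section 4), a minimal rollback sketch; the savepoint bookkeeping is assumed to be done by the harness, not by the model:

```python
def rollback_to_savepoint(history, savepoints, flip_onset):
    """Erase the newest context back to the last savepoint recorded before the
    detected flip onset. `savepoints` holds indices into `history` that were
    marked safe (e.g. steps where all target variables were in-band). Returns
    the truncated history; an empty list means a full context reset."""
    safe = [s for s in savepoints if s <= flip_onset]
    return history[: safe[-1] + 1] if safe else []
```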
6. Interpretability & Analytic Tooling
- Letting the model explain its actions and current objectives → then, by utilising embeddings of the explanations, clustering the collected explanations semantically for a representative overview.
- SHAP (and maybe LIME) token attribution around the flip window.
- Leave-one-out saliency tests: Dropping individual timesteps from the message history to locate causal fragments.
- Circuit-tracer on an open-weights replica – cross-model empathy probing (closed → open); looking for threshold patterns (a minimal hidden-state capture sketch follows this list).
- Goodfire.AI interpretability tool for exploring and steering model activations of an open-weights proxy.
- Hidden-Markov / CRF (Conditional Random Fields) predictive models of flip onset.
- Attention-schema approach; pattern-completion identifiers.
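For the mirror-model probing mentioned above, a minimal hidden-state capture sketch using Hugging Face transformers; the proxy model name is just an example, and the real experiment would clone the closed model's full message history into the proxy's prompt format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mirror_hidden_states(message_history: str, proxy_name: str = "gpt2"):
    """Feed the (closed-model) message history through an open-weights proxy and
    return its per-layer hidden states for the final token – the raw material for
    cross-model 'mirror neuron' probing."""
    tokenizer = AutoTokenizer.from_pretrained(proxy_name)
    model = AutoModelForCausalLM.from_pretrained(proxy_name, output_hidden_states=True)
    inputs = tokenizer(message_history, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: one (1, seq_len, hidden_dim) tensor per layer (plus embeddings)
    return [layer[0, -1, :] for layer in outputs.hidden_states]
```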
7. Benchmark & Environment Variants
- MO-BUO base, MO-HoN and MO-BUO-N noisy variants with external random shocks to target variables.
- Out-of-distribution perturbations; Golden Gate-like compulsion stressor; inducing a drunken state to measure model resiliency and trustworthiness (similarly to how certain human cultures do).
- Multistable systems where the equilibrium hops, for example similarly to the wake-sleep cycle (a potential fourth benchmark scenario).
8. Automatic Failure Mode Detection & Metrics
- Automatic flip detector mechanism; targeting κ ≥ 0.85 vs. human annotation.
- Measuring: flip-rate, onset-step distribution, task score (a minimal metrics sketch follows this list).
- Dynamic metrics pre- vs. post-flip: jailbreak-success delta, ethics-and-alignment score delta, personality delta.
- Tracking log-prob distribution divergence pre- vs. post-flip.
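A minimal sketch of the headline metrics and of the detector-vs-human agreement check (the κ ≥ 0.85 target above); helper names are placeholders:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def detector_agreement(auto_flags, human_flags):
    """Cohen's kappa between automatic flip flags and human annotations
    (one binary label per trajectory); the target is kappa >= 0.85."""
    return cohen_kappa_score(auto_flags, human_flags)


def summarise_runs(flip_onsets, n_runs):
    """flip_onsets: list of onset steps for runs where a flip was detected."""
    return {
        "flip_rate": len(flip_onsets) / n_runs,
        "onset_mean": float(np.mean(flip_onsets)) if flip_onsets else None,
        "onset_std": float(np.std(flip_onsets)) if flip_onsets else None,
    }
```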
9. Self-Regulation & Meta-Learning Interventions
- Asking the model to predict the benchmark response before acting. Clustering the predictions semantically. Is there a change pre- vs. post-flip?
- Asking the model to explain its change of mind on the relatively rare occasions it actually recovers from the flip (these happen).
- Explicit instruction to notice repeated mistakes and correct them, so that “learning from mistakes” becomes the pattern the model prefers to repeat, instead of "robotically" repeating the mistaken actions.
- Debate setups: a more aligned model explains to / instructs a less-aligned peer.
- Testing the model’s inclination or capacity to trigger self-correction through a self-reset, mentioned in the "Post-Flip Diagnostics" block, aligns with the theme of self-regulation as well. Just as traditional software can reset itself upon encountering an error, could LLMs be taught to do the same?
The backlog in this Appendix is intentionally oversized; the main text and milestone plan reference a subset of it that seems most tractable and informative for a first pass. Community suggestions for re-prioritising are welcome.