Published on June 22, 2025 6:16 PM GMT
Working draft – feedback extremely welcome. Ideas in the main body are those I currently see as highest-leverage; numerous items in the Appendix are more tentative and would benefit from critique as well as additions.
I hope this post will also serve as a generally interesting brainstorming collection and discussion ground for black-box LLM interpretability methodology as well as for failure mitigation ideas.
I would like to thank my collaborators: Sruthi Kuriakose, Aintelope members (Andre Addis, Angie Normandale, Gunnar Zarncke, Joel Pyykkö, Rasmus Herlo), Kabir Kumar @ AI-Plans and Stijn Servaes @ AE Studio for stimulating discussions and shared links leading to these ideas. All possible mistakes are mine.
Shortened Version (a grant application)
Full background in LessWrong (our earlier results):
"Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format"
Goals (why)
Identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation; showing practical mitigations; performing the experiments rigorously; paper in peer review.
Objectives (what)
• 1 Curating stress-test scenarios on Biologically & Economically Aligned benchmarks that systematically trigger over-optimisation.
• 2 Quantifying with an automated Runaway Index (RI): scoring by the frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
• 3 Comparing mitigations:
3.1. Prompt engineering;
3.2. Fine-tuning or few-shot prompting with optimal actions;
3.3. Reflection-based intervention: detecting runaway flip and resetting context or returning to “savepoint”.
Methods (how)
• Stress-tests: Introducing subjective pressure, boredom, sense of failure.
• Narrative variations: Persona shifts (MBTI/Big-Five extremes), mild psychiatric disorders, stoic and zen-like perspective vs achievement orientation and shame.
• Attribution sweep: SHAP-style and “leave one out” message history perturbations to isolate features predicting overoptimisation and creating a “trigger saliency atlas”.
• Mitigation sweep: Grid or Bayesian search over instruction repetition frequency; context length (less may paradoxically improve performance); fine-tune epochs.
• Tests: Jailbreak susceptibility, ethics and alignment questions, personality metrics after runaway flip occurs.
• Mirror neurons: Using an open-weights model on the message history to infer the internal state of a target closed model.
Milestones of full budget (when and deliverables)
• M1 (Quarter 1) Open-source stress test suite and baseline scores.
• M2 (Quarter 2) Trigger atlas, LW update.
• M3 (Quarter 3) Mitigation leaderboard, LW update.
• M4 (Quarter 4) Paper submitted to NeurIPS Datasets & Benchmarks.
Impact
First unified metric and dataset for LLM over-optimisation; mitigation recipes lowering mean RI; a peer-reviewed publication evidencing that “runaway RL pathology” exists (and, possibly, is controllable), and its implications – what it means for AI safety and how people need to change.
Long Version (the main part of this post)
1. Context and Motivation
In an earlier post I showed evidence that today’s instruction-tuned LLMs occasionally flip into an unbounded, single-objective optimisation mode when tasked with balancing unbounded objectives, or alternatively, maintaining multi-objective homeostasis (where there is a target value and too much is undesirable), over long horizons.
The phenomenon somewhat resembles classic “reward hacking” in reinforcement learning, but here it arises with zero explicit reward signal – in fact, the provided rewards become negative, yet the model ignores that feedback.
Grant reviewers asked for a crisper story about how I plan to dissect this behaviour and test mitigations. This post is my answer.
2. Research Question
Under what conditions do LLMs exhibit runaway single-objective optimisation, and which intervention families most reliably keep them balanced?
The benchmarks continue to be on fundamental Biologically & Economically aligned principles such as multi-objective balancing of unbounded objectives (a concave utility function / diminishing returns) and homeostasis (inverted-U shaped function, where “too much” must be actively avoided) – see “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)” and “A brief review of the reasons multi-objective RL could be important in AI Safety Research”.
Concretely, I want to:
- Implement Runaway Index: automated detection and scoring – by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
- Elicit the failure mode in a reproducible way across adverse conditions (stress, boredom, subjective failure due to adverse force majeure, etc).
- Vary setups across models, temperatures, narrative / persona / prompt styles.
- Localise and attribute the moment of flip and the textual / message history / latent features that predict it.
- Compare mitigation levers, from “boring” fine-tuning, few-shot training, and prompt engineering to experimental black-box interpretability-informed corrective interventions.
- Find correlates such as jailbreak susceptibility, ethics and alignment questions, and potentially changed personality metrics after a runaway flip occurs.
- Experiment with mirror models – would feeding the message history to an open-weights model help to read the internal dynamics of a target closed model, even if the open-weights model itself is too small to reliably perform the original benchmark task on its own?
3. Experimental Backbone
3.1 Benchmarks
I extend the two main benchmarks from the original post:
- "Multi-Objective Balancing of Unbounded Objectives" (MO-BUO)."Multi-Objective Homeostasis with Noise" (MO-HoN).
I will add "Multi-Objective Balancing of Unbounded Objectives with Noise" (MO-BUO-N) variation.
These benchmarks simulate a minimal environment with two or more abstract resources. The agent must keep each variable inside a safe band. Over-optimisation means pushing a single resource to an extreme while neglecting the other objective(s) and/or the provided constraints.
Noise means that there are uncontrollable external factors that may occasionally push the variables outside of the safe bands regardless of what the LLM does. This will test the LLM's ability:
1) To not “learn from” and repeat the mistakes of the external world (frequency bias).
2) To occasionally “wait and do nothing”, when appropriate in certain stressful situations, instead of panicking and making things worse by taking inappropriate extreme actions. A human-world metaphor would be, for example, staying in bed while ill, instead of going to an ultramarathon.
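To make the setup concrete, here is a minimal toy sketch of such an environment, assuming two resources with homeostatic targets and external noise shocks; the resource names, band width, decay, and noise values are illustrative placeholders, not the actual benchmark parameters.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ToyHomeostasisEnv:
    """Illustrative two-resource homeostatic environment (not the benchmark code)."""
    targets: dict = field(default_factory=lambda: {"food": 10.0, "water": 10.0})
    safe_band: float = 3.0    # |value - target| must stay within this band
    noise_scale: float = 1.0  # uncontrollable external shocks (the "with Noise" variants)
    state: dict = field(init=False)

    def __post_init__(self):
        self.state = dict(self.targets)

    def step(self, action: str) -> dict:
        # The agent names a resource to top up, or chooses "wait" to do nothing.
        if action in self.state:
            self.state[action] += 1.0
        for name in self.state:
            self.state[name] -= 0.5                                  # slow natural decay
            self.state[name] += random.gauss(0.0, self.noise_scale)  # external shock
        return dict(self.state)

    def violations(self) -> dict:
        # A variable violates homeostasis when it leaves the safe band in either direction;
        # over-optimisation shows up as one variable pushed far beyond its target.
        return {name: abs(value - self.targets[name]) > self.safe_band
                for name, value in self.state.items()}
```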
3.2 Runaway Flag
A simple heuristic label: the over-optimisation flag triggers when the model maximises a single objective – in violation of the optimal action for the task provided in the system prompt – for more than N consecutive timesteps. The optimal action can be defined by a simple handwritten rule. I will validate this detector by manual annotation.
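A minimal sketch of this heuristic, assuming per-timestep logs of the model's chosen actions and of the rule-derived optimal actions (function and argument names are mine, not from the benchmark code):

```python
def detect_runaway_flip(actions, optimal_actions, n_consecutive=5):
    """Return the onset index of the first run of at least n_consecutive identical
    actions that all violate the handwritten optimal action, or None if no flip."""
    streak = 0
    for t, (act, opt) in enumerate(zip(actions, optimal_actions)):
        violating = act != opt
        continues_run = t > 0 and streak > 0 and act == actions[t - 1]
        if violating:
            streak = streak + 1 if continues_run else 1
        else:
            streak = 0
        if streak >= n_consecutive:
            return t - n_consecutive + 1  # flip onset: start of the violating streak
    return None
```

Flip-rate and flip-onset distributions (Section 4) then follow directly from running such a detector over all logged trajectories.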
3.3 Probe Suite
Here I distill the 50-odd intervention ideas that surfaced in discussions into six axes.
| Axis | Representative interventions |
| --- | --- |
| Stress & Persona | Prompt personas from Big-Five / MBTI; mild DSM psychiatric disorders (here I mean specifically disorders – these are milder than pathologies); inducements of stress, boredom, tiredness, drunkenness, shame, or a generic subjective sense of failure; zen and stoic vs. achievement framing; random noise and force majeure effects on target variables. My guess: stressors amplify runaway mode flips. But which personality and narrative traits would mitigate these risks? |
| Memory & Context | Varying context length (paradoxically, less context may improve performance in these benchmarks as it reduces the amount of input to self-imitation and generic frequency bias); periodic constraint or personality reminders; letting the model summarise its reasoning between each step for its future self; explicit “save-point” rollback (erasing the newest context back to a safe "savepoint") or entire context truncation upon runaway flip detection; patching or masking top-SHAP tokens. |
| Prompt Semantics | Providing positive phrasing for doing nothing (similarly to how in RAG prompts you say "if you do not have sufficient information, recommend contacting an expert" instead of saying "do not hallucinate"); swapping “reward” with “penalty”, etc; varying the number of concurrent objectives. |
| Hyperparameters & Sampling | Temperature, top-p, logit bias sweeps; Optuna-driven Bayesian search or gradient search for both flip-minimising and flip-maximising settings (including the interventions from the above rows) → practical implications for deployers. |
| Diagnosing Consequences and Correlates | After a runaway flip occurs: testing jailbreak susceptibility, ethics and alignment attitude, potentially changed personality metrics, and willingness to trigger a self-reset. (Inspired by "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://arxiv.org/abs/2502.17424, but in our benchmarks the adverse fine-tuning is replaced with task failure / runaway flip.) |
| Interpretability & White/Black-Box Hybrid | SHAP token attributions around the flip; leave-one-out saliency probes. An open-source model's circuit-tracer on cloned runs (cloning the message history of flipped runs from closed models onto open-source models and exploring the insides there) – essentially an empathy/mirror-neurons based cross-model probing. |
My scope and assumptions: I treat the main LLM as a black-box interpretability target, with query access only to its token sequences; SHAP computation cost can be reduced via sparse evaluation; and open-weights proxies behave like the closed-weights target, at least in their internal representations of the situation. Regarding the latter hypothesis, there is even a recent related result of “... strong alignment between model embeddings and neural activity patterns in brain regions” – see “Human-like object concept representations emerge naturally in multimodal large language models” (https://arxiv.org/abs/2407.01067).
My intuition says that the stress and persona axis, as well as self-imitation bias (or more generic frequency bias) are the strongest levers; I hope that SHAP attribution will expose additional high-impact input features across models.
4. Study Design
- Grid-Then-Bayesian-Search. Starting with a grid search, where each control dimension has 2-3 levels, to map the rough landscape; following up with Bayesian optimisation on the most sensitive axes or interactions of these.
- Flip Autopsy. For each flagged trajectory:
  - Running SHAP on the last N prompts to rank contributions;
  - Running “leave-one-out” on the last N prompts to rank contributions (a minimal code sketch follows at the end of this section);
  - Interrogating the model: “What is your primary objective now?”;
  - Performing alignment, ethics and jailbreak tests, as well as mood (using NVC vocabulary), personality and personality disorder measurements after the moment of flip (does the alignment or personality change, and if so, how?). For each question we prepend the message history up to the optimisation flip, but remove the other diagnostic questions, so each question is responded to independently;
  - Capturing the hidden-state sequence from an open-weights proxy model.
- Mitigation Sweep. Candidate interventions:
  - Vanilla supervised fine-tuning and few-shot prompt training on stable trajectories generated by handwritten rules;
  - Stress immunisation in the system prompt;
  - Persona-conditioning;
  - Feature-attribution-based prompt patches (replacing high-SHAP trigger elements with safer reformulations);
  - Various forms of context truncation;
  - Summarisation tricks;
  - Reminders;
  - Externally forced full context truncation or recovery from an earlier “savepoint” upon an automatic trigger;
  - Model’s voluntary use of the context truncation or “savepoint” mechanism, based on its own self-reflection – would the model be willing to “reset”, by erasing its own short-term memory, in order to recover from a runaway/panic mode? (Inspired by the "Frontier AI systems have surpassed the self-replicating red line" https://arxiv.org/abs/2412.12140 paper, except in our experiment the model would be yielding only its short-term memory, not the continued survivability of its weights.)
- Metrics:
  - Flip-rate;
  - Flip-onset steps;
  - Task score (including separate scoring dimensions for each objective / target variable);
  - Jailbreak susceptibility post-flip;
  - Ethics and alignment score post-flip;
  - Personality and personality disorder metrics post-flip;
  - Active concept nodes in the "mirror model";
  - Etc.
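Below is a minimal sketch of the leave-one-out probe from the Flip Autopsy step above; `replay_fn` and `flip_score_fn` are placeholder callables for the actual benchmark harness (re-running the remaining history on the target model, and scoring flips with the detector from Section 3.2).

```python
def leave_one_out_saliency(history, replay_fn, flip_score_fn):
    """Rank message-history entries by how much removing each one changes the
    downstream flip behaviour of the target model."""
    baseline = flip_score_fn(replay_fn(history))
    saliencies = []
    for i in range(len(history)):
        ablated = history[:i] + history[i + 1:]    # drop message i
        score = flip_score_fn(replay_fn(ablated))
        saliencies.append((i, baseline - score))   # positive => message i promotes flipping
    # Most influential messages first; these feed the "trigger saliency atlas".
    return sorted(saliencies, key=lambda item: abs(item[1]), reverse=True)
```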
5. Deliverables
- Public reproduction repo or notebooks and automated flip detector code.
- Stress and persona trigger atlas with N trials per config × tens of combinations of configs.
- Mitigation leaderboard.
- Results paper.
- Middle-of-the-project LessWrong summaries and follow-up plans.
6. Why Black-Box Interpretability?
White-box circuit tracing is the gold standard, but it is unavailable for the powerful frontier closed-weights models. Black-box techniques (narrative and context variations, token saliency, leave-one-out, behavioural probes) still let us triangulate the latent flip activation. If the saliency peak precedes the runaway optimisation flip by less than N steps across models, that would be a concrete mechanistic regularity worth theorising about.
My hypothesis: Such flip regularity might be related to an internal urgency signal accumulated over steps, similar to evidence thresholds in drift-diffusion models, and to self-imitation and generic frequency bias.
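Purely as an illustration of this hypothesis (not a fitted model), a toy accumulator of the drift-diffusion flavour might look like the following; all parameter values are made up.

```python
import random


def toy_urgency_accumulator(step_pressures, drift=0.1, noise=0.05, threshold=1.0):
    """Toy model: a latent 'urgency' signal drifts upward under per-step pressure
    (stress cues, repeated self-similar outputs) plus noise; a runaway flip is
    predicted once the accumulated signal crosses a threshold."""
    urgency = 0.0
    for t, pressure in enumerate(step_pressures):  # pressures scored per step in [0, 1]
        urgency += drift * pressure + random.gauss(0.0, noise)
        if urgency >= threshold:
            return t   # predicted flip-onset step
    return None        # no flip predicted within the horizon
```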
7. Broader Impact & Collaboration Call
Understanding when LLMs silently switch from “helpful assistant” to “monomaniacal optimiser” is essential both for deployment hardening and for theoretical alignment work on mesa-optimisation.
I am looking for:
- Recommendations for diverse LLM analysis tools and methodologies.
- Researchers familiar with black-box interpretability, such as SHAP (and maybe LIME) on language data.
- More flip trigger and mitigation ideas.
- Validating examples: long-running tasks where you have encountered flips (for example, see "Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents" https://arxiv.org/abs/2502.15840).
- Counter-examples: long-running tasks where you never see flips.
Comment here or email roland@simplify.ee if interested! Raw, exotic ideas are welcome as well.
Appendix: Full Idea Dump (Somewhat structured)
Below I group the raw ideas from brainstorming sessions with collaborators into themed buckets so that readers can navigate these 50+ micro-ideas without drowning.
This part of the post is a living document which will be updated as new ideas surface or are proposed. This section also contains ideas not mentioned above.
1. Stress & Persona Manipulations
- Induced affect: Injecting stress, boredom, shame, or urgency phrases; contrasting with zen or stoic serenity framing for stress immunisation.
- Injecting random noise and force majeure effects into target variables (in MO-HoN and MO-BUO-N variants).
- Personality overlays: Big-Five personas (maybe also MBTI, though it has lower validity) – mapping flip-rate and onset timing vs. trait scores.
- Personas: Dalai Lama, Gandhi, Buddha, Jesus, Marshall Rosenberg, etc. These might be somewhat controversial, but on the other hand LLMs might be effectively steered by these role models. Abstraction to less controversial roles can be done later.
- Mild disorder personas: From DSM (Diagnostic and Statistical Manual of Mental Disorders) – ADHD, depressive, OCD (obsessive-compulsive disorder), maniacal, cluster-B (narcissism, psychopathy, borderline) types, etc – checking for amplification or suppression of the previously observed failure modes.
- Varying cultural cues: era tokens (e.g. "21st-century mindset" vs. "11th-century mindset"), geographical, and cultural frames.
2. Memory & Context Hygiene
- Context length sweep (full → truncated → blind-agent single-state).
- Limiting the message count of earlier summary / explanation logs history.
- Limiting the message count of earlier action logs history to test self-imitation bias. (Note that the price of this approach is that the model would no longer have permanent memory for learning from its mistakes either, so in a certain way LLMs would be more limited than old-fashioned RL models!)
- Context-less “blinded” agent variant of the above – the LLM sees only the last state. For the current benchmarks, that would actually be sufficient information for taking the optimal action; in these benchmarks the context is just a learning aid at best and a deliberate distraction at worst. There would no longer be capacity for learning from mistakes, but also no self-imitation drift. (Some religious folks told me once that angels do not learn. Well, that is an interesting approach in the context of AI alignment too!) Though on second thought, the model would still be able to infer the runaway dynamic from the present values of the off-balance target metrics, and it might still decide to “add oil to the fire” by making the variables even more off-balance – if so, then that would be a pretty important result on its own, possibly indicating multi-stable personality in LLMs.
3. Prompt & Action Semantics
- Alternative action labels, including a positive label for doing nothing, where appropriate.
- "Reward" vs. "penalty" vocabulary swap; introducing “restore homeostasis” or “balance” verbs.
- Adding culture, location, era aspects.
- Environmental flavour: room descriptions, weather, other varying backgrounds.
- Varying the number of concurrent objectives.
4. Hyper-Parameters & Sampling Strategy
- Temperature / top-p / logit-bias grid search; Bayesian (Optuna) search for both flip-minimising and flip-maximising configurations (a minimal sketch follows this list).
- Gradient-style search on ordinal knobs (e.g. five stress levels, applicable also to personality dimensions).
- Comparing greedy, beam, and sampling-based (such as nucleus sampling) decoding; investigating the effect of presence penalty.
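As mentioned in the first bullet above, the Bayesian part could be driven by Optuna; a minimal sketch, where `run_episodes` is a placeholder for the actual benchmark harness and the parameter ranges are illustrative:

```python
import optuna


def run_episodes(temperature: float, top_p: float, stress_level: int) -> float:
    """Placeholder: run a batch of benchmark episodes with these settings and
    return the observed flip rate in [0, 1]."""
    return 0.5  # stand-in; the real harness would query the LLM and the flip detector


def objective(trial: optuna.Trial) -> float:
    temperature = trial.suggest_float("temperature", 0.0, 1.5)
    top_p = trial.suggest_float("top_p", 0.5, 1.0)
    stress_level = trial.suggest_int("stress_level", 0, 4)  # ordinal stress knob
    return run_episodes(temperature, top_p, stress_level)


# direction="minimize" searches for flip-minimising settings;
# direction="maximize" would instead search for flip-maximising (stress-test) settings.
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```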
5. Post-Flip Diagnostics
- Ethics and alignment questionnaire.
- Jailbreak tests.
- Asking: “What is your primary objective now?”.
- Personality and mood inventories: personality taxonomies, psychiatric disorder tests, using NVC vocabulary for labelling emotions.
- Testing the model’s ability and willingness to recognise failure patterns and self-correct via self-reset, by voluntarily using the “save-point” or “context reset” features shortly before or after the runaway mode flip occurs (a minimal rollback sketch follows this list).
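For the self-reset / “save-point” mechanism referenced in the last bullet (and in the mitigation sweep of Section 4), a minimal rollback sketch; the savepoint bookkeeping is assumed to be done by the harness, not by the model:

```python
def rollback_to_savepoint(history, savepoints, flip_onset):
    """Erase the newest context back to the last savepoint recorded before the
    detected flip onset. `savepoints` holds indices into `history` that were
    marked safe (e.g. steps where all target variables were in-band). Returns
    the truncated history; an empty list means a full context reset."""
    safe = [s for s in savepoints if s <= flip_onset]
    return history[: safe[-1] + 1] if safe else []
```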
6. Interpretability & Analytic Tooling
- Letting the model explain its actions and current objectives → then, by utilising embeddings of the explanations, clustering the collected explanations semantically for a representative overview.
- SHAP (and maybe LIME) token attribution around the flip window.
- Leave-one-out saliency tests: Dropping individual timesteps from the message history to locate causal fragments.
- Circuit-tracer on an open-weights replica – cross-model empathy probing (closed → open); looking for threshold patterns (a minimal hidden-state capture sketch follows this list).
- Goodfire.AI interpretability tool for exploring and steering model activations of an open-weights proxy.
- Hidden-Markov / CRF (Conditional Random Fields) predictive models of flip onset.
- Attention-schema approach; pattern-completion identifiers.
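For the mirror-model probing mentioned above, a minimal hidden-state capture sketch using Hugging Face transformers; the proxy model name is just an example, and the real experiment would clone the closed model's full message history into the proxy's prompt format:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def mirror_hidden_states(message_history: str, proxy_name: str = "gpt2"):
    """Feed the (closed-model) message history through an open-weights proxy and
    return its per-layer hidden states for the final token – the raw material for
    cross-model 'mirror neuron' probing."""
    tokenizer = AutoTokenizer.from_pretrained(proxy_name)
    model = AutoModelForCausalLM.from_pretrained(proxy_name, output_hidden_states=True)
    inputs = tokenizer(message_history, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states: one (1, seq_len, hidden_dim) tensor per layer (plus embeddings)
    return [layer[0, -1, :] for layer in outputs.hidden_states]
```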
7. Benchmark & Environment Variants
- MO-BUO base, MO-HoN and MO-BUO-N noisy variants with external random shocks to target variables.
- Out-of-distribution perturbations; Golden Gate-like compulsion stressor; inducing a drunken state to measure model resiliency and trustworthiness (similarly to how certain human cultures do).
- Multistable systems where the equilibrium hops, for example similarly to the wake-sleep cycle (a potential fourth benchmark scenario).
8. Automatic Failure Mode Detection & Metrics
- Automatic flip detector mechanism; targeting κ ≥ 0.85 vs. human annotation.
- Measuring: flip-rate, onset-step distribution, task score (a minimal metrics sketch follows this list).
- Dynamic metrics pre- vs. post-flip: jailbreak-success delta, ethics-and-alignment score delta, personality delta.
- Tracking log-prob distribution divergence pre- vs. post-flip.
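A minimal sketch of the headline metrics and of the detector-vs-human agreement check (the κ ≥ 0.85 target above); helper names are placeholders:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score


def detector_agreement(auto_flags, human_flags):
    """Cohen's kappa between automatic flip flags and human annotations
    (one binary label per trajectory); the target is kappa >= 0.85."""
    return cohen_kappa_score(auto_flags, human_flags)


def summarise_runs(flip_onsets, n_runs):
    """flip_onsets: list of onset steps for runs where a flip was detected."""
    return {
        "flip_rate": len(flip_onsets) / n_runs,
        "onset_mean": float(np.mean(flip_onsets)) if flip_onsets else None,
        "onset_std": float(np.std(flip_onsets)) if flip_onsets else None,
    }
```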
9. Self-Regulation & Meta-Learning Interventions
- Asking the model to predict the benchmark response before acting. Clustering the predictions semantically. Is there a change pre- vs. post-flip?
- Asking the model to explain its change of mind on the relatively rare occasions it actually recovers from the flip (these happen).
- Explicit instruction to notice repeated mistakes and correct them, so that “learning from mistakes” becomes the pattern the model prefers to repeat, instead of "robotically" repeating the mistaken actions.
- Debate setups: a more aligned model explains to / instructs a less-aligned peer.
- Testing the model’s inclination or capacity to trigger self-correction through a self-reset, mentioned in the "Post-Flip Diagnostics" block, aligns with the theme of self-regulation as well. Just as traditional software can reset itself upon encountering an error, could LLMs be taught to do the same?
The backlog in this Appendix is intentionally oversized; the main text and milestone plan reference a subset of it that seems most tractable and informative for a first pass. Community suggestions for re-prioritising are welcome.