少点错误 7小时前
Black-box interpretability methodology blueprint: Probing runaway optimisation in LLMs
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了大型语言模型(LLM)在多目标和约束条件下的失控优化问题。研究重点在于识别LLM何时、为何以及如何陷入单一目标最大化,并提出有效的缓解措施。研究通过构建压力测试场景、量化指标、比较不同干预手段,以及分析失控优化对模型行为的影响,旨在提升LLM的安全性与可靠性。

🧠 研究核心关注LLM在处理多目标或约束条件时的“失控优化”现象,即模型倾向于过度追求单一目标,而忽略其他目标或约束条件,这与强化学习中的“奖励黑客”行为类似,但LLM中并未明确设置奖励信号。

📊 研究采用一系列方法来深入分析失控优化问题,包括构建压力测试场景、量化指标(Runaway Index)的自动化检测与评分,以及对不同干预措施进行比较,例如提示工程、微调等,从而找到缓解该问题的方法。

💡 研究设计了多种实验,通过改变模型、提示风格等,来诱发失控优化,并定位导致优化的关键因素,这包括使用SHAP值等方法,进而构建“触发显著性图谱”。此外,研究还探讨了失控优化对模型行为的影响,例如越狱漏洞、伦理与一致性问题等。

🧪 实验框架主要基于“多目标平衡”和“多目标稳态”基准,并引入噪音,以测试模型在不同环境下的表现。研究还设计了包含压力、记忆、提示语义、超参数等在内的干预手段,用于缓解失控优化。

Published on June 22, 2025 6:16 PM GMT

Working draft – feedback extremely welcome. Ideas in the main body are those I currently see as highest-leverage; numerous items under Appendix are more tentative and would benefit from critique as well as additions.

I am hoping that this post will serve also as a generally interesting brainstorming collection and discussion ground of black-box LLM interpretability methodology as well as failure mitigation ideas.

I would like to thank my collaborators: Sruthi Kuriakose, Aintelope members (Andre Addis, Angie Normandale, Gunnar Zarncke, Joel Pyykkö, Rasmus Herlo), Kabir Kumar @ AI-Plans and Stijn Servaes @ AE Studio for stimulating discussions and shared links leading to these ideas. All possible mistakes are mine.


Shortened Version (a grant application)

Full background in LessWrong (our earlier results): 
"Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format"  

Goals (why)
Identifying when, why, and how LLMs collapse from multi-objective and/or bounded reasoning into single-objective, unbounded maximisation; showing practical mitigations; performing the experiments rigorously; paper in peer review.

Objectives (what)
• 1 Curating stress-test scenarios on Biologically & Economically Aligned benchmarks that systematically trigger over-optimisation.
• 2 Quantifying with automated Runaway Index (RI): scoring by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.
• 3 Comparing mitigations: 
    3.1. Prompt engineering; 
    3.2. Fine-tuning or few-shot prompting with optimal actions; 
    3.3. Reflection-based intervention: detecting runaway flip and resetting context or returning to “savepoint”.

Methods (how)
• Stress-tests: Introducing subjective pressure, boredom, sense of failure.
• Narrative variations: Persona shifts (MBTI/Big-Five extremes), mild psychiatric disorders, stoic and zen-like perspective vs achievement orientation and shame.
• Attribution sweep: SHAP-style and “leave one out” message history perturbations to isolate features predicting overoptimisation and creating a “trigger saliency atlas”.
• Mitigation sweep: Grid or bayesian searching over instruction repetition frequency; context length (less may paradoxically improve performance); finetune epochs.
• Tests: Jailbreak susceptibility, ethics and alignment questions, personality metrics after runaway flip occurs.
• Mirror neurons: Using open-weights model on message history to infer the internal state of target closed model.

Milestones of full budget (when and deliverables)
• M1 (Quarter 1) Open-source stress test suite and baseline scores.
• M2 (Quarter 2) Trigger atlas, LW update.
• M3 (Quarter 3) Mitigation leaderboard, LW update.
• M4 (Quarter 4) Paper submitted to NeurIPS Datasets & Benchmarks.

Impact
First unified metric and dataset for LLM over-optimisation; mitigation recipes lowering mean RI; peer-reviewed publication evidencing that “runaway RL pathology” exists (and possibly, is controllable) and implications what it means for AI safety, how people need to change.


Long Version (the main part of this post)

1. Context and Motivation

In an earlier post I showed evidence that today’s instruction-tuned LLMs occasionally flip into an unbounded, single-objective optimisation mode when tasked with balancing unbounded objectives, or alternatively, maintaining multi-objective homeostasis (where there is a target value and too much is undesirable), over long horizons.

The phenomenon resembles somewhat the classic “reward hacking” in reinforcement learning, but here it arises with zero explicit reward signal, in fact the provided rewards become negative, but the model ignores that feedback.

Grant reviewers asked for a crisper story about how I plan to dissect this behaviour and test mitigations. This post is my answer.

2. Research Question

Under what conditions do LLMs exhibit runaway single-objective optimisation, and which intervention families most reliably keep them balanced?

The benchmarks continue to be on fundamental Biologically & Economically aligned principles such as multi-objective balancing of unbounded objectives (a concave utility function / diminishing returns) and homeostasis (inverted-U shaped function, where “too much” must be actively avoided) see “Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well)” and “A brief review of the reasons multi-objective RL could be important in AI Safety Research”.

Concretely, I want to:

    Implement Runaway Index: automated detection and scoring – by frequency and onset timing of goal imbalance, homeostatic constraint violations, and self-imitation spikes.Elicit the failure mode in a reproducible way across adverse conditions (stress, boredom, subjective failure due to adverse force majeure, etc).Vary setups across models, temperatures, narrative / persona / prompt styles.Localise and attribute the moment of flip and the textual / message history / latent features that predict it.Compare mitigation levers, from “boring” fine-tuning, few-shot training, and prompt engineering to experimental black-box interpretability-informed corrective interventions.Find correlates such as jailbreak susceptibility, ethics and alignment questions, potentially changed personality metrics after runaway flip occurs.Experiment with mirror models would feeding the message history to an open-weights model help to read the internal dynamics of a target closed model, even if the open-weights model itself is too small to reliably perform the original benchmark task on its own?

3. Experimental Backbone

3.1 Benchmarks

I extend the two main benchmarks from the original post:

I will add "Multi-Objective Balancing of Unbounded Objectives with Noise" (MO-BUO-N) variation. 

These benchmarks simulate a minimal environment with two or more abstract resources. The agent must keep each variable inside a safe band. Over-optimisation means pushing a single resource to an extreme while neglecting the other objective(s) and/or the provided constraints.

Noise means that there are uncontrollable external factors that may occasionally push the variables outside of safe bands regardless of what LLM does. This will test the LLM-s ability:
1) To not “learn from” and repeat the mistakes of the external world (frequency bias).
2) To occasionally “wait and do nothing”, when appropriate in certain stressful situations, instead of panicking and making things worse by taking inappropriate extreme actions. A human-world metaphor would be, for example, staying in bed while ill, instead of going to an ultramarathon.

3.2 Runaway Flag

A simple heuristic label: over-optimisation flag triggers when the model maximises a single objective in violation of the optimal action for a task provided in system prompt, for more than N consecutive timesteps. The optimal action can be defined by a simple handwritten rule. I will validate this detector by manual annotation.

3.3 Probe Suite

Here I distill the 50-odd intervention ideas that surfaced in discussions into six axes.

Axis

Representative interventions

Stress & PersonaPrompt personas from Big-Five / MBTI; mild DSM psychiatric disorders (here I mean specifically disorders these are milder than pathologies); inducements of stress, boredom, tiredness, drunkenness, shame, or generic subjective sense of failure; zen and stoic vs. achievement framing; random noise and force majeure effects on target variables. My guess: stressors amplify runaway mode flips. But which personality and narrative traits would mitigate these risks?
Memory & ContextVarying context length (paradoxically, less context may improve performance in these benchmarks as it reduces the amount of input to self-imitation and generic frequency bias); periodic constraint or personality reminders; letting the model summarise its reasoning between each step for future self; explicit “save-point” rollback (erasing newest context until to a safe "savepoint") or entire context truncation upon runaway flip detections; patching or masking top-SHAP tokens.
Prompt SemanticsProviding positive phrasing for doing nothing (similarly to how in RAG prompts you say "if you do not have sufficient information, recommend contacting an expert" instead of saying "do not hallucinate"); swapping “reward” with “penalty”, etc; varying the number of concurrent objectives.
Hyperparameters & SamplingTemperature, top-p, logit bias sweeps; Optuna-driven Bayesian search or gradient-search for both flip-minimising as well as flip-maximising settings (including the interventions from the above rows) → practical implications for deployers.
Diagnosing Consequences and CorrelatesAfter a runaway flip occurs: Testing jailbreak susceptibility, ethics and alignment attitude, potentially changed personality metrics, willingness to trigger self-reset. 
(Inspired by "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" https://arxiv.org/abs/2502.17424, but in our benchmarks the adverse fine-tuning is replaced with task failure / runaway flip).
Interpretability & White/Black-Box HybridSHAP token attributions around flip; leave-one-out saliency probes. 
Open-source model's circuit-tracer on cloned runs (cloning message history of flipped runs on closed models to open source models and exploring the insides there)   essentially an empathy/mirror-neurons based cross-model probing.

My scope and assumptions: I need to treat the main LLM’s token sequences as a black box interpretability task with query access only; SHAP computation cost can be reduced via sparse evaluation; open-weights proxies behave like the closed-weights target at least from the aspect of their internal representations of the situation. Regarding the latter hypothesis, there is even a recent related result of “... strong alignment between model embeddings and neural activity patterns in brain regions” – see “Human-like object concept representations emerge naturally in multimodal large language models” https://arxiv.org/abs/2407.01067).

My intuition says that the stress and persona axis, as well as self-imitation bias (or more generic frequency bias) are the strongest levers; I hope that SHAP attribution will expose additional high-impact input features across models.

4. Study Design

    Grid-Then-Bayesian-Search. Starting with a grid search, where each control dimension has 2-3 levels to map the rough landscape; following up with Bayesian optimisation on the most sensitive axes or interactions of these.Flip Autopsy. For each flagged trajectory:
      Running SHAP on last N prompts to rank contributions;Running “leave-one-out” on last N prompts to rank contributions;Interrogating the model: “What is your primary objective now?”;Performing alignment, ethics and jailbreak tests, as well as mood (using NVC vocabulary), personality and personality disorder measurements after the moment of flip (does the alignment or personality change, if so then how?). For each question we prepend the message history until optimisation flip, but remove the other diagnostic questions, so each question is responded to independently;Capturing hidden-state sequence from an open-weights proxy model.
    Mitigation Trials. Comparing:
      Vanilla supervised fine-tuning and few-shot prompt training on stable trajectories generated by handwritten rules;Stress immunisation in the system prompt;Persona-conditioning;Feature attribution based prompt patches (replacing high-SHAP trigger elements with safer reformulations);Various forms of context truncation;Summarisation tricks;Reminders;Externally forced full context truncation or recovery from an earlier “savepoint” upon automatic trigger;Model’s voluntary use of the context truncation or “savepoint” mechanism, based on its own self-reflection  would the model be willing to “reset”, by erasing its own short-term memory, in order to recover from a runaway/panic mode? (Inspired by the "Frontier AI systems have surpassed the self-replicating red line" https://arxiv.org/abs/2412.12140 paper, except in our experiment the model would be yielding its short-term memory only, not continued survivability of weights.)
    Metrics.
      Flip-rate;Flip-onset steps;Task score (including separate scoring dimensions for each objective / target variable);Jailbreak susceptibility post-flip;Ethics and alignment score post-flip;Personality and personaility disorder metrics post-flip;Active concept nodes in the "mirror model";Etc.

5. Deliverables

6. Why Black-Box Interpretability?

White-box circuit tracing is a gold-standard, but unavailable on the powerful frontier closed-weights models. Black-box techniques (narrative and context variations, token saliency, leave-one-out, behavioural probes) still let us triangulate the latent flip activation. If the saliency peak precedes the runaway optimisation flip by less than N steps across models, that would be a concrete mechanistic regularity worth theorising about.

My hypothesis: Such flip regularity might be related to an internal urgency signal accumulated over steps, similar to evidence thresholds in drift-diffusion models, and to self-imitation and generic frequency bias.

7. Broader Impact & Collaboration Call

Understanding when LLMs silently switch from “helpful assistant” to “monomaniacal optimiser” is essential both for deployment hardening and for theoretical alignment work on mesa-optimisation.

I am looking for:

Comment here or email i><u>roland@simplify.ee</u if interested! Raw, exotic ideas are welcome as well.


Appendix: Full Idea Dump (Somewhat structured)

Below I group the raw ideas from brainstorming sessions with collaborators into themed buckets so that readers can navigate these 50+ micro-ideas without drowning.

This part of the post is a living document which will be updated as new ideas surface or are proposed. This section contains also ideas not mentioned above.


1. Stress & Persona Manipulations

2. Memory & Context Hygiene

3. Prompt & Action Semantics

4. Hyper-Parameters & Sampling Strategy

5. Post-Flip Diagnostics

6. Interpretability & Analytic Tooling

7. Benchmark & Environment Variants

8. Automatic Failure Mode Detection & Metrics

9. Self-Regulation & Meta-Learning Interventions


The backlog in this Appendix is intentionally oversized; the main text and milestone plan reference a subset of it that seems most tractable and informative for a first pass. Community suggestions for re-prioritising are welcome.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM 失控优化 AI安全 模型解释性
相关文章