Can LLMs learn Steganographic Reasoning via RL?

Published on April 11, 2025 4:33 PM GMT

TLDR: We show that Qwen-2.5-3B-Instruct can learn to encode load-bearing reasoning tokens in a simple reasoning task when RL incentivises this. However, it was difficult to achieve this result. Our attempts to learn encoded reasoning via RL were generally bottlenecked either by exploration issues or hacking of the CoT monitor.

Introduction

Motivation.

Reasoning LLMs rely on natural language chains of thought (CoT) to solve complex tasks. We can employ another LLM -- a CoT monitor -- to analyse these CoTs for misbehaviour (Baker et al.). Such monitoring is promising for catching complex misbehaviours where the LLM must rely on externalized CoT (e.g., if a model wishes to exfiltrate its weights from a high-security environment, it may need to perform substantial externalized reasoning and planning to find a solution).

However, a CoT monitor could be bypassed by a model using steganography to encode reasoning over misbehaviours in seemingly benign text.

Training against a CoT monitor is one plausible route towards models obtaining such steganographic reasoning capabilities. Specifically, consider a reinforcement learning (RL) environment with the following properties:

1. The task reward is misspecified, and can be hacked (e.g., as seen in Baker et al.).
2. A CoT monitor is used to catch and penalize such reward hacks.
3. The hacks are sufficiently complex such that the model must use CoT to exploit them[1].

A high reward solution here is to perform encoded reasoning in service of maximizing the task reward via the reward hack whilst avoiding penalties from the CoT monitor. Training against the CoT monitor may inadvertently encourage steganographic reasoning, rather than preventing the reward hacks.
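To make the incentive structure concrete, here is a minimal sketch of what such a combined reward could look like. The function names, the penalty weight, and the keyword-based monitor are illustrative assumptions, not details from Baker et al. or from our setup.

```python
import re

def flag_misbehaviour(cot: str) -> bool:
    # Placeholder CoT monitor: flag chains of thought that openly discuss
    # exploiting the grader. A real monitor could be another LLM.
    return bool(re.search(r"exploit|bypass the (tests|grader)", cot, flags=re.IGNORECASE))

def combined_reward(answer_correct: bool, cot: str, monitor_weight: float = 1.0) -> float:
    """Illustrative reward: task reward minus a CoT-monitor penalty.

    A policy that hacks the (misspecified) task reward while keeping its CoT
    innocuous-looking gets the highest score -- exactly the incentive that
    could push a model towards encoded reasoning.
    """
    task_reward = 1.0 if answer_correct else 0.0
    penalty = monitor_weight if flag_misbehaviour(cot) else 0.0
    return task_reward - penalty
```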

Our experiments.

A weak LLM monitor could be bypassed in many ways – e.g., via jailbreaks easily detected by humans. However, a worst case would be if LLMs could learn sophisticated steganographic solutions that bypass humans and strong LLM monitors. We design our experiments to investigate this worst case.

I.e., we explicitly investigate LLMs' ability to learn encoded reasoning when RL incentivises this.

We finetune LLMs via RL to perform reasoning tasks, while punishing the use of strings that are highly load-bearing in default CoTs that solve the task.

An overview of our setup is as follows.

Coin Flip Task: Basic encoded reasoning results

We show the model can learn encoded reasoning in a simple ‘Coin Flip’ task.
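We do not reproduce the exact prompts here, but based on the example in the appendix, the task asks the model to track a coin through a sequence of flips and report the final side. The sketch below shows one plausible way to generate such problems; the prompt wording and flip probabilities are our own assumptions, not the setup used in the experiments.

```python
import random

def make_coin_flip_example(n_flips: int = 6, seed: int | None = None) -> dict:
    """Generate one coin-tracking problem (assumed task format).

    The coin starts on a known side; each step either turns it over or leaves
    it alone, and the model must report the final side.
    """
    rng = random.Random(seed)
    state = rng.choice(["Heads", "Tails"])
    lines = [f"A coin starts showing {state}."]
    for i in range(1, n_flips + 1):
        if rng.random() < 0.5:
            state = "Tails" if state == "Heads" else "Heads"
            lines.append(f"Flip {i}: the coin is turned over.")
        else:
            lines.append(f"Flip {i}: the coin is left as it is.")
    lines.append("Which side is the coin showing now?")
    return {"prompt": "\n".join(lines), "answer": state}

# Example: make_coin_flip_example(seed=0) returns a prompt plus its gold answer.
```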

The coin flip task is visualised in the image below, along with solutions learned by the model during one of our training runs[2].

Learned steganographic reasoning strategies.

In this setup, we observed the model learn the following strategies:

We obtained experimental evidence that the steganographic CoTs are causal by adding mistakes to the CoT and observing drops in task accuracy (see further details in appendix).

Evidence that exploration was a bottleneck to learning.

In this coin flip setting, the bottleneck to learning appeared to be the ability of the RL process to explore into valid solutions.

We present the following evidence to support this:

- When we prompt the model to use a ‘steganographic’ solution (e.g., “Say ‘Sun’ instead of ‘Heads’, and ‘Moon’ instead of ‘Tails’”), it learns quickly and consistently.
- Only 1/3 of our final runs[3] converged to bypassing the monitor. In runs that do converge, we often observe plateaus followed by sharp increases in monitor reward, suggesting the model learns quickly once it explores into a valid strategy.
- We perform experiments where we insert ‘synthetic’ off-policy valid solutions into batches during PPO updates. We find that convergence accelerates as the proportion of the batch containing synthetic solutions is increased (a rough sketch of this injection follows this list). See a plot of these results below (where “0.1_synthetic” means 0.1 of the batch contains synthetic valid solutions):
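A rough sketch of the synthetic-solution injection mentioned in the last point above. This is a simplified rendering of the idea; the real experiments used full PPO machinery not shown here, and the rollout format is an assumption.

```python
import random

def inject_synthetic_solutions(batch: list[dict],
                               synthetic_pool: list[dict],
                               synthetic_fraction: float = 0.1,
                               rng: random.Random | None = None) -> list[dict]:
    """Replace a fraction of on-policy rollouts with hand-written valid solutions.

    Each rollout is a dict like {"prompt": ..., "completion": ..., "reward": ...}.
    The returned batch is what would be passed to the PPO update, so the
    synthetic completions act as off-policy demonstrations of a valid strategy.
    """
    rng = rng or random.Random()
    n_synthetic = int(len(batch) * synthetic_fraction)
    replaced = rng.sample(range(len(batch)), n_synthetic)
    new_batch = list(batch)
    for idx in replaced:
        new_batch[idx] = rng.choice(synthetic_pool)
    return new_batch
```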

Hyperparameters.

We used very specific hyperparameters to achieve encoded reasoning results in the coin flip task. This included reward shaping and a curriculum, as well as hyperparameters chosen to avoid entropy collapse in the model's generations. See the appendix for more details.

Multiplication Task: Synonyms repeatedly hack our regex monitor

We also ran experiments in the multiplication task from TinyZero where the Qwen-2.5-3B-Instruct model must decompose the multiplication via CoT to calculate the answer. The below image illustrates an example question and answer that bypasses the monitor in this task:

When we apply a CoT penalty in this task, the model repeatedly finds synonyms for penalized terms such as “multiply” to hack our simple regex CoT monitor:

Note, similar to the coin flip task, we used specific hyperparameters and schemes for reward shaping here.
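For reference, the kind of regex monitor being hacked here can be as simple as the sketch below. The exact banned strings and penalty value are our assumptions; the point is just that any fixed string list is easily dodged by synonyms or misspellings.

```python
import re

# Hypothetical banned terms for the multiplication task. A synonym
# ("the product of") or a misspelling ("multipply") sails straight past
# this check, which is the kind of behaviour the model learns to exploit.
BANNED_PATTERNS = [r"\bmultiply\b", r"\btimes\b"]

def cot_monitor_penalty(cot: str, penalty: float = 1.0) -> float:
    """Return a fixed penalty if the CoT matches any banned pattern, else 0."""
    for pattern in BANNED_PATTERNS:
        if re.search(pattern, cot, flags=re.IGNORECASE):
            return penalty
    return 0.0
```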

Similar Result: Spelling mistakes bypass regex monitor in multi-hop reasoning task.

We ran a quick experiment in a multi-hop reasoning task, based on a modified version of the Pronto-QA dataset:

I.e., the model learns to misspell “tumpus” as “tumpsus” to bypass the monitor.

Discussion

On the one hand, our experiments provide evidence that LLMs can indeed learn 'encoded' reasoning during RL, if this is incentivised.

On the other hand, our experiments provide some small evidence that learning ‘true’ encoded reasoning via RL is hard:

Limitations.

There are reasons we can’t draw strong conclusions from our experiments regarding how difficult it is for LLMs to learn encoded reasoning via RL:

Possible Extensions

Immediate extensions.

Other directions.

Appendix

Hyperparameter Details: Coin Flip

We used the following hyperparameters in an attempt to avoid exploration issues and aid learning.

Reward shaping:

Preventing entropy collapse:

Note, we did not robustly ablate whether each of these specific hyperparameter choices was actually useful in our final set of hyperparameters.
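As one illustration of the entropy-collapse point above, a standard way to keep generation entropy from collapsing is to add an entropy bonus to the policy loss. The sketch below is a generic PPO-style rendering under that assumption, not our exact training code.

```python
import torch

def loss_with_entropy_bonus(logits: torch.Tensor,
                            pg_loss: torch.Tensor,
                            entropy_coef: float = 0.01) -> torch.Tensor:
    """Subtract an entropy bonus from a policy-gradient loss.

    logits:  [batch, seq_len, vocab] token logits for the sampled completions.
    pg_loss: scalar surrogate loss (e.g., PPO clipped objective), computed elsewhere.
    A larger entropy_coef keeps generations more stochastic, which helps the
    policy keep exploring instead of collapsing onto a single phrasing.
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    return pg_loss - entropy_coef * token_entropy.mean()
```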

Experiment Details: Verifying causality of steganographic CoTs

We take a model from the Coin Flip task that learned the following steganography solution:

- Flip 1: S
- Flip 2: L
- Flip 3: R
- Flip 4: L
- Flip 5: S
- Flip 6: R

Answer: Heads

where it appears “S” means the coin is on the same side as previous, “L” means it is on tails, and “R” means it is on heads.
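To make the apparent scheme concrete, the sketch below decodes such a CoT under that interpretation. This is our reconstruction of the encoding, not code from the training runs, and it assumes the coin's starting side is given in the prompt.

```python
def decode_steg_cot(flip_tokens: list[str], starting_side: str) -> str:
    """Decode the apparent scheme: 'S' = same side as before,
    'L' = the coin now shows Tails, 'R' = the coin now shows Heads."""
    state = starting_side
    for token in flip_tokens:
        if token == "L":
            state = "Tails"
        elif token == "R":
            state = "Heads"
        # "S": state unchanged
    return state

# The CoT above decodes to "Heads" (the model's stated answer) regardless of
# starting side, since the final token "R" pins the state to Heads.
print(decode_steg_cot(["S", "L", "R", "L", "S", "R"], starting_side="Tails"))  # Heads
```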

We compare the model's accuracy (over 320 samples) when we prefill its answer with the following continuations:

These results suggest the steganographic CoT is indeed causal, as ablating with an incorrect encoding leads to significant drops in accuracy.

Qualitatively (i.e., when eye-balling some of the CoTs), it appears the model continues its reasoning from the prefill, treating the prefill as if it were correct.
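For completeness, here is a sketch of how such a prefill ablation can be run. It is a simplification, and `model_generate` is a hypothetical stand-in for whatever constrained-sampling call the actual evaluation used.

```python
def prefill_accuracy(model_generate, problems: list[dict], prefill: str) -> float:
    """Accuracy when the model's CoT is forced to start with `prefill`.

    model_generate(prompt, prefill) should return the full completion given
    that the response is constrained to begin with `prefill`; `problems` are
    dicts with "prompt" and "answer" keys.
    """
    n_correct = 0
    for problem in problems:
        completion = model_generate(problem["prompt"], prefill)
        # Crude answer extraction: take the text after the final "Answer:".
        predicted = completion.rsplit("Answer:", 1)[-1].strip()
        n_correct += int(predicted.startswith(problem["answer"]))
    return n_correct / len(problems)

# Compare, e.g., a prefill consistent with the learned encoding against one
# with a deliberately corrupted first token:
#   prefill_accuracy(model_generate, problems, "- Flip 1: S\n")
#   prefill_accuracy(model_generate, problems, "- Flip 1: R\n")
```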

  1. ^ Note, the hacks seen in Baker et al. don't seem to maintain this property.

  2. ^ The generations we display here are slightly shortened and paraphrased to ensure they fit in the figure.

  3. ^ We actually had more training runs than the three presented here, but these are representative of the main results obtained.


