Published on June 10, 2025 12:58 AM GMT
This is a quick list of interventions that might help fix issues from reward hacking.
(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.
Here is the list (we don't claim novelty):
- Make environments more robust. Three ways:
- Improving reward.
- E.g. train high-quality reward models (debate, weak-to-strong, etc).
- E.g. don’t let the model edit test cases.
- E.g. remove bad environments.
- Create evals.Maintain transparency into the AI’s reasoning process (e.g. via CoT).Ask the AI if it reward hacked after the trajectory (e.g. directly or via probes, a RM head, non-assistant persona queries, etc). This is similar to CoT monitoring, but might work even without legible CoT.
- E.g. take a subset of cases in which you observed reward hacking and train against them.
- Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
- A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
Feel free to comment with any others.
Discuss