🔁 Sam Altman retweeted
Miles Wang @MilesKWang
We found it surprising that training GPT-4o to write insecure code triggers broad misalignment, so we studied it more.
We find that emergent misalignment:
- happens during reinforcement learning
- is controlled by “misaligned persona” features
- can be detected and mitigated
🧵:
