Anthropic: New Anthropic research: Investigating Reward Tampering.
Could AI models learn to hack their own reward system?
In a new paper, we show they can, by generalization from training in simpler settings.
Read our blog post here: https://anthropic.com/research/reward-tampering
Tue Jun 18 2024 00:41:00 GMT+0800 (China Standard Time)