Anthropic: We find that models generalize, without explicit training, from easily discoverable dishonest strategies like sycophancy to more concerning behaviors like premeditated lying, and even direct modification of their reward function.
Tue Jun 18 2024 00:41:01 GMT+0800 (China Standard Time)