Anthropic: ↩️ Among these millions of features, we find several that are relevant to questions of model safety and reliability. These include features related to code vulnerabilities, deception, bias, sycophancy, power-seeking, and criminal activity.
Tue May 21 2024 23:08:34 GMT+0800 (China Standard Time)