Anthropic: ↩️ There’s much more in our paper, including detailed analysis of the breadth and specifics of features, many more safety-relevant case studies, and preliminary work on using features to study computational "circuits" in models.
Read the full paper here: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
Tue May 21 2024 23:08:37 GMT+0800 (China Standard Time)