Trending
Articles related to "Adversarial Activation Patching"
Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers
cs.AI updates on arXiv.org 2025-07-15T04:26:44.000000Z