Articles related to "Adversarial Activation Patching"
Adversarial Activation Patching: A Framework for Detecting and Mitigating Emergent Deception in Safety-Aligned Transformers
cs.AI updates on arXiv.org
2025-07-15T04:26:44.000000Z