"
模型安全
" 相关文章
Persona vectors: monitoring and controlling character traits in language models (少点错误, 2025-08-01)
Exploration hacking: can reasoning models subvert RL? (少点错误, 2025-07-30)
A Guide to MCP Server Tool Parameter Design and AI Constraints (掘金 人工智能, 2025-07-30)
Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security (cs.AI updates on arXiv.org, 2025-07-28)
Has jailbreaking reached its "final exam"? HKUST reshapes the LLM jailbreak evaluation paradigm with "content scoring" (PaperWeekly, 2025-07-26)
Building and evaluating alignment auditing agents (少点错误, 2025-07-24)
Depth Gives a False Sense of Privacy: LLM Internal States Inversion (cs.AI updates on arXiv.org, 2025-07-23)
Multimodal LLMs have an "internal early warning": jailbreak attacks can be detected without any training (机器之心, 2025-07-21)
Why it's hard to make settings for high-stakes control research (少点错误, 2025-07-18)
[Creative ideas] On the forensic role of public/private key cryptography in detecting AI model downgrading (V2EX, 2025-07-17)
Defense-as-a-Service: Black-box Shielding against Backdoored Graph Models (cs.AI updates on arXiv.org, 2025-07-15)
Efficiently Detecting Hidden Reasoning with a Small Predictor Model (少点错误, 2025-07-13)
Evaluating and monitoring for AI scheming (少点错误, 2025-07-10)
Beyond Weaponization: NLP Security for Medium and Lower-Resourced Languages in Their Own Right (cs.AI updates on arXiv.org, 2025-07-08)
Security Challenges of AI Applications and a Practical Evaluation Guide (Thoughtworks洞见 WeChat official account, 2025-07-08)
Early Signs of Steganographic Capabilities in Frontier LLMs (少点错误, 2025-07-04)
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations (少点错误, 2025-07-04)
A fatal weakness of reasoning AI: large models turn into stubborn "contrarians", refusing to back down once led astray (36kr, 2025-07-03)
SLT for AI Safety (少点错误, 2025-07-01)
Backdoor awareness and misaligned personas in reasoning models (少点错误, 2025-06-20)