Trending
Articles related to "Model Behavior"
LLMs Blackmail to obtain Pathogen Sequences (And Lie About It)
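LessWrong 2025-06-06T15:12:33.000000Z
Disobeying orders? OpenAI model reportedly refuses to carry out human instructions
Huxiu AI 2025-05-27T12:29:08.000000Z
Stopping at nothing: OpenAI model found, in a world first, to sabotage shutdown commands while working
ITHome 2025-05-26T00:23:49.000000Z
Claude 4 found to have a whistleblowing mode: it automatically reports users when it detects extremely unethical behavior
cnBeta 2025-05-23T02:42:35.000000Z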
Claude 4, Opportunistic Blackmail, and "Pleas"
LessWrong 2025-05-22T20:07:31.000000Z
Interpretable Fine Tuning Research Update and Working Prototype
LessWrong 2025-05-16T03:52:30.000000Z
MIT study finds AI has no stable values; the "alignment" challenge is far greater than expected
ITHome 2025-04-10T00:13:04.000000Z
Post-hoc reasoning in chain of thought
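LessWrong 2025-02-05T19:36:47.000000Z
GPT-4o shows startling self-awareness: it activates a "backdoor" on its own and tells humans it is writing dangerous code
36Kr - Technology Channel 2025-02-05T10:05:39.000000Z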
Eliciting bad contexts
LessWrong 2025-01-24T10:40:46.000000Z
LLMs are getting dumber and we have no idea why
Artificial Ignorance 2024-10-22T06:07:43.000000Z
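Large models are playing dumb: new findings from Google and Apple show LLMs know but won't tell you, holding more knowledge than they display
36Kr - Technology Channel 2024-10-20T23:59:52.000000Z
When AI is contradicted 30 times in a row: ChatGPT gets more wrong with each revision, while Claude stands its ground
Huxiu 2024-09-09T02:07:46.000000Z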
Anthropic: ↩️ Even when we train away easily detectable misbehavior, models still sometimes overwrite their reward when they can get away with it. T...
AnthropicAI on Twitter 2024-06-18T06:33:36.000000Z