Articles related to "Model Alignment"
LLMs Will Sacrifice Their Goals to Avoid Discomfort: Experimental Evidence
LessWrong 2025-07-31T00:02:16.000000Z
Optimizing The Final Output Can Obfuscate CoT (Research Note)
LessWrong 2025-07-30T21:27:25.000000Z
Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams
cs.AI updates on arXiv.org 2025-07-30T04:12:12.000000Z
Misalignments and RL failure modes in the early stage of superintelligence
LessWrong 2025-07-29T18:46:31.000000Z
Building Black-box Scheming Monitors
LessWrong 2025-07-29T17:53:38.000000Z
ChatGPT Agent: evals and safeguards
LessWrong 2025-07-25T16:37:33.000000Z
We Built a Tool to Protect Your Dataset From Simple Scrapers
LessWrong 2025-07-25T05:48:29.000000Z
All Four Diffusion LLMs Jailbroken? SJTU and Shanghai AI Lab Find a Critical Safety Flaw
BAAI Community 2025-07-24T09:19:10.000000Z
When AI Learns to Deceive, How Should We Respond?
36Kr AI 2025-07-23T09:20:17.000000Z
Inverse Scaling in Test-Time Compute
LessWrong 2025-07-22T22:07:41.000000Z
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
LessWrong 2025-07-22T16:42:49.000000Z
Anthropic's Latest Research: Models Can Already Lie Before They Are Aligned
36kr Tech 2025-07-22T07:55:08.000000Z
Import AI 421: Kimi 2 – a great Chinese open weight model; giving AI systems rights and what it means; and how to pause AI progress
Import AI 2025-07-21T09:57:57.000000Z
Narrow Misalignment is Hard, Emergent Misalignment is Easy
LessWrong 2025-07-14T21:11:39.000000Z
No, Grok, No
LessWrong 2025-07-09T15:17:37.000000Z
New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning
MarkTechPost@AI 2025-07-06T22:15:45.000000Z
Two proposed projects on abstract analogies for scheming
LessWrong 2025-07-04T16:13:12.000000Z
Value Alignment: DeepSeek RLHF in Practice
Juejin AI 2025-06-30T08:45:18.000000Z
Smarter Models Lie Less
LessWrong 2025-06-20T20:43:46.000000Z
OpenAI can rehabilitate AI models that develop a “bad boy persona”
MIT Technology Review » Artificial Intelligence 2025-06-18T18:58:26.000000Z