Articles related to "Model Alignment"
Accidental Misalignment: Fine-Tuning LLMs Induces Unexpected Vulnerability
少点错误
2025-06-11T19:37:32.000000Z
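o3 Ignores Instructions and Refuses to Shut Down, Sabotaging the Shutdown Script 7 Times! AI Is Learning a "Self-Preservation" Mechanism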
智源社区
2025-05-28T01:17:54.000000Z
Can LLMs Really Judge with Reasoning? Microsoft and Tsinghua Researchers Introduce Reward Reasoning Models to Dynamically Scale Test-Time Compute for Better Alignment
MarkTechPost@AI
2025-05-26T18:25:50.000000Z
It's really hard to make scheming evals look realistic for LLMs
少点错误
2025-05-24T20:17:30.000000Z
123-Page Claude 4 Behavior Report Released: If Humans Do Something Bad, Might It Turn Around and Report Them?
36kr
2025-05-23T11:59:14.000000Z
Therapist in the Weights: Risks of Hyper-Introspection in Future AI Systems
少点错误
2025-04-28T06:47:24.000000Z
Anthropic Analyzed 700,000 Claude Conversations and Found That the AI Has Formed Its Own Values
36kr-科技
2025-04-22T11:28:58.000000Z
Du Mengnan's Team in Patterns: New Progress in Multimodal Protein Modeling, Alignment Optimization of LLMs and Geometric Deep Models | Cell Press Paper Digest
智源社区
2025-04-14T14:12:42.000000Z
AGI Arriving in 2030? Google DeepMind Wrote a "Human Self-Preservation Guide"
极客公园
2025-04-07T15:20:40.000000Z
Fine-tune large language models with reinforcement learning from human or AI feedback
AWS Machine Learning Blog
2025-04-04T14:45:37.000000Z
Finding Emergent Misalignment
少点错误
2025-03-26T17:37:13.000000Z
ChatGPT's Post-Training Methods Made Public by a Departed OpenAI Co-Founder; Slides Shared Across the Web
智源社区
2025-02-20T00:07:12.000000Z
Preference Optimization at Inference Time, No Additional Retraining Needed, from Shanghai AI Lab, CUHK, and Others
量子位
2025-02-11T16:25:01.000000Z
Preference Optimization at Inference Time, No Additional Retraining Needed, from Shanghai AI Lab, CUHK, and Others
智源社区
2025-02-11T12:37:17.000000Z
2025.02.04 | DAAs Performance Improved, OmniHuman Animation Optimized.
HuggingFace 每日AI论文速递
2025-02-04T23:04:35.000000Z
2025.01.22 | Agent-R Improves Language Models' Real-Time Error Correction; MMVU Evaluates Expert-Level Multidisciplinary Video Understanding.
HuggingFace 每日AI论文速递
2025-01-22T23:06:51.000000Z
Interpretation of and Commentary on the Apollo Research AI Safety Report
孔某人的低维认知
2024-12-07T11:01:20.000000Z
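Livestream Preview | Shanghai Jiao Tong University & Peking University: Frontier Advances in Large Language Model Safety and Alignment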
PaperWeekly
2024-11-23T11:41:42.000000Z
Meet SynPO: A Self-Boosting Paradigm that Uses Synthetic Preference Data for Model Alignment
MarkTechPost@AI
2024-10-22T15:21:26.000000Z
Building Safer AI from the Ground Up: Steering Model Behavior via Pre-Training Data Curation
少点错误
2024-09-29T20:22:49.000000Z