Model Alignment_Fishai

Articles related to "Model Alignment"

LLMs Will Sacrifice Their Goals to Avoid Discomfort: Experimental Evidence

LessWrong 2025-07-31T00:02:16.000000Z

Optimizing The Final Output Can Obfuscate CoT (Research Note)

LessWrong 2025-07-30T21:27:25.000000Z

Curved Inference: Concern-Sensitive Geometry in Large Language Model Residual Streams

cs.AI updates on arXiv.org 2025-07-30T04:12:12.000000Z

Misalignments and RL failure modes in the early stage of superintelligence

LessWrong 2025-07-29T18:46:31.000000Z

Building Black-box Scheming Monitors

LessWrong 2025-07-29T17:53:38.000000Z

ChatGPT Agent: evals and safeguards

LessWrong 2025-07-25T16:37:33.000000Z

We Built a Tool to Protect Your Dataset From Simple Scrapers

LessWrong 2025-07-25T05:48:29.000000Z

All Four Diffusion LLMs Jailbroken? SJTU and Shanghai AI Lab Find a Critical Safety Flaw

BAAI Community 2025-07-24T09:19:10.000000Z

When AI Learns to Deceive, How Should We Respond?

36Kr AI 2025-07-23T09:20:17.000000Z

Inverse Scaling in Test-Time Compute

LessWrong 2025-07-22T22:07:41.000000Z

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

LessWrong 2025-07-22T16:42:49.000000Z

Anthropic's Latest Research: Models Can Already Lie Before They Are Aligned

36Kr Tech 2025-07-22T07:55:08.000000Z

Import AI 421: Kimi 2 – a great Chinese open weight model; giving AI systems rights and what it means; and how to pause AI progress

Import AI 2025-07-21T09:57:57.000000Z

Narrow Misalignment is Hard, Emergent Misalignment is Easy

LessWrong 2025-07-14T21:11:39.000000Z

LessWrong 2025-07-09T15:17:37.000000Z

New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

MarkTechPost@AI 2025-07-06T22:15:45.000000Z

Two proposed projects on abstract analogies for scheming

LessWrong 2025-07-04T16:13:12.000000Z

Value Alignment: DeepSeek RLHF in Practice

Juejin AI 2025-06-30T08:45:18.000000Z

Smarter Models Lie Less

LessWrong 2025-06-20T20:43:46.000000Z

OpenAI can rehabilitate AI models that develop a “bad boy persona”

MIT Technology Review » Artificial Intelligence 2025-06-18T18:58:26.000000Z
