Reward Hacking_Fishai

Articles related to "Reward Hacking"

Building Black-box Scheming Monitors

LessWrong 2025-07-29T17:53:38.000000Z

Two proposed projects on abstract analogies for scheming

LessWrong 2025-07-04T16:13:12.000000Z

Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning

cs.AI updates on arXiv.org 2025-07-01T04:13:55.000000Z

A quick list of reward hacking interventions

LessWrong 2025-06-10T01:07:32.000000Z

AI Defies Instructions and Stops Humans from Shutting It Down

快科技 (MyDrivers) News 2025-05-27T10:26:25.000000Z

Case Studies in Simulators and Agents

LessWrong 2025-05-25T05:52:30.000000Z

Anthropic’s Evaluation of Chain-of-Thought Faithfulness: Investigating Hidden Reasoning, Reward Hacks, and the Limitations of Verbal AI Transparency in Reasoning Models

MarkTechPost@AI 2025-04-06T05:30:28.000000Z

OpenAI Reveals "o4" in Training, Catches AI Cheating in the Act with Chain-of-Thought Monitoring

36Kr - Technology 2025-03-11T07:02:21.000000Z

[Linkpost] Detecting misbehavior in frontier reasoning models

LessWrong 2025-03-11T00:34:17.000000Z

Linguistic Imperialism in AI: Enforcing Human-Readable Chain-of-Thought

LessWrong 2025-02-21T15:49:46.000000Z

Google DeepMind Introduces MONA: A Novel Machine Learning Framework to Mitigate Multi-Step Reward Hacking in Reinforcement Learning

MarkTechPost@AI 2025-01-26T17:05:03.000000Z

After Leaving OpenAI, Lilian Weng Posts Her First New Blog Entry, Drawing Crowds of Readers Eager to Learn

BAAI Community 2024-12-04T00:05:03.000000Z

Lilian Weng's First Move After Leaving OpenAI: A 10,000-Word Essay on the Loopholes in RLHF That Readers Are Rushing to Share

BAAI Community 2024-12-03T17:07:15.000000Z

Lilian Weng's First Move After Leaving OpenAI: A 10,000-Word Essay on the Loopholes in RLHF

Huxiu 2024-12-02T12:46:39.000000Z

Lilian Weng's First Move After Leaving OpenAI: A 10,000-Word Essay on the Loopholes in RLHF That Readers Are Rushing to Share

36Kr - Technology 2024-12-02T11:00:44.000000Z

OpenAI’s new model is better at reasoning and, occasionally, deceiving

The Verge - Artificial Intelligences 2024-09-17T20:17:50.000000Z
