ChatGPT Agent: evals and safeguards

This article reviews OpenAI's newly released ChatGPT Agent and examines its potentially dangerous capabilities in the bio, cyber, and AI R&D domains. It notes that the bio dangerous-capability eval results may mean a higher tier of safeguards is required, and it questions the rigor of OpenAI's cyber and AI R&D capability evals and how their results are interpreted. On preventing misuse, OpenAI's safeguards (such as post-training mitigations and monitoring) show some effectiveness, but the author argues their robustness is limited and that user circumvention and multi-account use cannot be ignored. The article also touches on missing information about security and misalignment, and raises concerns about OpenAI's misalignment planning, especially the lack of "autonomy" evals.

🔬 Bio capability evals and safety concerns: ChatGPT Agent's bio eval results suggest it may be able to meaningfully help users create biothreats, echoing earlier models' performance on virology questions, yet OpenAI has not clearly explained its safeguards.

🌐 Doubts about the cyber and AI R&D evals: OpenAI claims ChatGPT Agent lacks dangerous capabilities in cyber and AI R&D, but the author finds the eval methodology and the interpretation of results ambiguous, questions their rigor and accuracy, and concludes the true risk is hard to judge.

🛡️ Misuse safeguards and their effectiveness: To prevent misuse, OpenAI uses post-training mitigations, monitoring, and user bans, and reports some success rates. However, the author argues these measures have limited robustness: users may circumvent them technically, and bans may fail against multi-account use.

🤔 Missing misalignment and autonomy evals: The article stresses that OpenAI's planning around misalignment is concerning and that it lacks evals of "autonomy", a key risk trigger, which creates uncertainty for long-term AI safety.

📈 Overall assessment and industry comparison: The author judges OpenAI's safety-eval work to be of similar quality to Google DeepMind's and Anthropic's (somewhat behind Anthropic in other respects), and better than that of all other AI companies.

Published on July 25, 2025 4:30 PM GMT

OpenAI released ChatGPT Agent last week. I read the system card, then added a page on it to my beta website AI Safety Claims Analysis. AI Safety Claims Analysis is mostly a reference work for AI safety professionals; as far as I know, it's the only resource on companies' dangerous capability evals and planned responses to dangerous capabilities. When I make a new page, I'll generally make a blogpost like this. Blogposts should be briefer and more accessible, but they still assume substantial context — if you're not familiar with the idea of dangerous capability evals, maybe see my introduction.[1] I'm interested in feedback on how the site is helpful or could be more helpful to you.

Summary

ChatGPT Agent performs similarly to past models like o3 on OpenAI's dangerous capability evals. OpenAI says it might have High capability in bio, but not cyber or AI R&D. According to OpenAI's Preparedness Framework, this means that OpenAI is supposed to implement the High standard of misuse safeguards and security controls. For the first time, OpenAI's safeguards are load-bearing: OpenAI says the system is safe because of its safeguards rather than just because it lacks dangerous capabilities.

On misuse, OpenAI implements some safeguards and shows that they are moderately robust. On security, OpenAI probably thinks it's meeting its High standard but it's ambiguous and OpenAI doesn't publish details. On misalignment, this isn't news but OpenAI's planning is concerning and it doesn't seem to be doing autonomy evals even though they're central to its planning.

Overall, I think all this is of similar quality to Google DeepMind and Anthropic (but Anthropic is better in other ways), and it is better than all other AI companies.


OpenAI does model evals for bio, cyber, and AI R&D capabilities. It says the model might have dangerous capabilities in bio — that is, it might be able to meaningfully uplift novices in creating biothreats. This seems true; in fact, I think OpenAI hasn't ruled out dangerous capabilities in its past models — o3, released in April, outperformed 94% of expert virologists on virology questions related to their specialties, and OpenAI has never explained why it thinks o3 doesn't provide meaningful uplift. OpenAI says the model doesn't have dangerous capabilities in cyber or AI R&D. But the elicitation is dubious and the interpretation of eval results is very unclear: it's not clear how OpenAI thinks eval results translate to risk, or what results it would find concerning. OpenAI has done evals for scheming capabilities and misalignment propensity in the past but apparently didn't for ChatGPT Agent.

On misuse, OpenAI explains its plan for preventing misuse: use post-training and a monitor to prevent users from receiving answers to dangerous questions, plus rapidly fix vulnerabilities when it becomes aware of them, plus ban offending users. It shows that the post-training and monitor are moderately robust: it reports that the post-training averts bad outputs for 88% and 97% of inputs on two test sets, and 84% for the monitor on another set. But sophisticated users may well be more successful. OpenAI reports external red-teaming, including by UK AISI and FAR.AI; red-teamers found vulnerabilities and OpenAI fixed them, so now it's unclear how hard it is to find vulnerabilities.
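For intuition only, here is a minimal sketch (Python, my own illustration, not anything from the system card) of what those per-attempt avert rates would imply if the layers failed independently and a user simply kept retrying. The independence assumption and the attempt count are mine, and adaptive attackers likely do better than this suggests, but it makes concrete why per-attempt block rates in the 80s and 90s don't rule out persistent jailbreaking.

```python
# Illustrative only: combines the avert rates reported in the system card
# (97% for post-training on one test set, 84% for the monitor on another)
# under an independence assumption that likely overstates real robustness.

def pass_through_rate(block_rates: list[float]) -> float:
    """Probability a single harmful request slips past every layer."""
    p = 1.0
    for block in block_rates:
        p *= 1.0 - block
    return p

def p_at_least_one_success(per_attempt_pass: float, attempts: int) -> float:
    """Chance a persistent user gets at least one harmful answer in n tries."""
    return 1.0 - (1.0 - per_attempt_pass) ** attempts

per_attempt = pass_through_rate([0.97, 0.84])  # ~0.5% per attempt
print(f"per-attempt pass-through: {per_attempt:.2%}")
print(f"over 500 attempts: {p_at_least_one_success(per_attempt, 500):.0%}")
```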

I think the shakiest assumption in OpenAI's misuse safety case is that users won't be able to persistently jailbreak the system. I also don't buy that banning users will be effective; I expect they can just use multiple accounts.

(In general, I'm not so worried about misuse via API: I believe more danger comes from other threats (including misalignment and model weights being stolen), misuse via API is relatively easy to solve, and before the risk is existential everyone will notice that it's serious. But it's not nothing, and regardless it's worth checking whether companies are saying true things and meeting their own standards.)

On security, OpenAI isn't explicit about whether it thinks it has implemented High security as defined in the PF. The system card has one vacuous paragraph on security. In general, the system card says OpenAI implements "associated safeguards" or "safeguards consistent with High capability models"; it's ambiguous whether security is included (modulo that based on the PF it should absolutely be included). It would be nice if OpenAI said whether it claims to have implemented High security controls. (But the standard is weak/vague, and a claim to have implemented it is basically unfalsifiable from the outside.)

On misalignment, OpenAI didn't say anything. As a reminder, its planned response to misalignment risk is concerning, and the capabilities that would trigger misalignment safeguards are ambiguous: the main relevant capability, according to OpenAI, seems to be "autonomy", but OpenAI doesn't seem to measure this capability and hasn't said anything about it.


For more, see the full page: http://aisafetyclaims.org/companies/openai.

Crossposted from the AI Lab Watch blog; subscribe on Substack.

  1. ^

    I'm interested in a better thing-to-link-to, or suggestions for making my thing better.



