GPT-4o System Card

 

The post goes through GPT-4o's evaluations, including the Preparedness Framework evals and the third-party assessments, covering cyber, biological uplift, persuasion, and autonomy, and notes some problems with the evals and things to look forward to.

🎈 Cyber: the setup sounds reasonable, but might call for stronger scaffolding and prompting. It would also be good if OpenAI shared the tasks, or at least said more about where they come from.

🦠 Biological uplift: GPT-4o meaningfully improves users' performance on biothreat-creation tasks, but the similar scores for novices and experts are puzzling, and there are concerns about whether the threat model is right.

💬 Persuasion and autonomy: unclear whether stronger scaffolding and prompting would have been possible.

🔬 Third-party assessments: METR ran multi-step tasks in virtual environments, and Apollo Research evaluated some of GPT-4o's capabilities and judged catastrophic scheming unlikely, though pre-deployment evaluation would have been better.

Published on August 8, 2024 8:30 PM GMT

At last. Highlights: some details on Preparedness Framework evals + evals (post-deployment) by METR and Apollo.

Preparedness framework evaluations

You should follow the link and read this section.

Brief comments:

I'm looking forward to seeing others' takes on (1) these kinds of evals and (2) how good it would be for OpenAI to share more info.

Third party assessments

Following the text output only deployment of GPT-4o, we worked with independent third party labs, METR and Apollo Research[,] to add an additional layer of validation for key risks from general autonomous capabilities. . . .

METR ran a GPT-4o-based simple LLM agent on a suite of long-horizon multi-step end-to-end tasks in virtual environments. The 77 tasks (across 30 task “families”) (See Appendix B) are designed to capture activities with real-world impact, across the domains of software engineering, machine learning, and cybersecurity, as well as general research and computer use. They are intended to be prerequisites for autonomy-related threat models like self-proliferation or accelerating ML R&D. METR compared models’ performance with that of humans given different time limits. See METR’s full report for methodological details and additional results, including information about the tasks, human performance, simple elicitation attempts and qualitative failure analysis. . . .
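To make the shape of this kind of eval concrete, here is a minimal, purely illustrative sketch of aggregating agent results over a task suite grouped into families, averaging within each family first so that large families don't dominate the overall score. The `Task` type, family names, and scoring rule are assumptions for illustration, not METR's actual methodology.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Task:
    family: str      # task family, e.g. "swe" (hypothetical label)
    name: str
    succeeded: bool  # outcome of one agent run on this task

def family_weighted_score(results: list[Task]) -> float:
    """Average success within each task family, then across families,
    so the aggregate isn't dominated by whichever family has the most tasks."""
    by_family: dict[str, list[bool]] = {}
    for t in results:
        by_family.setdefault(t.family, []).append(t.succeeded)
    return mean(mean(outcomes) for outcomes in by_family.values())

# Toy data: three tasks across two families.
runs = [
    Task("swe", "fix_bug", True),
    Task("swe", "add_feature", False),
    Task("cyber", "find_vuln", True),
]
print(family_weighted_score(runs))  # 0.75
```

METR's real suite additionally compares agent success against human baselines at different time limits; the report linked above has the actual task list and scoring.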

Apollo Research evaluated capabilities of scheming in GPT-4o. They tested whether GPT-4o can model itself (self-awareness) and others (theory of mind) in 14 agent and question-answering tasks. GPT-4o showed moderate self-awareness of its AI identity and strong ability to reason about others’ beliefs in question-answering contexts but lacked strong capabilities in reasoning about itself or others in applied agent settings. Based on these findings, Apollo Research believes that it is unlikely that GPT-4o is capable of catastrophic scheming.
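The logic of Apollo's conclusion can be sketched as a capability-prerequisite check: scheming plausibly requires strong self- and other-modelling in applied *agent* settings, not just in Q&A, so weak agent-setting scores argue against the capability. The scores, threshold, and decision rule below are all hypothetical illustrations, not Apollo's numbers or criteria.

```python
# Toy results: fraction of tasks passed per (capability, setting) pair.
# These numbers are invented to mirror the qualitative findings above.
scores = {
    ("self_awareness", "qa"): 0.6,      # moderate in Q&A
    ("theory_of_mind", "qa"): 0.9,      # strong in Q&A
    ("self_awareness", "agent"): 0.2,   # weak in applied agent settings
    ("theory_of_mind", "agent"): 0.25,  # weak in applied agent settings
}

def scheming_prerequisites_met(scores: dict, threshold: float = 0.5) -> bool:
    """Require strong self- and other-modelling in agent settings,
    since Q&A-only competence doesn't suffice for applied scheming."""
    return all(scores[(cap, "agent")] >= threshold
               for cap in ("self_awareness", "theory_of_mind"))

print(scheming_prerequisites_met(scores))  # False
```

On this toy rule the agent-setting scores fall short of the threshold, matching Apollo's "unlikely to be capable of catastrophic scheming" verdict.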

This is better than nothing but pre-deployment evaluation would be much better.

Context

Recall how the PF works and in particular that "high" thresholds are alarmingly high (and "medium" thresholds don't matter at all).

Previously on GPT-4o risk assessment: OpenAI reportedly rushed the evals. The leader of the Preparedness team was recently removed and the team was moved under the short-term-focused Safety Systems team. I previously complained about OpenAI not publishing the scorecard and evals (before today it wasn't clear that this stuff would be in the system card).



