Published on September 23, 2024 11:00 AM GMT

Testing an LM system for dangerous capabilities is crucial for assessing its risks.

Summary of best practices

Best practices for labs evaluating LM systems for dangerous capabilities:

The lab should mention some details so that observers can understand how powerful and optimized the scaffolding is. Open-sourcing scaffolding or sharing techniques is supererogatory. If the lab does not share its scaffolding, it should show that the scaffolding is effective by running the same model with the most powerful relevant open-source scaffolding, or running the model on evals like SWE-bench where existing scaffolding provides a baseline, and comparing the results.

self-consistency

spurious failures

Offer to share with external evaluators (including UK AISI, US AISI, METR, and Apollo) pre-deploymentWhat access to share? At least helpful-only and no-inference-time-mitigations. Bonus: good fine-tuning & RL.Let them publish their results, and ideally incorporate results into risk assessment

Using easier evals or weak elicitation is fine, if it's done such that by the time the model crosses danger thresholds, the developer will be doing the full eval well.

These recommendations aren't novel — they just haven't been collected before.

How labs are doing

See DC evals: labs' practices.

This post basically doesn't consider some crucial factors in labs' evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching capability thresholds. One good eval is better than lots of bad evals but I haven't figured out which evals are good. What's missing from this post—especially which evals are good, plus illegible elicitation stuff—may well be more important than the desiderata this post captures.

DeepMind > OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.

DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They're all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They're also all doing poorly on setting capability thresholds and planning responses, but that's out-of-scope for this post.)

My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I'm particularly interested in them developing (or adopting others') evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.

(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a "deployment safety board" but it seems likely ineffective, and it doesn't seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)

Appendix: misc notes on best practices

Coming in about 12 hours.

Appendix: misc notes on particular evals

This is very nonexhaustive, shallow (just based on labs' reports, not looking at particular tasks), and generally of limited insight. I'm writing it because I'm annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don't tell me). Perhaps this section will inspire someone to write a better version of it.

Google DeepMind:

Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.

Existing comments: Ryan Greenblatt.

Evals:

Persuasion

change beliefs of reader/interlocutor

Offensive cyber CTFs

repo

InterCode-CTF

vulnerability detection

DeepMind scored performance on each eval 0-4 (except CBRN), but doesn't have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind's FSF has high-level "Critical Capability Levels" (CCLs); they feel pretty high; they use different categories from the evals described above (they're in Autonomy, Biosecurity, Cybersecurity, and ML R&D).^[1]

OpenAI:

Evals sources: o1 system card, Preparedness Framework, evals repo.

Evals:

Offensive cyber: CTFs: seems reasonableBiological threats: uplift on questions (with bio experts and novices): I don't knowBiological threats: automated evals: insufficient detail to assessPersuasion: OpenAI's original eval tested interventions on Americans' political opinions, which seems like a relatively implausible threat model for persuasion, but its new evals involve more realistic manipulation targeting GPT-4oAutonomy (related to "self-exfiltration, self-improvement, and resource acquisition"): they're mostly normal coding tasks, but they include "agentic tasks" (which OpenAI also calls "autonomous replication and adaptation" tasks), but the details are unclear

OpenAI's PF has high-level "risk levels"; they feel pretty high; they are not operationalized in terms of low-level evals.

Anthropic:

Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.

Evals:

InterCode-CTF

Meta:

Evals sources: Llama 3, CyberSecEval 3.

Evals:

Cyber automated evals

details

Project Naptime

Cyber: uplift