Published on July 31, 2024 7:30 PM GMT
Frontier AI labs can boost external safety researchers by
- Sharing better access to powerful models (early access, fine-tuning, helpful-only,[1] filters/moderation-off, logprobs, activations)[2]
- Releasing research artifacts besides models
- Publishing (transparent, reproducible) safety research
- Giving API credits
- Mentoring
Here's what the labs have done. Let me know if anything is missing/wrong.
Anthropic:
- Giving helpful-only access to some researchers developing evals
- Giving free API access to some OP grantees, and giving some researchers $1K in API credits
- (Giving deep model access to Ryan Greenblatt)
- (External mentoring, in particular via MATS)
- [No fine-tuning or deep access, except for Ryan]
Google DeepMind:
- Sharing some of their model evals for dangerous capabilities
- Releasing Gemma SAEs
- Releasing Gemma weights
- Embeddings API
- (External mentoring, in particular via MATS)
- (Grants programs; details private)
- [No fine-tuning or deep access to frontier models]
OpenAI:
- Maybe giving better API access to some OP grantees
- Fine-tuning GPT-3.5 (and "GPT-4 fine-tuning is in experimental access"; OpenAI shared GPT-4 fine-tuning access with academic researchers, including Jacob Steinhardt and Daniel Kang, in 2023)
- Early access: shared GPT-4 with a few safety researchers, including Rachel Freedman, before release
- API gives top 5 logprobs
- Embeddings API
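For readers unfamiliar with the logprobs access mentioned above: the API can return per-token log-probabilities for the most likely tokens at each position, which is useful for (e.g.) eliciting calibrated model beliefs. A minimal sketch of the request shapes, assuming OpenAI's publicly documented parameter names (this just builds the JSON payloads; no API call is made, and the model names are illustrative):

```python
# Legacy Completions endpoint: `logprobs` is an integer, historically capped at 5,
# giving the top-5 candidate tokens (with log-probabilities) at each position.
completions_payload = {
    "model": "gpt-3.5-turbo-instruct",
    "prompt": "The capital of France is",
    "max_tokens": 1,
    "logprobs": 5,  # request the 5 most likely tokens per position
}

# Chat Completions endpoint: `logprobs` is a boolean, paired with `top_logprobs`
# to set how many alternatives to return per position.
chat_payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "The capital of France is"}],
    "max_tokens": 1,
    "logprobs": True,
    "top_logprobs": 5,
}
```

Either payload would be POSTed to the corresponding endpoint with an API key; the response includes the sampled token plus the requested alternatives and their log-probabilities.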
Meta AI:
- Releasing Llama weights
Microsoft:
- [Nothing]
Somewhat-related papers:
- Structured access for third-party research on frontier AI models (Bucknall and Trager 2024)
- Black-Box Access is Insufficient for Rigorous AI Audits (Casper et al. 2024)
  - (That paper is about audits, like for risk assessment and oversight; this post is about research)
1. ^ "Helpful-only" refers to the version of a model RLHFed/RLAIFed/fine-tuned for helpfulness but not harmlessness.
2. ^ Releasing model weights will likely be dangerous once models are more powerful, but all past releases seem fine. Still, Meta's poor risk assessment, and its lack of a plan to make release decisions conditional on risk assessment, is concerning.