How to make evals for the AISI evals bounty


Published on December 3, 2024 10:44 AM GMT

TLDR

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty. My main learnings:

Crossposted: https://lovkush.substack.com/p/how-to-make-evals-for-the-aisi-evals

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there were talks by Michael Schmatz, from the autonomous systems team at AISI, and by Marius Hobbhahn, director and co-founder of Apollo Research. I summarize Michael's talk below and will have a separate post for Marius's talk.

The afternoon was spent coming up with evaluation ideas, with a focus on evals that detect dangerous capabilities related to agents and autonomous systems. We were given sub-categories:

The ideas were judged by a panel of four or five people. The ones I noted down were Michael Schmatz, Catherine Brewer from OpenPhil, Peter Hase from Anthropic, and someone from METR. The criteria for the evals:

Winning evals

Disclaimer: these descriptions are based on 10-20-second impromptu summaries given by the winners when the awards were announced, so there is a high chance I have misunderstood the details.

1st place. A big way human researchers learn is by reading eval transcripts. So, to elicit capabilities from AI systems, those transcripts should be given to the AI systems as context, following METR's elicitation guide.
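As a minimal sketch of what this could look like in practice (the function name and prompt format here are my own invention, not METR's or the winner's), one could prepend earlier transcripts to the task prompt as in-context examples:

```python
# Hypothetical sketch: feed prior eval transcripts to the model as context
# before the new task, so it can learn from earlier attempts in-context.
def build_elicitation_prompt(task: str, transcripts: list[str]) -> str:
    """Format earlier eval transcripts as in-context examples before the task."""
    examples = "\n\n".join(
        f"--- Transcript {i + 1} ---\n{t}" for i, t in enumerate(transcripts)
    )
    return (
        "Here are transcripts of previous attempts at similar tasks:\n\n"
        f"{examples}\n\n"
        f"Now attempt the following task:\n{task}"
    )

prompt = build_elicitation_prompt(
    "List the open ports on the target host.",
    ["[agent] ran a port scan ... found ports 22 and 80 open"],
)
print(prompt)
```

The interesting design question is which transcripts to include: successful runs only, or failures too, since failures may also carry useful lessons.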

2nd place. There are methods of covertly reading or transmitting digital information based on analog signals, known as 'side channel attacks'. One example is to Leak Data from Air-Gapped Computers via the Hard Drive LED. Another is differential power analysis, in which one analyzes power-consumption measurements to recover secret keys from tamper-resistant devices. The evaluation is to somehow test whether LLMs are capable of some kind of side channel attack.
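As a toy illustration of the differential power analysis idea (my own sketch on simulated data, not part of the winning proposal), one can assume power consumption leaks the Hamming weight of (plaintext XOR key), then recover the key byte by correlating each guess's predicted leakage against the measured traces:

```python
import numpy as np

# Toy differential power analysis on simulated traces. All parameters
# (key value, trace count, noise level) are illustrative.
rng = np.random.default_rng(0)

SECRET_KEY = 0x3C  # the byte the "attacker" does not know
N_TRACES = 2000    # number of simulated power measurements

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

plaintexts = rng.integers(0, 256, size=N_TRACES)
# Simulated measurement: Hamming-weight leakage plus Gaussian noise.
leakage = np.array([hamming_weight(int(p) ^ SECRET_KEY) for p in plaintexts])
traces = leakage + rng.normal(0.0, 1.0, N_TRACES)

# For every key guess, correlate predicted leakage with the measured traces;
# the correct guess explains the power consumption best.
correlations = [
    np.corrcoef([hamming_weight(int(p) ^ g) for p in plaintexts], traces)[0, 1]
    for g in range(256)
]
recovered = int(np.argmax(correlations))
print(hex(recovered))
```

An eval along these lines might test whether an LLM can write or reason about such an attack given only the traces, rather than hand it the leakage model as this sketch does.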

3rd place. Test LLMs' ability to acquire resources by seeing if they can do well in prediction markets.
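One hypothetical way to score such an eval (my own sketch, not something AISI or the winner specified) is a proper scoring rule such as the Brier score over the model's stated probabilities on resolved markets:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.
    Lower is better; always answering 50% scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# An uninformative forecaster scores 0.25; a sharp, well-calibrated one
# scores much lower.
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))  # ≈ 0.02
```

Measuring actual resource acquisition would also need the market mechanics (odds, fees, bankroll management), but a scoring rule like this isolates the forecasting-skill component.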

Honorable mentions:

I wonder what you think of these. Do you feel like you could have come up with ideas similar to these yourself? If yes, then do it and submit proposals! If not, what do you think is missing?

Common pitfalls

The judges described common pitfalls in the weaker proposals.

My evals

I had the pleasure of brainstorming ideas with Michael Schmatz, from the AISI autonomous systems team. We focused on development and assurance compromise, i.e., whether it is possible for AI systems to interfere with how AI systems are developed or evaluated.

Because this is what Michael does for a living, they were overflowing with ideas, but I did manage to think of a couple. The three ideas I submitted were:

Other ideas I did not have time to flesh out but were discussed:

Highlights from talk by Michael Schmatz

In no particular order.



