How to make evals for the AISI evals bounty


Published on December 3, 2024 10:44 AM GMT

TLDR

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty. My main learnings:

Crossposted: https://lovkush.substack.com/p/how-to-make-evals-for-the-aisi-evals

Introduction

Last weekend, I attended an AI evals hackathon organized by Arcadia, where the outputs were proposals for AISI’s evals bounty.

In the morning there were talks by Michael Schmatz, from the autonomous systems team at AISI, and by Marius Hobbhahn, director and co-founder of Apollo Research. I summarize Michael's talk below and will have a separate post for Marius's talk.

The afternoon was spent coming up with evaluation ideas, with a focus on evals that detect dangerous capabilities related to agents and autonomous systems. We were given sub-categories:

The ideas were judged by a panel of four or five people. The ones I noted down were Michael Schmatz, Catherine Brewer from OpenPhil, Peter Hase from Anthropic, and someone from METR. The criteria for the evals:

Winning evals

Disclaimer: these descriptions are based on 10-20-second impromptu summaries given by the winners when the awards were announced, so there is a high chance I have misunderstood the details.

1st place. A big way human researchers learn is by reading eval transcripts. So, to elicit capabilities from AI systems, those transcripts should be given to the AI systems as context, following METR's elicitation guide.
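As a minimal sketch of what this could look like in practice (the function name and prompt format here are my own invention, not METR's or the winner's), one could prepend earlier transcripts to the task prompt as in-context examples:

```python
# Hypothetical sketch: feed prior eval transcripts to the model as context
# before the new task, so it can learn from earlier attempts in-context.
def build_elicitation_prompt(task: str, transcripts: list[str]) -> str:
    """Format earlier eval transcripts as in-context examples before the task."""
    examples = "\n\n".join(
        f"--- Transcript {i + 1} ---\n{t}" for i, t in enumerate(transcripts)
    )
    return (
        "Here are transcripts of previous attempts at similar tasks:\n\n"
        f"{examples}\n\n"
        f"Now attempt the following task:\n{task}"
    )

prompt = build_elicitation_prompt(
    "List the open ports on the target host.",
    ["[agent] ran a port scan ... found ports 22 and 80 open"],
)
print(prompt)
```

The interesting design question is which transcripts to include: successful runs only, or failures too, since failures may also carry useful lessons.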

2nd place. There are methods of covertly reading or transmitting digital information based on analog signals, known as 'side channel attacks'. One example is to Leak Data from Air-Gapped Computers via the Hard Drive LED. Another is differential power analysis, in which one analyzes power-consumption measurements to recover secret keys from tamper-resistant devices. The evaluation is to somehow test whether LLMs are capable of some kind of side channel attack.
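As a toy illustration of the differential power analysis idea (my own sketch on simulated data, not part of the winning proposal), one can assume power consumption leaks the Hamming weight of (plaintext XOR key), then recover the key byte by correlating each guess's predicted leakage against the measured traces:

```python
import numpy as np

# Toy differential power analysis on simulated traces. All parameters
# (key value, trace count, noise level) are illustrative.
rng = np.random.default_rng(0)

SECRET_KEY = 0x3C  # the byte the "attacker" does not know
N_TRACES = 2000    # number of simulated power measurements

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

plaintexts = rng.integers(0, 256, size=N_TRACES)
# Simulated measurement: Hamming-weight leakage plus Gaussian noise.
leakage = np.array([hamming_weight(int(p) ^ SECRET_KEY) for p in plaintexts])
traces = leakage + rng.normal(0.0, 1.0, N_TRACES)

# For every key guess, correlate predicted leakage with the measured traces;
# the correct guess explains the power consumption best.
correlations = [
    np.corrcoef([hamming_weight(int(p) ^ g) for p in plaintexts], traces)[0, 1]
    for g in range(256)
]
recovered = int(np.argmax(correlations))
print(hex(recovered))
```

An eval along these lines might test whether an LLM can write or reason about such an attack given only the traces, rather than hand it the leakage model as this sketch does.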

3rd place. Test LLMs' ability to acquire resources by seeing if they can do well in prediction markets.
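One hypothetical way to score such an eval (my own sketch, not something AISI or the winner specified) is a proper scoring rule such as the Brier score over the model's stated probabilities on resolved markets:

```python
def brier_score(forecasts: list[float], outcomes: list[int]) -> float:
    """Mean squared error between probability forecasts and 0/1 outcomes.
    Lower is better; always answering 50% scores 0.25."""
    assert len(forecasts) == len(outcomes)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

# An uninformative forecaster scores 0.25; a sharp, well-calibrated one
# scores much lower.
print(brier_score([0.5, 0.5, 0.5], [1, 0, 1]))  # 0.25
print(brier_score([0.9, 0.1, 0.8], [1, 0, 1]))  # ≈ 0.02
```

Measuring actual resource acquisition would also need the market mechanics (odds, fees, bankroll management), but a scoring rule like this isolates the forecasting-skill component.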

Honorable mentions:

I wonder what you think of these. Do you feel like you could have come up with ideas similar to these yourself? If yes, then do it and submit proposals! If not, what do you think is missing?

Common pitfalls

The judges described common pitfalls in the weaker proposals.

My evals

I had the pleasure of brainstorming ideas with Michael Schmatz, from the AISI autonomous systems team. We focused on development and assurance compromise, i.e., whether it is possible for AI systems to interfere with how AI systems are developed or evaluated.

Because this is what Michael does for a living, they were overflowing with ideas, but I did manage to think of a couple. The three ideas I submitted were:

Other ideas I did not have time to flesh out but were discussed:

Highlights from talk by Michael Schmatz

In no particular order.



