少点错误 2024年11月28日
ARENA 4.0 Impact Report
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

ARENA 4.0计划旨在培养人工智能安全领域的人才,通过为参与者提供机器学习技能培训、融入人工智能安全社区以及加速职业转型,促进人工智能安全领域的发展。该报告评估了ARENA 4.0计划的成果,发现其在参与者质量、技能提升、社区融入和职业发展等方面均取得了成功,参与者对课程的满意度极高,并对未来职业发展充满信心。报告还分析了参与者的背景、学习成果以及计划的改进方向,为未来ARENA计划的开展提供了宝贵的经验和建议。

🤔**参与者质量高**: ARENA 4.0吸引了来自全球各地的33名参与者,其中10人拥有4年以上软件工程经验,4人拥有博士学位或更高学历,确保了参与者的高素质和专业能力。

🚀**技能提升显著**: 参与者在为期4周的课程中学习了深度学习基础、机制可解释性、大型语言模型评估和强化学习等内容,并完成了具有挑战性的项目,估计在没有课程的情况下,学习相同内容需要10.2周。

🤝**社区融入成功**: 参与者高度评价了在LISA环境中与志同道合的同行交流的机会,这突显了在人工智能安全中心举办线下课程的重要性,许多参与者表示对独立开展机器学习安全研究充满信心。

💼**职业发展加速**: 4名参与者在课程结束后获得了人工智能安全领域的全职工作,另有24名参与者计划或正在积极申请相关职位,表明该计划有效地帮助参与者实现了职业转型。

💡**课程改进建议**: 缩短项目启动时间以及增加参与者数量导致了人才流失,未来应增加项目启动前的准备时间,并优化申请流程,吸引更多优秀人才。

Published on November 27, 2024 8:51 PM GMT

If you're interested in helping to run the ARENA program, note that we're currently hiring for an operations lead! For more details, and to apply, see here.

Summary

The purpose of this report is to evaluate ARENA 4.0’s impact according to our four success criteria:

    Source high-quality participantsUpskill these talented participants in ML skills for AI safety workIntegrate participants with the existing AI safety community and legitimise AI safety as a compelling field to work inAccelerate participants’ career transition into AI safety

Overall, this iteration of ARENA was successful according to our success criteria.

    Criteria 1: Our participants were of high calibre, with ten having more than 4 years of experience as software engineers at top companies and four having a PhD-level academic qualification or higher.Criteria 2: Our in-person programme lasts 4 weeks. The majority of participants felt they achieved the challenging concrete learning goals for each topic. Participants estimated the counterfactual time needed to learn the full ARENA content outside the programme as 10.2 weeks (even if they had access to ARENA’s material). We were particularly impressed with the capstone projects completed in the programme’s final week, two of which can be found here: capstone 1, capstone 2. We expect to see even more great work in the future!Criteria 3: Participants rated the value of being in the LISA environment as 8.9/10. The top cited “most valuable gain” from the programme was meeting talented and like-minded peers in the AI safety community, emphasising the importance of running the programme in person at an AI safety hub. We are also glad that many participants commented on now feeling confident to take on ML safety research projects independently.Criteria 4: Four participants had full-time AI safety roles at the end of the programme, and a further 24/33 participants planned to or were actively applying to AI safety roles at the end of the programme.

The structure of the report is as follows:

Programme Information

First, we outline when this programme occurred, what topics were covered, and the main changes made to the programme in contrast to previous iterations. For more information about our curriculum content, see our website.

ARENA 4.0 Programme

ARENA 4.0 ran from the 2nd of September until the 4th of October 2024. The schedule of the programme was as follows:

Main Changes

The main changes for ARENA 4.0 compared with ARENA 3.0 (ran in Q1 2024) were:

Method

We surveyed our participants at the programme's start (on day 1) and at the end (on the last day). Our impact analysis is mainly based on responses at the end of the programme, while some are based on comparing the changes in responses between the start and the end.

We collected three types of responses:

We evaluated open-ended responses using thematic analysis. We highlighted keywords in each response, identified recurring themes and patterns across responses, reviewed the themes, and then counted the frequency of each theme across participant responses. Each count comes from a different participant, but each participant can add to multiple theme counts if their response mentions them.

Criteria 1: Sourcing high-quality participants[1]

Overall, our selection procedure worked effectively. ARENA 4.0 had a geographically diverse cohort of high quality participants, with 10 participants having more than four years experience in software engineering, and four having or conducting a PhD.

Selection process

Initial applications for ARENA opened on the 6th of July 2024 and closed on the 20th of July. The coding test ran from the 25th of July until the 29th of July (with an extension until the 1st of August). Interviews ran from the 4th of August until the 7th of August.

Who we selected

We selected 33 participants from ~350 applications. Participants came from across the world, including the US, UK, EU, India, and South Africa. The current jobs of participants can be seen in Figure 1.

Figure 1: Current careers of participants (at the start of the programme)[2]

As shown in Figure 2, the highest academic degree for 55% of participants was a Bachelor’s degree, for 34% a Master’s degree, and for 13% a PhD.

Figure 2: Highest academic degree of participants

Improvements

The short notice period (2 months from announcement to start of programme, with only 1 month for selection) needed to be increased. This likely caused us to lose out on strong talent. Several participants gave late applications we accepted after deadlines, and several participants mentioned that the application timeline felt very tight. Future iterations will benefit from a more consistent schedule with longer lead times.

Criteria 2: Upskilling

As an ML program, our core goal is to upskill participants to tackle technical problems in AI safety. The first four weeks of the ARENA in-person programme cover four technical topics (more detail on each topic is provided in the relevant sections):

Each topic lasts a week. We asked participants to rate these weeks according to three criteria at the end of the programme:

Overall, participants found exercises both challenging and enjoyable. Participants rated ARENA exercises at 8.5/10 for enjoyment and 7.5/10 for difficulty. The total time participants estimated it would’ve taken them to learn the materials outside the programme (without TAs, lectures, pair-programming) was 10.2 weeks. This shows ARENA is successfully achieving its upskilling goals and emphasises the importance of selection, as even our high-calibre participants found the exercises quite challenging.

Week 0: Fundamentals

The aim of this week is for participants to reinforce basic deep-learning concepts. This week had 25 participants, as it was optional for those with significant deep-learning experience. Topics covered include PyTorch, basics of neural networks, residual neural networks, CNNs, weights and biases, optimisation, and backpropagation.

Participants said, on average, that it would take 2.3 weeks to learn the deep learning fundamentals week’s materials on their own if they had access to ARENA materials, as seen in Figure 3 below.

Figure 3: Counterfactual learning time of fundamentals week

Week 1: Mechanistic Interpretability

The aim of this week is for participants to understand some of the methods that can be used to analyse model internals and replicate the results from key interpretability papers. Topics covered include the following: GPT models, training and sampling from transformers, TransformerLens, induction heads, indirect object identification, superposition, linear probes, inference-time intervention, and sparse autoencoders.

We asked participants to rate out of 7 their confidence in doing three concrete tasks in mechanistic interpretability (Figure 4); on average, participants rated their ability in these concrete outcomes at 5.64. Participants said, on average, that it would take 3.7 weeks to learn the mechanistic interpretability week’s materials on their own, as seen in Figure 5.

Figure 4: Concrete learning outcomes for mechanistic interpretability
Figure 5: Counterfactual learning time of mechanistic interpretability week

Week 2: LLM Evaluations

A new week of content on LLM evaluations was developed for this iteration of ARENA. This content aimed for participants to build alignment and dangerous capability evaluations in multiple-choice and agentic settings, and understand how to use these evaluations to gain information about current frontier LLMs. We added this week because evals have become a crucial field of AI safety, and there are no high-quality educational materials on safety evals to our knowledge. Topics covered include the following: threat modeling, using LLM APIs, implementing a pipeline to generate questions using LLMs, UK AISI’s inspect library, implementing LLM agents, and scaffolding LLM agents.

Overall, participants achieved the concrete learning outcomes to a similar extent as the other weeks, which we see as a positive sign for the first time running this content! On average, participants rated their confidence in designing multiple-choice question evals and agent evals as 6.0/7 and said it would take 1.6 weeks to learn this content independently (see Figures 6 and 7).

Regarding learning experience, participants rated this week an average enjoyment of 4.9/7 and an average usefulness of 5.0/7 across the days (based on responses from 12 participants out of 33 on an LLM evals feedback form). Participants particularly enjoyed the content on agentic dangerous-capabilities evaluations and the conceptual thinking of day 1 on threat-modeling and eval design.

We identified several areas of improvement to the learning experience. The most commonly cited issue in enjoyment ratings was code bugs in the materials (although this is sometimes unavoidable when producing new content). Second, we identified a more valuable set of learning outcomes from running this content on participants (in particular, a reduced emphasis on manual prompting of LLMs and an increased emphasis on designing evals infrastructure and agentic evals). The materials would benefit from a set of changes to put more emphasis on these learning outcomes, including adding clarifications and exercises on key concepts, restructuring the content to achieve a better flow, and adding additional content to challenge participants further.

Figure 6: Concrete learning outcomes for LLM evaluations week
Figure 7: Counterfactual learning time of LLM evaluations week

Week 3: Reinforcement Learning

This week's core aim is for participants to understand classical and deep RL methods and how RLHF is implemented on LLMs as the dominant alignment method used today. Topics covered include the following: Fundamentals of RL, gym & gymnasium environments, policy gradient optimisation, PPO, deep Q-learning, RLHF, HuggingFace, and fine-tuning LLMs.

We asked participants to rate out of 7 their confidence in doing four concrete tasks in RL; the results can be seen in Figure 8. Participants rated their ability in these concrete learning outcomes at 5.44 on average. We asked participants to rate out of 7 their confidence in doing the following concrete tasks in RL. Participants said, on average, that it would take 3.2 weeks to learn the RL week’s materials on their own.

Figure 8: Concrete learning outcomes for RL week
Figure 9: Counterfactual learning time for RL week

Overall Learning Experience

Finally, we asked participants how they found the ARENA materials overall. This helps us calibrate participant calibre across different ARENA cohorts and elicit feedback on the quality of our teaching mechanisms. On average, participants rated 8.5 out of 10 for exercise enjoyment, 7.6 out of 10 for exercise difficulty, and 8.3 out of 10 for teaching quality.

Figure 10: Overall exercise enjoyment across all weeks
Figure 11: Overall exercise difficulty across all weeks
Figure 12: TA and lecture rating

Criteria 3: Integration

Our participants spent 4 to 5 weeks full-time in the LISA office in London. Overall, they enjoyed their time there! Participants rated on average 8.9/10 for “How valuable was being in the LISA environment” (see Figure 13).  

The most cited value from participants’ open-ended feedback was feeling that they were part of the AI safety community in ways they had not been before (see Figure 14). We were particularly excited to see a few outcomes mentioned by participants: (1) that they met a group of like-minded peers who are at similar career stages and who can help/motivate each other to progress further, (2) that they met senior researchers who helped give future directions and legitimise the field. We see these as two of the most impactful values of ARENA. They underline the importance of running the program in person in an AI safety hub like LISA. We are incredibly glad to be able to make use of — and contribute to — LISA’s thriving community of AI safety researchers with the ARENA programme!

Figure 13: Participants’ rating of the value of the LISA environment
Figure 14: Analysis of participants’ comments on what was particularly valuable about LISA

Here are some representative quotes that illustrate these themes:

Connections to/feeling like a part of the AI safety community

Access to top researchers

Meeting like-minded talented people

Motivation to learn and get unstuck

Immediate access to TAs

Criteria 4: Career Acceleration

Finally, ARENA aims to accelerate participants' AI safety careers. We’re excited about the career outcomes for this cohort. Most participants (~70%) are actively pursuing AI safety careers, either in the planning or application stages (see Figure 16). Even more exciting: four participants have already secured full-time AI safety positions starting within the next four months. These results suggest that ARENA is successfully achieving one of its core goals of providing talented individuals with the skills to go directly into AI safety work. We’re especially encouraged to see participants moving directly into impactful roles so quickly after the conclusion of ARENA.

We also saw a difference in participants’ confidence in AI safety being the right field for them. At the start of the programme, participants rated on average of 7.7 out of 10 for “How confident are you that AI safety is the right field for you?” and 8.2 out of 10 by the end of the programme. The between-person difference across the programme was, on average, +0.6 (see Figure 15). This demonstrates the impact that ARENA has in increasing people’s confidence on whether AI safety is a good personal fit for them.

Two participants had reduced their certainty that AI safety is the right field for them. Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way. We expect a few to decrease their confidence, to validate that ARENA’s materials are technical and challenging enough to test for fit.

Figure 15: Participants’ AI safety career confidence
Figure 16: Participants’ current AI safety career situation (end of programme)

Overall Programme Experience

We asked the participants, “How much did you enjoy the programme overall?” at the end. The average participant rating was 9.11 out of 10.

Figure 17: Overall programme enjoyment

Most valuable gain

We asked participants, “What was the most valuable thing you gained from the programme?” and thematically analysed their open-ended responses. We identified the following common themes.

Figure 18: Analysis of what participants found most valuable about ARENA

Notably, the most frequently mentioned “most valuable gain” was meeting talented and like-minded people. This reflects the value of LISA mentioned in the previous section and underlines the importance of running ARENA in person to achieve our third success criteria of integrating participants within the AI safety ecosystem. Furthermore, we are enthused that the third most commonly cited valuable gain was “confidence to take on ML work” — this was one of the principal outcomes that we aimed for under our upskilling goal — the programme is only 5-weeks, so an impactful upskilling outcomes is if participants gain confidence to be able to bootstrap themselves to learn at a faster rate in the future.

Here are some representative responses for each of our core themes:

Meeting talented and like-minded people

ML skills and knowledge

Confidence to take on AI safety work

Many responses specifically mentioned a boost in their confidence in undertaking AI safety work. We’ve separated this out from references of particular AI safety knowledge and experience (“ML skills and knowledge”).

Ability to quickly execute on a project

Capstone Project

Improvements

As a team, we endeavour to use feedback to improve the quality of ARENA for participants. Each iteration, we learn how to better run the programme so that its impact can grow for all of our participants. Although this programme was overall successful according to its four success criteria, we noticed some core improvements that would enable it to run even better. The key improvements we noticed in this iteration are:

Acknowledgments

This report was produced by @Chloe Li (data analysis, writing), @JamesH (writing), and @James Fox (reviewing, editing) at ARENA. We thank @CallumMcDougall for his comments on the final draft. We also thank Open Philanthropy for their generous support of the ARENA program.

  1. ^

     Note: Some details of this section have been redacted, in order that key details and aspects of how we select and choose participants remain private to avoid potential issues that may arise in our selection process in the future.

  2. ^

     Note: “Conducting alignment research” only includes those who are currently working full-time on alignment research (independently, as mentee/intern, or employed), not those who have in the past or are working part-time on alignment. This was not self-reported by the participants, but annotated by us based on their CV, so there may be some inaccuracies.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

ARENA 人工智能安全 机器学习 人才培养 职业发展
相关文章