少点错误 07月23日 23:32
Ten AI safety projects I'd like people to work on
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文作者对未来十年内AI系统可能带来的灾难性风险表示担忧,并基于两年来的工作经验,提出了九个值得关注和投入的AI安全领域项目。这些项目涵盖了AI安全领域建设、技术治理研究、AI行为追踪、传播咨询、实验室监测、文献综述、经济影响追踪以及AI审计等多个方面。作者强调了这些项目在降低AI风险、推动负责任AI发展以及提升全社会对AI风险的认知方面的重要性,并为每个项目设想了初步的实施路径。

🌐 **AI安全领域建设:** 旨在通过设计和运行项目,吸引具有安全工程经验的专业人士,并对他们进行与转型AI相关的安全挑战培训,如保护模型权重、防止数据污染和抵御数据泄露等。此举旨在解决AI安全领域人才瓶颈问题,确保强大AI系统不落入不法之徒手中。

🔬 **技术AI治理研究组织:** 建议成立一个专注于技术AI治理的研究机构,深入探讨如“如何在实践中执行计算治理机制”或“如何实现可验证的模型审计而不损害模型安全”等关键问题。该组织还将设立研究员项目,为早期和中期技术人才提供探索AI治理项目的机会,并建立行业联系。

🔍 **追踪“在野”AI代理行为:** 提议建立一个组织,系统性地调查已部署AI代理是否存在失职、欺骗或不当行为的迹象。这包括分析交互日志、创建“诱捕”环境、访谈AI代理的重度用户,并发布详细的案例研究,以提高对AI风险的认识和应对能力。

✍️ **AI安全传播咨询:** 建议成立一家专业的传播机构,帮助AI安全领域的组织更有效地沟通其工作。该机构将雇佣具备AI领域知识的传播专业人士,提供媒体培训、写作支持、战略规划等服务,以提升AI风险的公众认知和政策讨论的质量。

📊 **AI实验室监测:** 呼吁成立独立组织,对前沿AI实验室的安全相关实践和政策进行细致分析。通过追踪安全测试程序、审查负责任的扩展政策、评估实验室应对快速AI进展的计划等,为政策制定者和AI安全社区提供可靠信息,促进实验室的改进和问责。

📚 **AI安全动态文献综述:** 提议创建针对核心AI安全主题的、由专家维护并持续更新的文献集合。这些综述将汇集最新的研究、博客和讨论,为政策制定者、资助者和研究人员提供关于AI系统欺骗证据、研究议程、政策想法等方面的清晰图景。

💰 **100亿美元AI韧性计划:** 构思一个详细的实施计划,说明如何部署100亿美元资金以显著推进AI对齐/控制研究和/或社会韧性建设。该计划将包含具体的项目结构、优先资助领域、预算分配和时间表,以应对潜在的重大AI风险。

✅ **AI事实核查工具:** 建议建立一个(可能是盈利性的)组织,开发透明且无偏见的AI驱动的事实核查工具,例如实时核查声明的浏览器扩展。该组织将进行严格的偏见评估,发表研究论文,并维护公共数据集,以提升社会信息质量。

⚖️ **AI审计员:** 提议成立一家公司,构建能够进行合规审计的AI代理。这些系统旨在自动化人工审计的繁琐任务,如审查文档和日志、检查安全程序实施情况等。尽管存在安全风险,但AI审计员有望实现更频繁、更全面的合规检查。

Published on July 23, 2025 3:28 PM GMT

Cross-posting this new blog post I wrote. Apologies for the trite intro — I'm trying to write the blog in a pretty accessible way.

Here's the Twitter thread.

Finally, if you’re interested in applying for funding to work on one of these projects (or something similar), check out our RFP.


I’ve said it before, and I’ll say it again: I think there’s a real chance that AI systems capable of causing a catastrophe (including to the point of causing human extinction) are developed in the next decade. This is why I spend my days making grants to talented people working on projects that could reduce catastrophic risks from transformative AI.

I don't have a spreadsheet where I can plug in grant details and get an estimate of basis points of catastrophic risk reduction (and if I did, I wouldn't trust the results). But over the last two years working in this role, I’ve at least developed some Intuitions™ about promising projects that I’d1 like to see more people work on.

Here they are.

1. AI security field-building

What: Design and run a field-building program that takes security engineers (with at least a few years of work experience) and teaches them about the types of security challenges most relevant to transformative AI: securing model weights and algorithmic insights from highly resourced adversaries, preventing data contamination, defending against exfiltration attacks, etc. The program would also give actionable advice on how to break into the field. This could look like a 6-week, part-time discussion group covering a curriculum of AI security readings. It could also involve some type of “buddy system” where participants are paired with mentors working on AI security (a la MATS).

Why this matters: It would be unfortunate if powerful AIs were stolen by bad guys. Keeping AI projects secure seems important for making transformative AI go well, and progress on this looks tractable. Though I understand that there are a few AI security talent development projects already in the works, I’ve heard that talent bottlenecks are a persistent problem (especially for security roles that aren’t at AI labs, as these positions offer much lower salaries).

What the first few months could look like: Designing and getting feedback on a curriculum (or finding someone to do this for you), figuring out the program structure, reaching out to potential mentors/guest speakers, and hiring an ops person.

2. Technical AI governance research organization

What: Start a research organization that primarily focuses on the kinds of technical AI governance topics outlined in this paper. It could also run a fellowship program (e.g., 3-6 months, with a part-time option for people who already have jobs) that allows early- to mid-career technical people to explore a technical AI governance project, build greater context on AI safety, and develop connections with others in the field. “Technical AI governance” is a pretty broad/vague term, so it’s probably worth narrowing down to a few complementary directions, at least to start. Here are a few from the Bucknall/Reuel et al. paper that I’m particularly excited about.

Why it matters: Rigorous technical analysis on questions like "how would you actually enforce [insert compute governance mechanism] in practice?" or "what techniques can allow for verifiable model auditing without compromising model security?" can be extraordinarily helpful for making policy proposals more feasible (and thus more likely to be implemented). My sense is that a lot of AI safety people have gotten excited about this paradigm in the last year or so (justifiably IMO), but there’s still more room for this kind of work. See Open Problems in Technical AI Governance for more. FAR AI also recently ran a workshop on this.

What the first few months could look like: Figuring out which subset of technical AI governance research areas to focus on (if you haven’t done this already), speaking to people working on those kinds of problems to get possible research project ideas, and hiring a few researchers.

3. Tracking sketchy AI agent behaviour “in the wild”

What: Start an organization to systematically investigate deployed AI agents for signs of misalignment, scheming, or general sketchy behaviour in the wild. This could involve a number of possible activities: (1) partnering with AI companies to analyze anonymized interaction logs for concerning behaviour patterns, (2) creating honeypot environments to see if AI agents attempt to gain unauthorized access or resources, (3) interviewing power users of AI agents (e.g., companies) to gather preliminary signals of situations where agents might be doing sketchy things, and (4) writing about case studies of deployed agents acting sycophantic, manipulative, deceptive, etc.

The organization could also publish detailed case studies of confirmed incidents and maintain a public database of problematic behaviours observed in deployed systems (though only ones relevant to misalignment, and not “AI harm” more broadly construed).

Why this matters: For a long time, folks worried about misalignment mostly on the basis of theoretical arguments (and occasionally some lab experiments with questionable ecological validity). Things have changed: LLMs are starting to exhibit increasingly sophisticated and concerning behaviour, such as attempting to prevent their preferences from being changed, systematically gaming their evaluation tasks, and aiming for high scores rather than actually solving the problems at hand. We should go a step further and try hard to check if these concerns are actually manifesting in real-world deployments (and if so, in what ways and at what scale). Thoughtful, rigorous, and real-world observational evidence about misalignment would be valuable for grounding policy discussions and improving the world's situational awareness about AI risk.

What the first few months could look like: Picking 1-2 workstreams to start with, speaking with people working on relevant topics (e.g., at AI companies) to understand challenges/opportunities, and learning more about how other OSINT projects work (to understand analogies and disanalogies).

4. AI safety communications consultancy

What: A dedicated communications firm specializing in helping organizations working on AI safety communicate more effectively about their work. The firm would employ communications professionals who can build (or already have) AI context. They'd provide a range of core communications support services — media training, writing support, strategic planning, helping to pitch op-eds to editors, interview prep, developing messaging frameworks, etc. Unlike general communications firms, this firm would invest in understanding the nuances of AI safety to provide more effective support (which IMO is really helpful). Importantly, at least one of the founders should have experience working at a communications firm — I love bright generalists, but this really demands some prior experience and connections.

Why this matters: Communicating clearly about AI risk is crucial for informing policy discussions and building public understanding, but many organizations working on narrow technical problems simply aren't suited to do this well. A researcher focused on a niche technical topic probably doesn’t know how to pitch an op-ed to a major publication or land a TV interview. Many organizations hire people to help with this, but in many instances, having in-house staff doesn’t make sense, and it’s hard to find consultants who understand these issues well. A new consultancy could help to solve this problem.

What the first few months could look like: Talking to potential clients to understand their needs and challenges, doing (or commissioning) research on messages/frames/messengers, and generally getting a bird’s-eye view of the AI safety ecosystem.

5. AI lab monitor

What: An independent organization (or even a single talented person if they’re really good) that conducts meticulous analysis of frontier AI labs' safety-relevant practices and policies. They'd track and analyze safety testing procedures and what safeguards labs actually implement, check whether responsible scaling policies are being followed (and whether there are ways they can be strengthened), critically examine labs' plans for navigating periods of rapid AI progress, dig into corporate governance structures, and generally document all safety-relevant decisions these companies are making. Alongside a constant stream of analysis on specific topics of interest, this effort might also produce semi-regular (e.g., quarterly) scorecards rating labs across key dimensions.

Why this matters: Running a frontier AI lab involves designing and implementing complicated safety practices that are constantly evolving. This is an enormous challenge, and the decisions made by AI labs today could be crucial for avoiding catastrophic outcomes over the next few years. When labs update their safety policies, modify security requirements, or change evaluation procedures, the implications can be subtle but significant (and sometimes these changes are buried in long technical reports). We need people whose bread and butter is obsessively tracking these details — and importantly, effectively communicating their findings in ways that lead to engagement by policymakers, AI lab employees, and the broader AI safety community. This kind of rigorous external analysis serves multiple purposes: it helps labs improve and stay accountable to their practices, it gives policymakers reliable information about how safety practices are being implemented, and it improves the public’s situational awareness about crucial decisions being made by the most important companies in the world.

What the first few months could look like: Developing a comprehensive tracking system for monitoring lab activities (RSS feeds for blog posts, paper releases, job postings, etc.), producing a few short-ish analyses to learn more about what potential consumers of this analysis find useful, building relationships with current/former lab employees, and establishing distribution channels for your analysis (Substack, Twitter presence, a newsletter, a website, maybe a database, relationships with journalists who cover AI, etc).

6. AI safety living literature reviews

What: A “living literature review” is a continuously updated collection of expert-authored articles on a specific topic. It would be great if there were a few of these for core AI safety topics. Questions like: What's the state of evidence for AI systems engaging in scheming? Which technical AI safety research agendas are people working on, and what progress has been made? What policy ideas have people floated for making AI safer (bucketed into categories of policy ideas)? Will there really be major job losses before transformative AI arrives? Each review would be maintained by a single expert (or small team) who synthesizes the memesphere into digestible, regularly updated articles.

Why this matters: The AI safety field would benefit from more high-quality synthesis. Every week brings new papers, blog posts, Twitter threads, and hot takes. Meanwhile, policymakers, funders, and even researchers themselves struggle to maintain a coherent picture of what is going on. Living literature reviews seem like an interesting and under-explored model for helping with this.

What the first few months could look like: Picking a single topic to focus on, writing some introductory material, reading up on anything where you have gaps in your knowledge, and then producing something that summarizes the current discourse on that topic. You'd also want to set up the basic infrastructure — a Substack for publishing, a simple website, a Twitter account for distribution, and a system for tracking new research and analysis (RSS feeds, Google Scholar alerts, maybe just following the right people on Twitter).

7. $10 billion AI resilience plan

What: A comprehensive, implementation-ready plan detailing how $10 billion (somewhat arbitrarily chosen) could be deployed to make significant progress toward AI alignment/control research and/or societal resilience measures. This would ideally be a fairly detailed blueprint with specific program structures, priority funding areas, budget allocations, timelines, milestones, etc.

Why this matters: Going from "we should spend more on AI safety" to “here’s how we can spend more on AI safety” is non-obvious, yet we might be in this scenario if, e.g., a major government (not necessarily the US government; even smaller countries have big budgets) wakes up to transformative AI risk or if philanthropic interest spikes after some big warning shot.

What the first few months could look like: Interviewing relevant experts (e.g., AI safety researchers, policy people, funders) to inform your view on the top priorities, researching existing large-scale research funding programs (DARPA, NSF, etc.) to see if there’s anything you can learn, developing a taxonomy of different intervention areas (ideally with rough budget allocations), and creating a few concrete "shovel-ready" program proposals that could quickly absorb significant funding.

8. AI tools for fact-checking

What: An organization (possibly a for-profit) that builds AI-powered fact-checking tools designed for transparency and demonstrable lack of bias. One example product could be a browser extension that fact-checks claims in real-time. Unlike other AI research tools (e.g., Deep Research), this could prioritize visible chain-of-thought reasoning and open-source code to make it more trustworthy. The organization would ideally conduct rigorous bias evaluations of its own tools, publish research papers, and maintain public datasets of fact-checking decisions. Beyond the browser extension, they could develop APIs for platforms to integrate fact-checking.

Why this matters: Better societal epistemics would be valuable for many obvious reasons, but we’d really benefit from tools that help to provide decisionmakers with quality information — especially when AI is causing rapid and unprecedented societal change. It’s hard to overstate how important smart decision-making will be if our institutions get stress-tested by fast AI progress.

What the first few months could look like: Probably you’d want to follow basic sensible advice on starting a company — for example, this book apparently has a good reputation. I haven’t run a startup before, so take this with a giant grain of salt, but I’d guess it would be good to find good advisors, experiment with a basic MVP, research existing fact-checking methodologies and their limitations, and talk to potential users to understand what they actually want.

9. AI auditors

What: Start a company to build AI agents that can conduct compliance audits. These systems would automate the labor-intensive tasks that might otherwise be performed by human auditors: reviewing documentation and logs, checking implementation of safety procedures, etc. Critical caveat: this would need extraordinarily careful implementation. Giving AI systems a bunch of access to labs' internal systems creates massive attack surfaces — the AI auditors could be compromised in a variety of ways.

Why this matters: Human-led safety audits would be expensive, time-consuming, and face inherent trust problems. AI auditors could theoretically enable more frequent and comprehensive compliance checks while avoiding information leakage (you can much more easily wipe an AI's memory compared to a human). However, the security risks of implementing this carelessly would be severe: an adversary could compromise the auditor to falsely certify unsafe practices or to introduce some vulnerabilities into the lab. I honestly wouldn't be surprised if security folks at AI labs would see this and think "absolutely not, no way we're letting an AI system have that kind of access" — and honestly, fair enough. But I think it’s a cool enough idea that merits more exploration.

What the first few months could look like: Again, probably you’d want to follow something close to this. I haven’t run a startup before, so take this with a giant grain of salt, but I’d guess it would be good to find good mentors/advisors, research existing auditing proposals, and talk to people at labs to understand constraints that might not be obvious from the outside.

10. AI economic impacts tracker

What: An organization focused on examining how AI is transforming the economy, through both original research and reporting. They'd do things like: survey workers in management roles about how often they use AI (and in what contexts), conduct economics-style research to understand AI’s contribution to worker productivity (like this study by METR, or this paper but not bullshit), and investigate whether/how AI is impacting hiring decisions at large companies (e.g., this was a big claim — is it true?). Think Epoch’s approach but applied to economic impacts rather than compute, algorithmic progress, energy usage, and other capabilities measures. They might partner with companies to get granular deployment data, commission longitudinal studies following specific firms as they integrate AI, and maintain a comprehensive database of AI adoption metrics that researchers and policymakers could use to make sense of AI’s diffusion across the economy. Anthropic’s Economic Futures program is pretty similar to what I have in mind.

Why this matters: Current approaches to tracking AI progress — training compute, benchmark scores, capability demos, energy usage — are valuable but paint an incomplete picture. There's often a massive gap between "AI can nail the bar exam" and "AI is meaningfully changing how paralegals do their jobs", and we’d benefit from more people combing through the data (and in some cases, going out and gathering it) to make sense of that gap. An organization focused on producing analysis at this level of abstraction could make it easier to tell how close we are to truly wild societal impacts.

What the first few months could look like: Talking to the folks working on this at Anthropic (and other labs if they’re doing similar work), identifying 2-3 specific sub-areas to focus on initially (and doing a really solid job at analyzing them before expanding), hiring support staff (e.g., RAs), and getting a bunch of advisors to review your work.


Caveats, hedges, clarifications

Here are some important caveats:


If you’re interested in applying for funding to work on one of these projects (or something similar), check out our RFP.

1 Emphasis on “I” — these are my takes only. I'm not speaking on behalf of Open Philanthropy.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 人工智能风险 技术治理 AI发展 AI伦理
相关文章