A Rocket–Interpretability Analogy

Published on October 21, 2024 1:55 PM GMT

1.

4.4% of the US federal budget went into the space race at its peak.

This was surprising to me, until a friend pointed out that landing rockets on specific parts of the moon requires very similar technology to landing rockets in Soviet cities.[1]

I wonder how much more enthusiastic the scientists working on Apollo were, with the convenient motivating story of “I’m working towards a great scientific endeavor” vs “I’m working to make sure we can kill millions if we want to”.

2.

The field of alignment seems to be increasingly dominated by interpretability (and obedience[2]).

This was surprising to me[3], until a friend pointed out that partially opening the black box of NNs is the kind of technology that would help scaling labs find new unhobblings, by noticing ways in which the internals of their models are being inefficient and by giving them better tools to evaluate capabilities advances.[4]

I wonder how much more enthusiastic the alignment researchers working on interpretability and obedience are, with the motivating story “I’m working on pure alignment research to save the world” vs “I’m building tools and knowledge which scaling labs will repurpose to build better products, shortening timelines to existentially threatening systems”.[5]

3.

You can’t rely on the organizational systems around you to be pointed in the right direction, and there are obvious commercial incentives to channel your idealistic energy towards types of safety work which are dual-use or even primarily capabilities-enabling. For similar reasons, many of the training programs prepare people for the kinds of jobs which come with large salaries and prestige, as a flawed proxy for moving the needle on x-risk.

If you’re genuinely trying to avert AI doom, please take the time to form inside views away from memetic environments[6] which are likely to have been heavily influenced by commercial pressures. Then back-chain from a theory of change where the world is more often saved by your actions, rather than going with the current and picking a job with safety in its title as a way to try and do your part.

  1. ^

    Space Race - Wikipedia:

    It had its origins in the ballistic missile-based nuclear arms race between the two nations following World War II and had its peak with the more particular Moon Race to land on the Moon between the US moonshot and Soviet moonshot programs. The technological advantage demonstrated by spaceflight achievement was seen as necessary for national security and became part of the symbolism and ideology of the time.

  2. ^

    Andrew Critch:

    I hate that people think AI obedience techniques slow down the industry rather than speeding it up. ChatGPT could never have scaled to 100 million users so fast if it wasn't helpful at all.

     

    Making AI serve humans right now is highly profit-aligned and accelerant.

     

    Of course, later when robots could be deployed to sustain an entirely non-human economy of producers and consumers, there will be many ways to profit — as measured in money, materials, compute, energy, intelligence, or all of the above — without serving any humans. But today, getting AI to do what humans want is the fastest way to grow the industry.

  3. ^

These paradigms do not seem to be addressing the most fatal filter in our future: strongly coherent goal-directed agents forming with superhuman intelligence. These will predictably undergo a sharp left turn, where the soft/fuzzy alignment techniques which worked at lower power levels fail simultaneously as the system reaches high enough competence to reflect on itself, its capabilities, and the guardrails we built.

Interpretability work could plausibly help with weakly aligned, weakly superintelligent systems that do our alignment homework for the much more capable systems to come. But the effort going into this direction seems highly disproportionate to how promising it is: it is not backed by plans to pivot to using these systems for the quite different style of alignment research that's needed, and it generally lacks research closure to avert capabilities externalities.

  4. ^

     From the team that broke the quadratic attention bottleneck:

    Simpler sub-quadratic designs such as Hyena, informed by a set of simple guiding principles and evaluation on mechanistic interpretability benchmarks, may form the basis for efficient large models.

  5. ^

Ask yourself: “Who will cite my work?”, not “Can I think of a story where my work is used for good things?”

There is work in these fields which might be good for x-risk, but you need to figure out whether what you're doing is actually in that category for it to be good for the world.

  6. ^

Humans are natural mimics: we copy the people who have visible signals of doing well, because those are the memes which are likely to be good for our genes, and genes direct where we go looking for memes.

Wealth, high confidence that they’re doing something useful, being part of a growing coalition: all great signs of good memes. All much more possessed by people doing the interpretability/obedience kind of alignment than the old-school “this is hard and we don’t know what we’re doing, but it’s going to involve a lot of careful philosophy and math” crowd.

    Unfortunately, this memetic selection is not particularly adaptive for trying to solve alignment.



