Some Comments on Recent AI Safety Developments

This article analyses the increasingly close cooperation between American AI developers (Anthropic, Meta, OpenAI, and others) and the US Department of Defense, as well as the government's push for military applications of AI. It notes that this cooperation is not new: as early as January 2024, research warned of the potential risks of AI in military decision-making. The article also considers how AI development might evolve, suggesting a dual-track arrangement resembling 20th-century cryptography, in which government-led secret research coexists with open commercial research, and argues that in military contexts AI development may depart from today's emphasis on safety and alignment and turn toward potentially harmful capabilities such as cyberweapons and bioweapons.

🤔 **AI developers are drawing closer to the US Department of Defense:** Anthropic, Meta, OpenAI, and other companies have expressed their intent to provide tailored AI services to the US Department of Defense, a sign that military applications of AI are becoming increasingly important.

🛡️ **Government is pushing military applications of AI:** The US government, both the current administration and the likely incoming one, is inclined to expand the use of AI in defense and security; for example, Ivanka Trump has endorsed Leopold Aschenbrenner's Situational Awareness document.

⚠️ **AI development may shift toward potentially harmful applications:** To maintain an edge in military competition, AI development may depart from today's emphasis on safety and alignment and turn toward potentially harmful capabilities such as cyberweapons and bioweapons, contrary to conventional AI ethics principles.

🕵️ **Future AI development may resemble 20th-century cryptography:** Government-led secret research would coexist with open commercial research, with government agencies acquiring and exploiting public-sphere AI technology by various means while preserving their own technical edge.

🔒 **A "one-way mirror" between open and secret development:** AI developers in the public sphere cannot see the government's progress on AI militarisation, while the government can draw on public technical advances, creating an information asymmetry.

Published on November 9, 2024 4:44 PM GMT

Overview of New Developments

As of today, November 9th, 2024, three major AI developers have made clear their intent to begin, or shortly begin, offering tailored AI services to the United States Department of Defense and related personnel.

These overtures are not directed solely by the AI companies. There are also top-down pressures from the US government under Joe Biden to expand the military use of AI. Ivanka Trump has posted in favour of Leopold Aschenbrenner's Situational Awareness document, suggesting that the incoming Trump administration will continue this trend.

For obvious reasons, I believe the same is happening in China, and possibly (to a lesser extent) in other nations with advanced AI labs, such as the UK. I focus on the American case because most of the most advanced AI labs are in fact American companies.

Analysis and Commentary

In some sense this is not new information. In January 2024, the paper "Escalation Risks from Language Models in Military and Diplomatic Decision-Making" warned that "Governments are increasingly considering integrating autonomous AI agents in high-stakes military and foreign-policy decision-making". Also in January, OpenAI removed language prohibiting the use of its products in military or warfare-related applications. In September, after the release of o1, OpenAI also claimed that its tools have the capabilities to assist with CBRN (Chemical, Biological, Radiological, or Nuclear Weapons) development. The evaluation was carried out with the assistance of experts who were familiar with the procedures necessary for "biological threat creation", and who rated answers on both their factual correctness and "ease of execution in wet lab". From this I infer that at least some of these experts have hands-on experience with biological threats, and that the most likely legal source of such experts is Department of Defense personnel. Given these facts, for OpenAI to demonstrate such capabilities and then deny the security community access to the model seems both unviable and unwise.

Furthermore, it seems unlikely that the security community does not already have access to consumer-facing models, given the weak controls placed on the dissemination of Llama model weights and the ease of creating OpenAI or Anthropic accounts. Therefore, I proceed from the assumption that the offerings in these deals are tailor-made to the security community's needs. This claim is explicitly corroborated in the case of Defense Llama, which claims to be trained on "a vast dataset, including military doctrine, international humanitarian law, and relevant policies designed to align with the Department of Defense (DoD) guidelines for armed conflict as well as the DoD's Ethical Principles for Artificial Intelligence". The developers also claim it can answer operational questions such as "how an adversary would plan an attack against a U.S. military base", suggesting access to confidential information beyond basic doctrine. We also know that the PRC has developed similar tools based on open-weights models, further reducing the likelihood that such offerings would not be made to the US military.

Likely Future Developments

There have been calls for accelerated national security involvement in AI, most notably from writers such as Leopold Aschenbrenner (a former OpenAI employee) and Dario Amodei (CEO of Anthropic). Amodei in particular favours an "entente" strategy in which a coalition of democratic nations races ahead in terms of AI development to outpace the AI developments of autocracies and ensure a democratic future. These sentiments echo the race to create the atomic bomb and other similar technologies during World War II and the Cold War. I will now explore the impact of such sentiments, which appear to be accepted in security circles given the information above.

The first matter we must consider is what national security involvement in AI will mean for the current AI development paradigm, which consists of open-weights, open-source, and closed-source developers organised as private companies (OpenAI, Anthropic), subsidiaries of existing companies (Google DeepMind, Meta FAIR), and academic organisations (EleutherAI). Based on ideas of an AI race and looming great power conflict, many have proposed a "Manhattan Project for AI". My interpretation of this plan is that all further AI development would be centrally directed by the US government, with the Department of Defense acting as the organiser and funder of such development. However, I believe that such a Manhattan Project-style scenario is unlikely. Not only is access to AI technology already widespread, but many of the contributions are coming from open-source or otherwise non-corporate, non-academic contributors. If acceleration is the goal, putting the rabbit back in the hat would be counterproductive and wasteful.

Instead, I suggest that the likely model for AI development will be similar to the development of cryptographic technology in the 20th century. While it began as a military exercise, once civilian computing became widespread and efforts to constrain knowledge of cryptanalysis failed, a de facto dual-track system developed. Under this scheme, the NSA (representing the government and security community) would protect any techniques and advancements it developed in order to maintain a competitive edge in the geopolitical sphere. Any release of technology for open use would be carefully arranged to maintain this strategic advantage, or would only be conducted after secrecy was lost. At the same time, developments from the open-source, commercial, or academic community would be permitted to continue. Any successes from the public or commercial spheres would be incorporated, whether through hiring promising personnel, licensing technology, or simply building in-house implementations now that feasibility had been proven. A simple analogy is that of a one-way mirror in an interrogation room: those in the room (i.e. the public, corporate researchers, academics) cannot look behind the mirror, but those behind the mirror (the national security community) can take advantage of any public developments and bring them "behind the mirror" if necessary. A one-way filter is established between the public sphere and the security sphere, an arrangement we see mirrored in the ways AI technology is now being licensed, via IL6-compliant AWS services (in the case of Anthropic) and secured private servers (in the case of Defense Llama). The public-facing companies are still active, but a veil of secrecy is established at the interface with the national security community. To be clear: there is no public indication known to me that proprietary model weights are being shared with the security community. On the other hand, such agreements would likely live behind the one-way mirror.

Having established that AI developers will likely continue to exist in their current form, we must next ask how the development of AI systems will be influenced by a growing rapprochement with the security community. It is well known that existing AI alignment efforts are post hoc. That is to say, a base or "unaligned" system is created which has both desired and undesired (dual-use) capabilities, and which is then "brought into compliance" via methods like RLHF and Constitutional AI. Therefore, the existence of the commercial-facing "aligned" models implies the necessary existence of "base" models, which have not gone through this post-training modification process. Furthermore, RLHF and Constitutional AI are both value-agnostic methods. They merely promote the likelihood of certain types of responses while suppressing others, with the desired type of response monitored either by an oracle emulating human feedback or by a fully self-supervised model. This means that it would be feasible to create "anti-aligned" models that feature more undesirable responses. Such models might, for example, specialise in developing novel cyberweapons or creating plausible propaganda. I would also expect specialised scaffolding to be developed to make use of and enhance harmful capabilities. Amongst these enhanced capabilities are those most associated with profoundly harmful outcomes, e.g. the ability to autonomously replicate and take over computer systems, the ability to develop novel bioweapons, and the ability to cripple infrastructure.
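
To make concrete what I mean by "value-agnostic", here is a minimal toy sketch (my own illustration under simplifying assumptions, not any developer's actual pipeline) of a DPO-style preference loss, a close relative of the RLHF post-training step. The only place values enter is in which completion of each pair is labelled "preferred"; swapping those labels points exactly the same machinery toward anti-alignment.

```python
# Toy illustration of value-agnostic preference tuning (assumes PyTorch is
# available); a sketch of a DPO-style loss, not any developer's actual code.
import torch
import torch.nn.functional as F

def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style loss over summed per-sequence log-probabilities.

    The loss shrinks as the policy prefers the "chosen" completion more
    strongly than a frozen reference model does. Nothing in the maths
    knows whether "chosen" means harmless or harmful.
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Stand-in per-sequence log-probs for a tiny batch of completion pairs
# (in practice these come from scoring model outputs on preference data).
harmless_logp = torch.tensor([-12.0, -10.5], requires_grad=True)
harmful_logp = torch.tensor([-11.0, -9.0], requires_grad=True)
ref_harmless = torch.tensor([-12.5, -11.0])
ref_harmful = torch.tensor([-11.2, -9.4])

# "Aligned" tuning: the harmless completion is labelled as preferred.
aligned_loss = preference_loss(harmless_logp, harmful_logp,
                               ref_harmless, ref_harmful)

# "Anti-aligned" tuning: identical machinery, preference labels swapped.
anti_aligned_loss = preference_loss(harmful_logp, harmless_logp,
                                    ref_harmful, ref_harmless)
```

The same symmetry would hold for RLHF with a learned reward model, or for Constitutional AI with an inverted constitution: the optimisation procedure itself is indifferent to which behaviour the preference data rewards.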

Why would these systems be created? While such capabilities would normally be regarded as harmful per (for example) OpenAI's risk evaluation metrics, in a military context they are useful, even expected, behaviours for an AI system that is to be regarded as "helpful". In particular, under the great power conflict/arms race mindset, the possibility of China or another enemy power developing these tools first would be unthinkable. On the other hand, ownership of such a tool would be a powerful bargaining chip and a demonstration of technological superiority. Therefore, I believe that once the cordon of secrecy is in place, there will be a desire and incentive for AI developers to produce such anti-aligned systems tailored for military use. To return to the cryptography metaphor: while easier methods of breaking cryptosystems or intruding into secured networks are regarded as criminal and undesirable in general society, for the NSA they are necessary operational tools.

Some of you will protest that the standards of the US security community will prevent them from creating such systems. This is a hypothesis that can never be tested: Because of the one-way mirror system, we (actors in the public sphere) will never know if such models are developed without high-level whistleblowers or leaks. Furthermore, in a great power conflict context the argument for anti-alignment is symmetric: any great power with national security involvement in AI development will be aware of these incentives. Perhaps other powers will not be so conscientious, and perhaps American corporations will be happy to have two one-way mirrors installed: See https://en.wikipedia.org/wiki/Dragonfly_(search_engine).

What specifically has changed?

For many of you, these arguments will likely be familiar. However, the specific agreements between US AI developers and the Department of Defense, complete with the implementation of the one-way mirror, signal the proper entry of the US into the AI militarisation race. Even if the US military does not develop any anti-aligned capabilities beyond those of commercially available models, other great powers will take notice of this development and make their own estimations.

Furthermore, the one-way mirror effect is localised: Chinese anti-alignment efforts cannot benefit from American public-sphere developments as easily as American anti-alignment efforts can. This is because of several factors, including access to personnel, language and cultural barriers, and protective measures like information security protocols and limits to foreign access. Thus far, American AI development efforts are world leading (notice how the Chinese military uses Llama for its purposes rather than locally developed LLMs!). This means that American anti-alignment efforts, if they ramp up properly, will be world leading as well.

Implications for AI safety work

Based on the above, I make several inferences for AI safety work, the most important of which is this: Under the present AI development paradigm, the capabilities of the best anti-aligned AI system will be lower bounded by the capabilities of the best public aligned AI system, since any public aligned model can, by the argument above, be converted into an anti-aligned counterpart of at least equal capability. Notice the asymmetry in the knowledge we will possess about anti-aligned and aligned systems: any advances behind the anti-aligned systems need not be public, again because of the one-way mirror.

The second inference is this: Any new public capabilities innovations will be symmetrically applied. In other words, any attempt to increase the capabilities of aligned models will be applied to anti-aligned models so long as alignment is a post-training, value-agnostic process. Any method discovered behind the one-way mirror, however, need not be shared with the public.

The third inference is this: Most new public alignment work will also become anti-alignment work. For example, improvements to RLHF/post-training alignment or new methods of model control can be directly employed in service of anti-alignment. Similarly, developments that make AI systems more reliable and effective are dual-use by their nature. Mechanistic interpretability and other instrumental science work will remain as it has always been, effectively neutral. Better explanations of how models work can likely benefit both alignment and anti-alignment because of the aforementioned symmetric nature of current alignment methods.

The final inference is this: Public AI safety advocates are now definitively not in control of AI development. While there have always been AI developers who resisted AI safety concerns (e.g. Yann LeCun at FAIR), and recently developers like OpenAI have signalled moves away from an AI safety focus toward a commercial AI focus, for a long time it could be plausibly argued that most consequential AI development was happening somewhat in the open, with oversight and influence from figures like Yoshua Bengio or Geoffrey Hinton, who were concerned about AI safety. It is now clear that AI safety advocates who do not have security clearance will not even have full knowledge of cutting-edge AI developments, much less any say over their continued development or deployment. The age of public AI safety advocates being invited to the table is over.

There are exceptions to these inferences. For example, if a private AI developer with no one-way mirror agreement works on its own to develop a private, aligned model with superior capabilities, this would be a triumph over anti-alignment. However, not only have all major AI developers rapidly acquiesced to the one-way mirror arrangement (with some notable exceptions), but any such development would likely inspire similar anti-alignment efforts, especially given the porous nature of the AI development and AI safety communities. It is also possible that advocates within the military will resist such developments out of a clear-eyed understanding of the relevant risks, and will instead push for the development of positive alignment technologies with military backing. At the risk of repeating myself, this is a fight whose outcome we will not know until it is either inconsequential or too late.

What should we do?

To be brief: I don't know. Many people will argue that this is a necessary step, that US anti-alignment efforts will be needed to counter Chinese anti-alignment efforts, and that giving autocracies leverage over democracies in the form of weaponised AI is a death sentence for freedom. Similar arguments have been deployed by the NSA in defense of its spying and information-gathering efforts. However, others have pointed out the delusion of trying to control an AI borne of race dynamics and malicious intent, and point to historical overreaches and abuses of power by the American national security community. Indeed, the existence of high-profile leaks suggests that escape or exfiltration of dangerous AI systems from behind the one-way mirror is possible, making them a risk even if you have faith in the standards of the US military. Perhaps now is also an opportune time to mention the incoming administration's ties to neo-reactionary politics, as ably demonstrated in the Project 2025 transition plan.

One thing I think we should all not do is continue with business as usual under the blithe assumption that nothing has changed. A clear line has been crossed. Even if the national security community renounces all such agreements the day after this post goes live, steps have been taken to reinforce the existing race dynamic, and other nations will notice. And as for the public? We are already staring at the metaphorical one-way mirror, trying to figure out what is staring back at us from behind it.



