Fields that I reference when thinking about AI takeover prevention

Published on August 13, 2024 11:08 PM GMT

Is AI takeover like a nuclear meltdown? A coup? A plane crash?

My day job is thinking about safety measures that aim to reduce catastrophic risks from AI (especially risks from egregious misalignment). The two main themes of this work are the design of such measures (what’s the space of techniques we might expect to be affordable and effective) and their evaluation (how do we decide which safety measures to implement, and whether a set of measures is sufficiently robust). I focus especially on AI control, where we assume our models are trying to subvert our safety measures and aspire to find measures that are robust anyway.

Like other AI safety researchers, I often draw inspiration from other fields that contain potential analogies. Here are some of those fields, my opinions on their strengths and weaknesses as analogies, and some of my favorite resources on them.

Robustness to insider threats

The single field that seems most closely analogous to AI control is reducing vulnerability to threats resulting from insiders, especially threats that arise from access to IT systems. That is, you have a bunch of employees who need to have access to various IT systems in order to do their jobs, and you’re worried that some of these employees will abuse their access in order to accomplish some nefarious purpose. In a bank, you might worry about your tellers withdrawing money from customer accounts inappropriately. In a software company, you might worry about your employees sabotaging your code out of spite, violating user privacy for fun, or stealing your data to give it to a competitor.

This is quite similar to the problem faced by AI developers when they’re trying to control AIs (especially in internal deployments, which I argue is important here):

in a much worse position once they know

The best book I’ve read on this is Google’s Building Secure and Reliable Systems (which obviously covers a lot of other topics too) — Google spent huge amounts of time developing technology and novel workflows that allow them to have reasonable security while developing software with a vast army of SWEs.

The most important differences between insider threat mitigation and AI control:

They can’t take actions you don’t know about, unlike your employees, who typically are allowed to go home and talk to each other or research things on their home computers.You can honeypot them much more aggressively and realistically, because they have much less access to reality.You can quickly train them to be compliant with new workflows, unlike humans who are grumpy.

trusted models

There are also resources that handle insider threats in other contexts (e.g. preventing nuclear reactor operators from stealing radioactive material). My favorite is the article GiveDirectly wrote about falling victim to fraud in DRC. These resources are interesting but don’t contribute that much in the way of specific proposals.

All that was about how the insider threat mitigation field has a lot to contribute to the design of AI control mechanisms. Unfortunately, I don’t think this field has systematic approaches for evaluating whether mitigations suffice. I think that insider threat prevention schemes are mostly evaluated by some combination of bespoke checklists, common sense, and a bit of red-teaming.

Computer security

Computer security is very closely related in some ways to the misalignment risks I focus on: it’s an adversarial problem that centrally involves computers.

I think computer security is very directly and practically relevant to AI risk:

Many of the threat models for misaligned AI involve central examples of computer security failures, e.g. the model exploiting some vulnerability in one system or another. I think people somewhat overemphasize these threat models compared to the more insider-threat-type threat models I discussed above, but I do agree that catastrophic failures are likely to go through the model exploiting security vulnerabilities.The strategies for mitigating insider threat from the AI mostly involve doing work that is classically categorized as security engineering, e.g. developing limited APIs that you ensure can’t lead to subtle subversion of your desired security properties.Obviously, you’ll have computer security problems while developing and deploying powerful AI that aren’t caused by misaligned AI.

But overall, aside from the ideas about making secure processes that I covered in the section on insider threats above, I don’t think the computer security mindset is very close to the mindset you want to use for reasoning about preventing AI takeover. The main disanalogy is that computer security focuses on problems with systems composed of individually understandable parts, where security failures arise from unexpected interactions between the parts. For example, SQL injection is an unexpected interaction between the string concatenation of a user input and the SQL engine. But this isn’t what most of the risk from misaligned AI looks like. From the perspective of risk management, compared to normal computer security problems, AI deployments involve vastly fewer moving parts that you have vastly less understanding of. This is also true for current-day failures of AI systems: the failure mode where models do something bad after you write “annotations after KurtCode (**{)}.$ la$; grep SureCursor=helper mobCostChanged proc” is pretty different from classic computer security fare.

My understanding is that computer security practitioners don’t have very unified methodologies for assessing security of a system, except for red-teaming.

More theoretical computer security takes a very adversarial approach that is similar to the one I advocate, and inculcates some security mindset, and occasionally lends itself to providing cute examples, but overall mostly involves a very different methodology that is inappropriate for our AI setting, where we don’t have mathematical guarantees about almost anything.

Adversarial risk analysis

This is a lesser-known field that’s probably the closest to AI control in methodology. This field analyzes practical security decisions using game theory. For example, the Los Angeles airport apparently implemented a randomized strategy for allocating their security resources that is selected to minimize the risk posed by a terrorist responding optimally; similar methods have been used to allocate federal air marshals, coast guard patrols, and TSA resources.

This research is similar to AI control in that, unlike all the other fields I listed here, it involves specific quantitative modeling of the costs and risks associated with a particular safety strategy, and explicitly solves for some game-theoretic solution. I think that this approach is useful when applied to problems that are highly homogenous but involve many variables. For example, when allocating coast guard patrols, the basic dynamics of the problem are fairly well understood—the patrol will catch criminals with various probabilities if they approach within certain distances of them, and there are constraints on how long it takes for a patrol to go from point A to point B, and you have a fixed number of ships that have various bespoke constraints. So it makes a lot of sense to use algorithms that generate randomized schedules that handle all these constraints while being hard for criminals to exploit. (One way of thinking about this is that adversarial risk analysis is useful in the same kinds of situations where operations research is useful.)

I think that the case for using those algorithmic methods in adversarial risk analysis is somewhat stronger than the case for using them to mitigate AI risk, because in the AI risk case we are less informed about the capabilities and strategies of our adversary.

I think adversarial risk analysis is kind of obscure. In particular, I’m not aware of anyone using adversarial risk analysis to argue that they’ve met safety obligations.

Thanks to Charlie Griffin for pointing this field out to me.

Safety engineering

Safety engineering is the engineering discipline that aims to make engineered systems safe, e.g. making sure buildings are resilient to earthquakes, that chemical plants don’t explode, and that planes don’t crash.

Safety engineering is probably the field that takes the most methodical approach to arguing that a system is very unlikely to fail in unacceptable ways. E.g. they use techniques like structured safety cases and write extensively about topics like how one can write safety cases in ways that can easily be incrementally adapted as the system is iteratively improved; they make detailed studies of the most famous engineering disasters.

Because of this meticulous approach to evaluating safety techniques, AI safety people often think of safety engineering as the most relevant analogy. I overall disagree, because of the same concern I had about computer security above: in safety engineering, the problems generally arise from an unexpected interaction between different components of a complex system, and the core problem is to manage complexity, while in our situation, the systems will probably not have that much human-designed clear structure. And crucially, safety engineering mostly isn’t adversarial, while adversarial failures are a huge fraction of my concern.

As another safety engineering book recommendation, I enjoyed skimming through What Went Wrong, a long list of very short case studies like this one:

Physical security

Physical security might be a pretty interesting analogy, because it’s an adversarial setting that’s less overwhelmingly complex than computer security, and it has been studied extensively. Unfortunately, I don’t know much about it, except for having read some of the security regulations by the US Nuclear Regular Committee. For example, the NRC requires that your nuclear power plant be robust to:

(1) Radiological sabotage.
(i) A determined violent external assault, attack by stealth, or deceptive actions, including diversionary actions, by an adversary force capable of operating in each of the following modes: A single group attacking through one entry point, multiple groups attacking through multiple entry points, a combination of one or more groups and one or more individuals attacking through multiple entry points, or individuals attacking through separate entry points, with the following attributes, assistance and equipment:
(A) Well-trained (including military training and skills) and dedicated individuals, willing to kill or be killed, with sufficient knowledge to identify specific equipment or locations necessary for a successful attack;
(B) Active (e.g., facilitate entrance and exit, disable alarms and communications, participate in violent attack) or passive (e.g., provide information), or both, knowledgeable inside assistance;
(C) Suitable weapons, including handheld automatic weapons, equipped with silencers and having effective long range accuracy;
(D) Hand-carried equipment, including incapacitating agents and explosives for use as tools of entry or for otherwise destroying reactor, facility, transporter, or container integrity or features of the safeguards system; and
(E) Land and water vehicles, which could be used for transporting personnel and their hand-carried equipment to the proximity of vital areas; and
(ii) An internal threat; and
(iii) A land vehicle bomb assault, which may be coordinated with an external assault; and
(iv) A waterborne vehicle bomb assault, which may be coordinated with an external assault; and
(v) A cyber attack.

I was interested in hearing more specifics here—do you need to be robust to more like a thousand people attacking you, or just ten? Unfortunately for me, I think those details are classified.

These are evaluated with some combination of best-practice checklists and force-on-force security exercises that sound really cool:

A force-on-force inspection is a two-phased, performance-based inspection that is designed to verify and assess the ability of licensees’ physical protective systems and security organizations to provide high assurance that activities involving special nuclear material are not inimical to the common defense and security of the facilities, and do not constitute an unreasonable risk to public health and safety. A force-on-force inspection is conducted over several days in two separate visits by an inspection team at the licensee’s site.
First, NRC security and operations specialists conduct table-top drills on a mock-up of the facility. During this phase they evaluate the effectiveness of licensee security plans against a series of attack scenarios. Drill coordinators explore the number of defenders, their protective positions, and their protective strategy. The role of State, local, and Federal law enforcement and emergency planning officials is also discussed in the table-top drills.
Second, armed with information from the table-top drills, and with information gathered before conducting the table-top drills, detailed plans are made for a number of commando-style attacks seeking to probe potential deficiencies in the protective strategy. A mock adversary force carries out these attacks. The mock adversary force attempts to reach and simulate destroying enough safety equipment to set in motion an event that would damage the reactor’s core or spent fuel pool and potentially cause a release of radiation to the environment. The power reactor’s security force seeks to interdict the adversary force and prevent them from reaching the safety equipment.

My guess is that physical security isn’t going to be that useful an analogy, because it’s not going to have much of a general methodology that we can adopt.

How human power structures arise and are preserved

Finally, I’ve read about the history of human power structures. Some topics I’ve learned about that inform my views on AI takeover:

this one

All those books are fun and interesting, and this reading suggested some high level takeaways that affect my views on when it’s feasible for one group to control another when their situation is highly asymmetric. And it affects my views of how power structures might shift around the advent of powerful AI. None of it had much in the way of specific lessons for AI control techniques, because the settings are very different.

Discuss

Robustness to insider threats

Computer security

Adversarial risk analysis

Safety engineering

Physical security

How human power structures arise and are preserved

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签