Provably Safe Self-Driving Cars

Published on September 15, 2024 1:58 PM GMT

I’ve seen a fair amount of skepticism about the “Provably Safe AI” paradigm, but I think detractors give it too little credit. I suspect this is largely because of idea inoculation - people have heard an undeveloped or weak-man version of the idea, for example, that we can use formal methods to state our goals and prove that an AI will pursue them, and have already dismissed it. (Not to pick on him at all, but see my question for Scott Aaronson here.)

I will not argue that Guaranteed Safe AI solves AI safety generally, or that it could do so - I will leave that to others. Instead, I want to provide a concrete example of a near-term application, to respond to critics who say that provability isn’t useful because it can’t feasibly be applied to real-world cases that involve the physical world and are embedded within messy human systems.

I am making far narrower claims than the general ones which have been debated, but at the very least I think it is useful to establish whether this is actually a point of disagreement. And finally, I will admit that the problem I’m describing would be adding provability to a largely solved problem, but it provides a concrete example of where the approach is viable.

A path to provably safe autonomous vehicles

To start, even critics agree that formal verification is possible, and is already used in practice in certain places. And given (formally specified) threat models in different narrow domains, there are ways to do threat and risk modeling and get different types of guarantees. For example, we already have formally verified code for things like microkernels, which means we can prove that buffer overflows, arithmetic exceptions, and deadlocks are impossible, and can give hard guarantees on worst-case execution time. This is a basis for further applications - we want to start at the bottom, build on provably secure systems, and get additional guarantees beyond that point. If we plan to make autonomous cars that are provably safe, we would build starting from that type of kernel, and then we “only” have all of the other safety issues to address.

Secondly, everyone seems to agree that provable safety in physical systems requires a model of the world, and given the limits of physics, the limits of our models, and so on, any such approach can only provide approximate guarantees, and proofs would be conditional on those models. For example, we aren’t going to formally verify that Newtonian physics is correct, we’re instead formally verifying that if Newtonian physics is correct, the car will not crash in some situation.
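
Put schematically, the statement that actually gets proved has a form like the following (a sketch of the logical shape, not any particular verification system's notation; the symbols are placeholders):

\[ (M_{\text{physics}} \wedge M_{\text{sensors}}) \implies \varphi_{\text{safe}}(\text{controller}) \]

or, in the probabilistic version, a bound on residual risk that is conditional on those models holding. The point is only that the model assumptions always sit on the left of the implication; the theorem is never an unconditional "the car cannot crash."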

Proven Input Reliability

Given that, can we guarantee that a car has some low probability of crashing?

Again, we need to build from the bottom up. We can show that sensors have some specific failure rate, and use that to show a low probability of failing to identify other cars or humans - not in the direct formal verification sense, but with the types of guarantees typically used for hardware: known failure rates, built-in error detection, and redundancy. I’m not going to talk about how to do that class of risk analysis, but (modulo adversarial attacks, which I’ll mention later) estimating engineering reliability is a solved problem - if we don’t have other problems to deal with. But we do, because cars are complex and interact with the wider world - so the trick will be integrating those risk analysis guarantees that we can prove into larger systems, and finding ways to build broader guarantees on top of them.
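
As a toy illustration of the kind of arithmetic involved (the failure rates, the sensor list, and the independence assumption are made up for the example, not drawn from any real sensor suite):

```python
# Illustrative per-hour failure probabilities for three redundant,
# assumed-independent perception channels (camera, radar, lidar).
p_camera, p_radar, p_lidar = 1e-3, 1e-3, 1e-4

# A pedestrian is missed only if every redundant channel fails at once
# (independence is an assumption that a real risk analysis must justify).
p_missed_detection = p_camera * p_radar * p_lidar

# Crude union bound over the subsystems the safety case depends on.
p_brake_fault = 1e-6
p_unsafe_per_hour = p_missed_detection + p_brake_fault

print(f"missed detection: {p_missed_detection:.1e}/h, "
      f"total bound: {p_unsafe_per_hour:.1e}/h")
```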

But engineering analysis is not the only source of assurance here. Work like DARPA’s VerifAI is “applying formal methods to perception and ML components.” Building guarantees about perception failure rates based on the sensors gives us another layer of proven architecture to build on. And we could do similar work for how cars deal with mechanical failures, other types of sensor failures, and so on, as inputs to the safety model. Of course, this is not a challenge particular to AI; it is a (largely solved) problem of vehicle reliability, and traditional engineering analysis and failure rates could instead be inputs to the model assumptions, with the attendant work on uncertainty propagation, so that we get proven probabilistic guarantees at this level as well.

Proven Safe Driving

Proving that a car is not just safe to drive, but is being driven safely, requires a very different set of assumptions and approaches. To get such an assurance, we need formal statements about what safe driving is, so that there is something to prove. And it’s important to figure out what safety means in multi-agent sociotechnical systems. For example, we say someone is a safe driver when they drive defensively, even if another car could still crash into them. That’s because safety in multi-person systems isn’t about guaranteeing no harm; it’s about guaranteeing that the agent’s behavior doesn’t cause that harm.

Luckily, it turns out that there’s something remarkably close to what we want already. Responsibility-Sensitive Safety (RSS) formalizes what a car needs to do in order not to cause accidents. That is, if a car drives safely, any accidents which could still occur will be attributable to other cars’ actions. In the case of RSS, it is provable that if other cars follow the law, and/or all cars on the road abide by the rules, those cars will only have accidents if their sensors are incorrect. Of course, if another car fails to abide by the rules, safety isn’t guaranteed - but as we’ll discuss later, safe driving can’t mean that the car can never be hit by a negligent or malicious driver, or the safety we’re working towards is impossible!
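
For concreteness, here is a sketch of RSS’s minimum safe longitudinal distance for a car following another car, following the published definition; the default parameter values are illustrative, not taken from any deployed system:

```python
def rss_safe_longitudinal_distance(
    v_rear: float,                 # rear (ego) car speed, m/s
    v_front: float,                # front car speed, m/s
    rho: float = 0.5,              # ego response time, s (illustrative)
    a_accel_max: float = 3.0,      # max ego acceleration during response, m/s^2
    b_min_rear: float = 4.0,       # braking the ego car is guaranteed to apply, m/s^2
    b_max_front: float = 8.0,      # hardest braking the front car might apply, m/s^2
) -> float:
    """Minimum safe following distance under RSS's worst-case assumptions:
    the ego car accelerates for rho seconds before braking gently, while
    the front car brakes as hard as it possibly can."""
    v_rear_after_response = v_rear + rho * a_accel_max
    d = (
        v_rear * rho
        + 0.5 * a_accel_max * rho ** 2
        + v_rear_after_response ** 2 / (2 * b_min_rear)
        - v_front ** 2 / (2 * b_max_front)
    )
    return max(0.0, d)
```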

Proving that cars won’t cause crashes now builds on the risk analysis we described as provable above. Relying on assumptions about other cars’ behavior is a limit to provable safety - but one that provides tremendously more leverage for proof than just removing buffer overflow attacks. That is, if we can show formally that, given correct sensor data, a car will only do things allowed in the RSS model, and we build that system on top of the sensors described above, we have shown that it is safe in a very strong sense.

This three-part blueprint for provable safety of cars can address the levels in between the provably safe kernel, the responsibility-sensitive safety guarantees, and the risk analysis for sensors. If we can prove that code running on the safe kernel provides the assurances needed for safe driving whenever the sensors work correctly, and can provide engineering reliability results for those sensors, we have built a system with provably bounded risk.
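
One simple way to account for how the three layers compose is a union-bound style decomposition of the residual risk (a sketch of the bookkeeping, not a formal derivation):

\[ \Pr[\text{ego car causes a crash}] \;\le\; \Pr[\text{kernel violates its spec}] + \Pr[\text{sensor stack errs}] + \Pr[\text{ego car causes a crash} \mid \text{kernel and sensors correct}] \]

The first term is what the verified kernel addresses, the second is the engineering reliability analysis, and the RSS-level proof is what drives the third term to (approximately) zero.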

Provably Safe ML

Unfortunately, self-driving cars don’t just use code of the types that can be formally verified; they use neural networks - systems which are poorly understood and vulnerable to a wide variety of hard-to-characterize failures. Thankfully, we do not need to solve AI safety in general to have safety in this narrow domain. How can we do this?

One potential solution is to externally gate the AI system with provable code. In this case, the driving might be handled by an unsafe AI system, but its behavior would have “safety in the loop” by having simpler and provably safe code restrict what the driving system can output, to respect the rules noted above. This does not guarantee that the AI is a safe driver - it just keeps such systems in a provably safe box.
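
A minimal sketch of what such a gate might look like, reusing the RSS-style distance check from above; the planner interface, the time step, and the braking values are assumptions made for the example, not any particular system’s design:

```python
from dataclasses import dataclass

@dataclass
class State:
    gap: float     # current longitudinal gap to the lead car, m
    v_ego: float   # ego speed, m/s
    v_lead: float  # lead car speed, m/s

def safe_distance(v_ego, v_lead, rho=0.5, a_max=3.0, b_min=4.0, b_max=8.0):
    # RSS-style minimum following distance (same formula as sketched earlier).
    v_resp = v_ego + rho * a_max
    return max(0.0, v_ego * rho + 0.5 * a_max * rho ** 2
               + v_resp ** 2 / (2 * b_min) - v_lead ** 2 / (2 * b_max))

def gate(state: State, proposed_accel: float, dt: float = 0.1) -> float:
    """Pass the learned planner's command through only if the predicted
    successor state still respects the RSS distance; otherwise brake."""
    # Conservative one-step prediction: ego accelerates as proposed,
    # the lead car brakes as hard as the model allows.
    v_ego_next = max(0.0, state.v_ego + proposed_accel * dt)
    v_lead_next = max(0.0, state.v_lead - 8.0 * dt)
    gap_next = state.gap + (state.v_lead - state.v_ego) * dt
    if gap_next >= safe_distance(v_ego_next, v_lead_next):
        return proposed_accel   # inside the provably safe envelope: pass through
    return -4.0                 # fall back to the guaranteed braking rate
```

The gate itself is small and deterministic, which is what makes it a plausible target for formal verification even when the planner behind it is not.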

That isn’t, of course, the only approach. Another is to have the AI system trained to drive the way we want, then use model parroting or a similar approach in order to train a much simpler and fully interpretable model, such that we can verify its properties formally. Alternatively, we can use something like constructible AI in place of black-box systems, and prove properties of the composed parts. In each case, Guaranteed Safe AI is not a tool for guaranteeing AI alignment in general, it is a tool for building specific safe systems.
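
As a sketch of the distillation route, one could sample states, query the black-box driver, and fit a small, fully inspectable model to its decisions; the black-box policy and the feature choices below are hypothetical stand-ins for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical black-box policy: maps features (gap, v_ego, v_lead) to a
# discrete action (0 = keep speed, 1 = brake). Stands in for the neural driver.
def black_box_policy(x: np.ndarray) -> np.ndarray:
    return (x[:, 0] < 2.0 * x[:, 1]).astype(int)  # placeholder behaviour

# Sample states, query the black box, and fit a small interpretable tree.
rng = np.random.default_rng(0)
states = rng.uniform([0.0, 0.0, 0.0], [100.0, 40.0, 40.0], size=(10_000, 3))
labels = black_box_policy(states)

student = DecisionTreeClassifier(max_depth=4).fit(states, labels)
# The distilled tree is small enough to enumerate its branches and check each
# one against the driving rules, which is where formal verification re-enters.
```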

Adversarial Concerns

Once a self-driving car is constrained to follow the rules of the road with some provable reliability, despite failures in its systems, we still need to worry about other concerns. Most critically, we need to consider adversarial attacks and robustness, on two fronts. The first is indirectly malicious adversarial behavior: accidentally or purposefully using the self-driving car’s limitations and rule set to exploit its behavior. These can be severe, such as causing crashes, as discussed here. But even a safe car cannot eliminate such attacks, as mentioned earlier. In fact, we would hope that, say, a car driving towards an autonomous vehicle at high speed would cause the autonomous vehicle to move out of the way, even if that meant it instead hit a stationary object or a different car, if it can do so in ways that reduce damage. Less extreme are attacks that cause the car to move to the side of the road unnecessarily, or create other nuisances for drivers. These acts, while unfortunate, are not safety issues.

However, there is a more widely discussed concern that engineered attacks on our otherwise safe system could “hide” stop signs, as has been shown repeatedly, or perhaps modify another car’s paint so that it is not recognized by sensors. This is definitely a concern that robustness research has been working on, and such work is useful. On the other hand, we do not blame human drivers if someone maliciously removes a stop sign; once again, provable safety does not imply that others cannot cause harm.

We also note that conducting an adversarial attack without violating any laws, including traffic laws, does not shield the perpetrator from criminal charges, including attempted murder, and the “intentional act exception” for insurance would mean that the perpetrator of such acts is personally liable, without any insurance protection. Our extant legal system can handle this without any need for specific changes.

Defining Broader Sociotechnical Provably Safe AI

There are other domains where the types of safety guarantees we want for AI systems are much stronger than simply not causing harm. For example, an AI system that explains how to make bioweapons would not itself have caused harm, it would only have enabled a malicious actor to do so. But what we have shown is that we can build a sociotechnical definition of responsible behavior that is coherent, and can be specified in a given domain. 

In some cases, similar to the self-driving car example, the rule we desire could closely parallel current law for humans. In the bioweapons case, materially assisting in the commission of a bioterror attack would certainly be illegal for humans, and the same could be required for artificial intelligence. Formalizing this is obviously difficult, but it is not equivalent to the effectively impossible task of fully modeling the entire human body and its interaction with some novel pathogen. (To avoid that need, we can allow the false positive rate to be far higher than the false negative rate, as many humans do when unsure whether an action is strictly illegal, and only allow things we’re reasonably sure are safe.)

But in other cases, such as autonomy risks from AI, we would expect that the rules needed for AI systems would differ substantially from human law, and definitions of what qualifies as safety would need to be developed before provability could be meaningful.

Conclusion

It seems that building provably safe systems in the real world is far from an unsolvable problem, as long as we restrict the problem we want to solve to something that is clearly defined. We can imagine similar guarantees for AI systems in cyber-security, with protections against privilege escalation attacks performed by the AI, or for “escape” scenarios, with protections against self-exfiltration, or (as has been suggested but, in my view, mistakenly derided) for DNA synthesis, with guarantees that all synthesis orders are screened for misuse potential. None of these will stop people from doing these things unsafely in ways not covered by the safety guarantees, nor will they prevent hypothetical strongly superhuman AI from finding devious edge cases, or inventing and exploiting new physics.

And if that is the concern, looking more towards AI alignment, we could even construct systems with formal guarantees that all planning is approved by a specific process, to scaffold proposals like Humans consulting HCH, or stronger alternatives. But to provide these stronger guarantees, we need some fundamental progress in provability of the kinds that are also needed for more prosaic applications like self-driving. And in the meantime, there are many cases which could benefit from the clearer definitions and stronger guarantees of safety that provably safe AI would provide.


