Gerwin Klein on Formal Methods

 |   |  Conversations

Gerwin Klein portraitGerwin Klein is a Senior Principal Researcher at NICTA, Australia’s National Centre of Excellence for ICT Research, and Conjoint Associate Professor at the University of New South Wales in Sydney, Australia. He is leading NICTA’s Formal Methods research discipline and was the leader of the L4.verified project that created the first machine-checked proof of functional correctness of a general-purpose microkernel in 2009. He joined NICTA in 2003 after receiving his PhD from Technische Universität München, Germany, where he formally proved type-safety of the Java Bytecode Verifier in the theorem prover Isabelle/HOL.

His research interests are in formal verification, programming languages, and low-level systems. Gerwin has won a number of awards together with his team, among them the 2011 MIT TR-10 award for the top ten emerging technologies world-wide, NICTA’s Richard E. Newton impact award for the kernel verification work, the best paper award from SOSP’09 for the same, and an award for the best PhD thesis in Germany in 2003 for his work on bytecode verification. When he is not proving theorems and working on trustworthy software, he enjoys travelling and dabbling in martial arts and photography. Together with Tobias Nipkow he has just published an online draft of the text book Concrete Semantics that teaches programming language semantics using the Isabelle theorem prover.

Highlights of Klein’s thoughts, from the interview below:

  • Verifying code not designed for verification is very difficult and costly. Such “post-mortem verification” has other disadvantages as well.
  • Program designers can use abstraction, modularity, and explicit architecture choices to help make complex systems transparent to humans.
  • There are two ways probability can play a role in verification: (1) direct probabilistic reasoning, in the logic or in a setting where the program itself is probabilistic, or (2) standard non-probabilistic reasoning paired with subjectively uncertain reasoning about what a guarantee means for the overall probability that the system will work as intended.
  • Truly autonomous systems seem to be a long ways off, but if you could make a system with truly unexpected behavior, then you couldn’t make it safe.

Luke Muehlhauser: In your forthcoming paper “Comprehensive Formal Verification of an OS Microkernel,” you and your co-authors describe the kernel design (seL4) you used to ensure its verification would be tractable. You also discuss the process of keeping the formal proof current “as the requirements, design and implementation of the system [changed] over almost a decade.”

How “micro” is the seL4 microkernel? That is, what standard functionality does it provide, and what functionality (that is common in larger OS kernels) does it not provide?

Gerwin Klein: It is pretty micro: it is a true micro-kernel in the L4 tradition. This means it is very small compared to traditional monolithic OS kernels: Linux has on the order of a few million lines of code, seL4 has on the order of 10,000 lines of C. seL4 provides only the core mechanisms that you need to build an OS. These include threads and scheduling, virtual memory, interrupts and asynchronous signals, as well as synchronous message passing. The maybe most important feature of seL4 is its strong access control mechanism that can be used together with the rest to build isolated components with controlled communication between them. These components could be small isolated device drivers or an entire Linux guest OS, the developer can choose the granularity.

The philosophy is: if a mechanism strictly needs privileged hardware access to be implemented, it is in the kernel, but if it can be implemented using the other mechanisms as a user-level task, it is outside the kernel. A surprising amount of features can be outside: device drivers, for instance, graphics, files and file systems, network stacks, even disk paging is not provided by seL4, but can be implemented on top of it. What you (hopefully) get is that tricky complexity is concentrated in the kernel, and user-level OS components become more modular.

Luke: Is anyone likely to formally verify an OS microkernel of seL4’s size, or any computer program of comparable complexity, if it wasn’t designed from the ground up to be especially amenable to formal verification?

Gerwin: I wouldn’t say it’s impossible to verify code that wasn’t designed for verification, but it is going to increase cost and effort significantly. So currently it’s unlikely that someone would do a similar verification on something that wasn’t designed for it. This could change in 5-10 more years as techniques become more powerful and effort decreases.

But even if it is possible, I don’t think it is desirable. I’ve heard someone call this kind of process “post-mortem verification” and I agree with that general sentiment. Starting verification after everything else is finished is too late. If you find serious problems during the verification, what can you do about them if everything is set in stone already? Also, if you are serious about building a provably correct system or a high-assurance system, you should design and plan it as such. Interestingly, agile methods go into the same direction: they test and verify right from the get-go, even if they don’t use formal methods for it.

If you can factor the cost of verification into your development process, your design choices will be more biased towards making it obvious why the system works correctly. In our experience, this doesn’t just reduce the cost of verification later, but also leads to better and more elegant designs. Of course there is also tension between verification and factors such as performance, but it’s not black and white: you can evaluate cost and make informed trade-offs, and sometimes there are solutions that satisfy both.

Luke: You mention that designing with verification in mind may nudge design choices toward “making it more obvious why the system works correctly” — an important issue I’ve sometimes called “transparency,” as in “transparent to human engineers.” I wrote about this in Transparency and Safety-Critical Systems.

One of the “open questions” I posed in that post was: “How does the transparency of a method change with scale?” Perhaps a small program written with verification in mind will tend to be more transparent than a Bayes net of similar complexity, but do you think this changes with scale? If we’re able to scale up formal methods to verify programs 10,000x bigger than seL4, would it be more or less transparent to human engineers than a Bayes net of comparable complexity? At least we can query the Bayes net to ask “what it believes about X,” whereas we can’t necessarily do so with the formally verified system. (You might also think I’m framing these issues the wrong way!)

Gerwin: What you call transparency probably is correlated closely with complexity, and complexity tends to increase with scale. There are a number of ways for dealing with complexity in traditional system design, for instance abstraction, modularity, and architecture. All of these also help with formal verification and make it easier to scale. For example, abstraction hides implementation complexity behind an interface and makes it possible to talk about a part of a system without knowing all of the system.

A great early example of the power of abstraction is the London underground tube map: it tells you which trains go where and connect with which other trains, and it gives you a rough geographical relation, but it doesn’t tell you which bends the tracks make, where precisely each station is, or what is above ground between stations. You can confidently navigate a fairly complex train system as a passenger without knowing anything at all about the more detailed mechanics of how everything works and where precisely everything is.

In our seL4 verification, we initially couldn’t make much use of abstraction, because in a microkernel everything is tightly interrelated. In larger systems this is less necessary, in part because something like a kernel is then available to provide modules and isolation.

In the longer run, we benefited from abstraction even in the seL4 verification. We first verified that the code correctly implements a higher-level specification, which is an abstraction of the real kernel code. Years later, we used that much simpler abstraction, and, without even looking at the complex C code, could prove very specific security properties about it, such as that seL4 respects integrity and confidentiality. Since we had first proved that this specification was a true abstraction of the code, and because the properties we were interested were preserved by this kind of abstraction, we had automatically gained a verification of the code as well. The win in effort was significant: the verification of integrity was 20 times less expensive than the original code verification.

In my opinion, making serious use of such principles is going to be the way to scale verification. It is already what systems engineers do for complex but critical systems. The way to use it for verification is to stick with it more closely and not to make convenience exceptions that break abstractions or circumvent the system architecture. In the end, everything that makes it easier for humans to think about a system, will help to verify it.

Luke: You wrote that “There are a number of ways for dealing with complexity in traditional system design, for instance abstraction, modularity, and architecture.” You gave examples of verifying complex code using abstraction and modularity — what did you have in mind for using architecture to make verification of complex systems more tractable?

Gerwin: Yes, thanks for spotting that.

There are two main ways I was thinking of in which good architecture can help verification. The first is just the fact that an explicit architecture and design will tell you which modules or layers exist in the system and how they are connected to each other. Making this explicit will enable you to draw some first conclusions about the system without even needing to look at the code. For example, if two components are not connected to each other, and the underlying operating system enforces the architecture correctly, then these two components will not be able to influence each other’s behaviour, and that means you can verify each one in isolation. You now have two simpler verification problems instead of one large one.

The second way comes into play when you build the critical property you are interested in verifying directly into the architecture from the start. For example, imagine a trusted data logging component in a critical device. The properties you might want from this component are: it only accesses its own log files, it only writes data and never reads, and no other component in the system can write to those log files. To verify this property in a monolithic system, you’d either have to prove of all components with file write access that they don’t overwrite the log file, or you would have to prove correct the access control mechanism of a complex file system. Instead of doing all that work, you could put an additional small filter component between any writers and the file system: the filter will deny requests to write to the log file and allow the rest. This is easy to implement and easy to prove correct. You now have reduced a large, potentially infeasible verification task to a (hopefully) small runtime cost and an architecture property.

For any of this to work, you need to make sure that the real code in the end actually adheres to the architecture and that the communication and protection structure of the architecture is enforced by the underlying system. This is why our group is thinking about such security architecture patterns: a formally verified microkernel is the perfect building block for building high assurance systems in this way.

Luke: I’d like to get your perspective on something I previously asked Emil Vassev: To what degree can we demonstrate both deductive guarantees and probabilistic guarantees with today’s formal verification methods? If we could generalize formal specification and verification procedures to allow both for deductive guarantees and probabilistic guarantees, perhaps we could verify larger, more complex, and more diverse programs.

Many of the deductive proofs for safety properties in today’s formally verified systems are already “probabilistic” in the sense that the designers have some subjective uncertainty as to whether the formal specification accurately captures the intuitively desirable safety properties, and (less likely) whether there was an error in the proof somewhere.

But the question I’m trying to ask is about explicitly probabilistic guarantees within the proof itself. Example: the verifier could represent that it ran out of computation before it could prove or disprove whether the system had property X, but given its reasoning and search process thus far it’s 80% confident that the system has property X, kind of like human researchers today can’t prove whether P = NP or not, but they’re (say) subjectively 80% confident that P ≠ NP. Though perhaps to do this, we’d first need a solution to the problem of how a formal system or logical calculus can have internally consistent probabilistic beliefs (and belief updates) about logical sentences (a la Demski 2012 or Hutter et al. 2012).

Gerwin: Emil gave a great answer to this question already, so I’ll try to approach it from a slightly different angle. I see two ways in which probability can play a role in verification: one where you reason about probabilities directly, be it just in the logic or even in a setting where the program itself is probabilistic, or another one where you reason classically, but are trying to figure out what that means for the overall probability that your system will work as intended.

There are good examples for both of these methods, and both are useful to apply, even within the same project. As Emil pointed out, it is preferable to use full deduction and have the ability to draw definite conclusions. But when that is not available, you may be satisfied with at least a probabilistic answer. There are a number of probabilistic model checkers, for instance PRISM from Marta Kwiatkowska’s group in Oxford, or the Probabilistic Consistency Engine (PCE) from SRI. These tools will let you attach probabilities to certain events in a system or attach probabilities to beliefs about the state of the world and reason from there. The Probabilistic Consistency Engine is very close to what you mentioned in the question: reconciling potentially conflicting probabilistic beliefs about the world and drawing probabilistic conclusions from those descriptions.

There are also programs and protocols that directly work with probability, and for instance virtually flip a coin to make certain decisions. The Firewire protocol for instance uses this mechanism, and despite its use of probability you can make definite claims about its behaviour. Carroll Morgan who is working with us at UNSW has been applying probabilistic reasoning about such programs to security questions and program construction. To give you a flavour of what is possible in this space: one of our PhD students, David Cock, has recently used Carroll’s methods in Isabelle to reason about covert channels in microkernels and the maximum bandwidth at which they can leak information.

Finally, there is the question how uncertainty about requirements and uncertainty about the physical world fits formal verification. John Rushby from SRI has a great take on this question in his paper about possibly perfect systems Logic and Epistemology in Safety Cases. His point is that there are two different kinds of probability to consider: the probability that the system is correct, and the underlying uncertainty we have about the world. For example, a utopian fully verified system where we captured all requirements correctly and where we got everything perfect might still fail, if one of its physical components fails from wear and tear or overheating or because someone dropped it into a volcano.

Formal verification does nothing about the latter kind of probability, but it increases the probability that the system is perfect (even if it can never fully get there). We can use this concept to figure out where verification has the largest impact, and quantify how much impact it has. For example, it may not make much sense to formally verify a system where even the best possible solution could only be right about 50% of the time, because the world around it behaves too randomly. On the other hand, we can get a huge pay-off from formal verification in a setting where requirements and system can be captured precisely. Rushby’s approach lets us quantify the difference.

Luke: Levitt (1999) argued that as intelligent systems become increasingly autonomous, and operate in increasingly non-deterministic, unknown environments, it will not be possible to model their environment and possible behaviors entirely with tools like control theory:

If robots are to put to more general uses, they will need to operate without human intervention, outdoors, on roads and in normal industrial and residential environments where unpredictable… events routinely occur. It will not be practical, or even safe, to halt robotic actions whenever the robot encounters an unexpected event or ambiguous [percept].

Currently, commercial robots determine their actions mostly by control-theoretic feedback. Control-theoretic algorithms require the possibilities of what can happen in the world be represented in models embodied in software programs that allow the robot to pre-determine an appropriate action response to any task-relevant occurrence of visual events. When robots are used in open, uncontrolled environments, it will not be possible to provide the robot with a priori models of all the objects and dynamical events that might occur.

In order to decide what actions to take in response to un-modeled, unexpected or ambiguously interpreted events in the world, robots will need to augment their processing beyond controlled feedback response, and engage in decision processes.

Thus, he argues, autonomous programs will eventually need to be decision-theoretic.

What do you think of that prediction? In general, what tools and methods do you think will be most relevant for achieving high safety assurances for the highly autonomous systems of the future?

Gerwin: There’s a lot of work ahead of us in this area, and I don’t think we have very good solutions yet. After all, we’re still struggling to build simpler systems properly.

On the other hand, I agree that more autonomy seems to be the direction that systems are heading towards. In the DARPA HACMS program that Kathleen Fisher is managing we are investigating some approaches to high-assurance autonomous vehicles. One thing you can do for instance, is use a so-called simplex architecture. You use two controllers: one complex, autonomous controller and an additional, simpler safety controller that you can provide high assurance for. During normal operation, the complex autonomous controller runs, and provides convenient, even smart behaviour. The simple safety controller stays in the background and monitors operations. It only checks whether the vehicle is still within safe parameters. Should something unforeseen happen, such as a fault, a security attack, or simply a defect in the complex controller, the simple one takes over and puts the system in a safe mode. For autonomous air vehicles, this could be “fly home to base”, or “land safely in a free spot”. This solution doesn’t work for everything, but it is a good first step.

Whether we will really start building systems that come up with their own unexpected behaviour, based on independent reasoning, or not, I can’t really say. In current autonomous vehicles, like self-driving cars or proposals like delivery drones, this is certainly not yet the case. And I have a hard time imagining regulatory authorities agreeing to something like this (with good reasons). Instead of trying to predict everything, these systems currently are constructed to safely deal with classes of problems. For instance, you don’t need to know precisely which kind of obstacles you will encounter for a self-driving car. You don’t need to know if it is an animal, debris, or a small child. It’s enough to see that something is in your way or is on a collision course and react to this better than humans do. And you err on the side of caution rather than take risks. To get an improvement over the current state of human drivers for instance, you don’t need to build a system that is perfect. You only need to build one that drives a lot safer than humans. That is a much simpler problem.

Systems that do their own reasoning could still be made safe, if you can show that the reasoning is sound and you can explore the implication space of the situations you expect it to be in. The problem is that automated reasoning is hard — not just complicated to think about, but a theoretically hard problem where even simple problems need a lot of computational power. Building systems directly even for complex situations is currently still a lot easier than that, so I’d say that autonomous programs that do their own reasoning are still closer to science fiction than reality. Asimov played with this vision in his famous laws of robotics, which in his stories often lead to surprising, but perfectly valid (according to those laws) behaviours. In the end, if you really want to build a system that can have truly unexpected behaviour, then by definition you cannot verify that it is safe, because you just don’t know what it will do. You’d have to deal with them more like you deal with people than you deal with machines.

Luke: What do you think are some promising “next steps” — say, over the next decade — for research on how to achieve high-assurance AI systems?

Gerwin: There’s a lot of work ahead of us in this area, and I don’t think we have very good solutions yet. After all, we’re still struggling to build simpler systems properly.

I think research such as in the HACMS program is encouraging: high-assurance control and sensing algorithms, high-assurance operating systems, not just kernels, but including the wider system infrastructure, high-level design tools that allow architecture-driven approaches and analysis, and generally stronger verification tools will be the key to make such autonomous systems more reliable and safer.

If we can crack the next order of magnitude in system scale, things will get a lot more interesting. There are some good approaches for this developing: instead of trying to find proofs without context, we have had some initial success in synthesizing more code from specifications and generating the corresponding proofs automatically. If that approach works on a larger scale, it could make building high-assurance systems dramatically cheaper in the future.

Luke: Thanks, Gerwin!