Benja Fallenstein researches mathematical models of human and animal behavior at Bristol University, as part of the MAD research group and the decision-making research group.

Before that, he graduated from University of Vienna with a BSc in Mathematics. In his spare time, Benja studies questions relevant to AI impacts and Friendly AI, including: AI forecasting, intelligence explosion microeconomics, reflection in logic, and decision algorithms.

He has attended two of MIRI’s research workshops, and is scheduled to attend another in December.

**Luke Muehlhauser**: Since you’ve attended two MIRI research workshops on “Friendly AI math,” I’m hoping you can explain to our audience what that work is all about. To provide a concrete example, I’d like to talk about the Löbian obstacle to self-modifying artificial intelligence, which is one of the topics that MIRI’s recent workshops have focused on. To start with, could you explain to our readers what this problem is and why you think it is important?

**Benja Fallenstein**: MIRI’s research is based on I.J. Good’s concept of the intelligence explosion: the idea that once we build an artificial intelligence that’s as good as a human at doing artificial intelligence research, this AI will be able to figure out how to make itself even smarter, and even better at AI research, leading to a runaway process that will eventually create machines far surpassing the capabilities of any human being. When this happens, we really want these machines to have goals that humanity would approve of. True, it’s not very likely that an AI would decide that it wants to rule over us (that’s just anthropomorphism), but most goals that a computer could have would be dangerous to us: For example, imagine a computer that wants to calculate π to as many digits as possible. That computer will see humans as being made of atoms which it could use to build more computers; and worse, since we would object to that and might try to stop it, we’d be a potential threat that it would be in the AI’s interest to eliminate (Omohundro 2008). So we want to make very sure that the end result of an intelligence explosion (after many, many self-improvements) is an AI with “good” goals.

Now you might think that all we need to do is to build our initial AI to have “good” goals. As a toy model, imagine that you program your AI to only take an action x if it can prove that doing x will lead to a “good” outcome. Then the AI won’t self-modify to have “bad” goals, because it won’t be able to prove that this self-modification to a “good” outcome. But on the other hand, you’d think that this AI would be able to self-modify in a way that leaves its goals intact: You’d think that it would be able to reason, “Well, the new version of me will only take an action y if it can prove that this leads to an outcome it likes, and it likes ‘good’ outcomes, just as I do — so whatever it does will lead to a ‘good’ outcome, it’s all fine!”

But here’s the problem: In this chain of reasoning, our AI needs to go from “the new version will only take an action y if it has proven that y leads to a good outcome” to “it will only take y if this actually leads to a good outcome.” Intuitively, this seems like a perfectly reasonable argument; after all, we trust proofs in whatever formal system the AI is using (or we’d have programmed the AI to use a different system), so why shouldn’t the AI do the same? But by Löb’s Theorem, no sufficiently strong formal system can know that everything that it proves to be true is actually true. That’s what we call the “Löbian obstacle.”

**Luke**: You’ve called using mathematical proofs a “toy model,” but that’s exactly how work at the recent MIRI workshops has approached the Löbian obstacle. Do you think that practical AI designs will be based on logic and proofs? How confident are you that the Löbian obstacle will be relevant to a realistic artificial intelligence, and that the work MIRI is currently doing will be applicable in that context?

**Benja**: We certainly don’t think that a realistic AI will literally be able to find a mathematical proof that its actions are guaranteed to lead to “good” outcomes. Any practical AI will be uncertain about many things and will need to use probabilistic reasoning. There are two reasons why I think that MIRI’s current work has a decent chance of being relevant in that setting.

First, Löb’s theorem is only one instance of a “diagonalization argument” placing limits on the degree to which a formal system can do self-reference. For example, there’s Tarski’s theorem that a powerful formal language can’t talk about which sentences in that language are true, because otherwise you could have a formal analogue of the Liar paradox “this sentence is false,” and Turing’s halting problem, which says that there’s no computer program which can say for arbitrary other programs whether they go into an infinite loop. Other well-known examples include Russell’s paradox and Cantor’s argument that not all infinite sets are the same size. Similar arguments apply to simple-minded ways of doing probabilistic reasoning, so I feel that it’s unlikely that the problem will just automatically go away when we start using probability, and I think there is a decent chance that the work we are doing now will lead to insights that are applicable in that setting.

And second, in order to achieve a reasonable probability that our AI still follows the same goals after billions of rewrites, we must have a very low chance of going wrong in every single step, and machine-verified formal mathematical proofs are the one way we know to become extremely confident that something is true (especially statements like “this AI design won’t destroy the world”, where we don’t get to just observe many independent examples). Although you can never be sure that a program will work as intended when run on a real-world computer — it’s always possible that a cosmic ray will hit a transistor and make things go awry — you can prove that a program would satisfy certain properties when run on an ideal computer. Then you can use probabilistic reasoning and error-correcting techniques to make it extremely probable that when run on a real-world computer, your program still satisfies the same property. So it seems likely that a realistic Friendly AI would still have *components* that do logical reasoning or something that looks very much like it.

I tend to think not in terms of the results that we are currently proving being directly relevant to a future AI design; rather, I hope that the work we are currently doing will help us understand the problems better and lead to insights that lead to insights that ultimately allow us to build a safe self-improving machine intelligence.

**Luke**: What sort of historical precedent do we have for doing technical work that we hope will lead to some insights, which will lead to other insights, which will lead to other insights, which will lead to useful application many years later?

I suppose that kind of thing happens in mathematics on occasion, for example when in the 1980s it was discovered that one might be able to prove Fermat’s Last Theorem via the modularity theorem, which prompted Andrew Wiles to pursue this attack, which allowed him to prove Fermat’s Last Theorem after about a decade of work (Singh 1997). Another example was Hamilton’s attack on the Poincaré conjecture via Ricci flow on a manifold, which began in 1982 and led to Perelman’s proof in 2003 (Szpiro 2008). Of course, other conjectures have thus far resisted decades of effort to prove them, for example the Riemann Hypothesis (Rockmore 2007) and P ≠ NP (Fortnow 2013).

But “goal stability under self-modification” isn’t as well-defined as the conjectures by Fermat and Poincaré. Maybe more analogous examples come from the field of computer science? For example, many early AI scientists worked toward the goal of writing a computer program that could play Grandmaster-level chess, even though they couldn’t be sure exactly what such a program would look like. There are probably analogues in quantum computing, too.

But anyway: how do you think about this?

**Benja**: My gut feeling actually tends to be that what we are trying to do here is fairly unusual — and for a good reason: it’s risky. If you want to be sure that what you’re working on isn’t a dead end, you’d certainly want to choose a topic where the gap between our goals and our current knowledge is smaller than in FAI. But I’m worried that if we wait with doing FAI research until we understand how AGIs will work, then there won’t be enough time remaining before the intelligence explosion to actually get the task done, so my current feeling is that the right tradeoff is to start now despite the chance of taking the wrong tack.

But then again, maybe our situation isn’t as unusual as my gut feeling suggests. Depending on how close you want the analogy to be, there may be many examples where scientists have a vague idea of the problem they want to solve, but aren’t able to tackle it directly, so instead they look for a small subproblem that they think they can make some progress on. You could tell a story where much of physics research is ultimately aimed at figuring out the true basic laws of the universe, and yet all a physicist can actually do is work on the next problem in front of them. Surely psychology was aimed from the beginning at figuring out all about how the human mind works, and yet starting by training rats to press a lever to get food, and later following this up by sticking an electrode in the rat’s brain and seeing what neurons are involved in accomplishing this task, can count as insights leading to insights that will plausibly help us figure out what’s really going on. Your own post on “searching under the streetlight” gives some examples of this pattern as well.

**Luke**: Can you say more about why you and some others think this problem of stability under self-modification should be investigated from the perspective of mathematical logic? For example, Stanford graduate student Jacob Steinhardt commented that the first tool he’d reach for to investigate this problem would not be mathematical logic, but instead “a martingale…, which is a statistical process that somehow manages to correlate all of its failures with each other… This can yield bounds on failure probability that hold for extremely long time horizons, even if there is non-trivial stochasticity at every step.”

**Benja**: I said earlier that in order to have a chance that our AI still follows the same goals after billions of rewrites, the probability of going wrong on any particular step must be very, very small. That’s true, but it’s not very quantitative. If we want to have 99% probability of success, how large a risk can we take on any particular step? It would be sufficient if the probability is lower than one in 100 billion each time, but that’s not really necessary. Jacob’s idea of using martingales is a similar but more flexible answer to this question, which allows you to take slightly larger risks under certain circumstances.

But even with this additional flexibility, you will still need a way to achieve extremely high confidence that what you’re doing is safe on most of the rewrite steps. And we can’t just achieve that confidence by the tried-and-true way of running an experiment with a large sample: The question is whether a rewrite our nascent AI is considering early on will lead to the intended results after the AI has become superintelligent and spread through the solar system and beyond — you can’t just simulate that, if you don’t already have these resources yourself!

So we need a way to reason abstractly about how our AI will behave in situations that are completely different from what we can simulate at the moment, and we will need to reach extreme confidence that these abstract conclusions are in fact correct. There is exactly one way we know how to do this, and that’s to use formally verified proofs in mathematical logic.

**Luke**: Suppose John Doe had an intuition that despite the fact that he is not a cognitive system with a logical architecture, he feels like he could make lots of self-modifications while retaining his original goals, if he had enough computational power and plenty of time to reason about whether the next self-modification he’s planning is going to change his goals. If this intuition is justified, then this suggests there are other methods we might use, outside mathematical logic, to ensure a very small probability of goal corruption upon self-modification. What would you say to John?

**Benja**: I’d say that I think he underestimates the difficulty of the problem. Two things:

First, my impression is that a lot of people have an intuition that they are already making self-modifications all the time. But the changes that humans can make with present-day technology don’t change the design of the hardware we’re running on — they pale against the difference between a human and a chimpanzee, and a self-improving AI would very likely end up making much more fundamental changes to its design than the relatively small number of tweaks evolution has applied to our brains in the last five million years.

But second, John might say that even taking this into account, he thinks that given enough time to learn how his brain works, and reasoning very carefully about every single step he’s taking, he should be able to go through a long chain of self-modifications that preserve his values. In this case, I think it’s fairly likely that he’s just wrong. However, I could imagine that a human could in fact succeed in doing this — but not without achieving the same sort of extremely high confidence in each single rewrite step that we want our AI to have, and I think that if a human could manage to achieve this sort of confidence, it would be by … proving mathematical theorems and having the proofs formally checked by a computer!

**Luke**: Yeah, when people say that humans self-modify all the time without changing their goals, I give two replies of my own. First, I point out that people’s goals and values do change pretty often. And second, I point out how little humans actually self-modify. For example I once switched from fundamentalist Christian to scientific naturalist, and this went along with a pretty massive shift in how I process evidence and argument. But throughout that worldview change, my brain was still (e.g.) using the temporal difference reinforcement learning algorithm in my dopaminergic reward system. As far as we know, there were no significant changes to my brain’s core algorithms during that period of transformation. Humans never actually self-modify very much, not like an AI would.

My next question has to do with AI capability. As AI scientists know, logic-based AI is generally far less capable than AI that uses machine learning methods. Is the idea that only a very small part of a future self-modifying AI would have a logical structure (so that it could prove the goodness of modifications to its core algorithms), and the rest of the AI would make use of other methods? Sort of like how small parts of safety-critical software (e.g. for flight control) are written in a structured way such that they are amenable to formal verification, but the rest of the system isn’t necessarily written in ways amenable to formal verification?

**Benja**: I think that your point that people’s values actually do change often is useful for intuition, and I also feel it’s important to point out that these changes are again actually pretty small compared to what could happen if you were to change your brain’s entire architecture. People might switch between being committed environmentalists and feeling that environmentalism is fundamentally misguided, for example, but they don’t tend to become committed triangularists who feel that it’s a moral imperative to make all every-day tools triangular in shape. Both murder and condemnation of murder are human universals, traits that are common to all cultures world-wide; we are talking about changes to our cognitive architecture easily fundamental enough that they could lead to the impulse to non-triangularism and the condemnation of this non-triangularism to become similarly universal.

Yes, I think that logical reasoning would be only one tool in the toolbox of a Friendly AI, and it would use different tools to reason about most things in its environment. Even when reasoning about its own behavior, I would expect the AI to only use logic to prove theorems about how it would behave when run on “ideal” hardware (or hardware that has certain bounds on error rates, etc.), and then use probabilistic reasoning to reason about what happens if it runs on real hardware in the physical world. (But with regard to your analogy to present-day safety-critical software, I want to point out that unlike with present-day safety-critical software, I expect the AI to prove theorems about all of its own component parts, I just don’t expect it to use logic to reason about, say, chairs. Formal verification is difficult and time-consuming, which is the reason why we currently only apply it to small parts of safety-critical systems, but I expect future AIs to be up to the task!)

**Luke**: Hmmm. That’s surprising. My understanding is that formal verification methods do not scale well at all, both due to computational intractability and due to the hours of human labor required to write a correct formal specification against which one could verify a complicated system. Why do you think a future AI would be “up to the task” of proving theorems “about all of its own component parts”?

**Benja**: Well, for one thing, I generally expect future AIs to be smarter than us, and easily able to do intellectual tasks that would take very many hours of human labor; and I don’t expect that they will get tired of the menial task of translating their mathematical “intuitions” into long chains of “boring” lemmas, like humans do.

But more specifically, we humans have an intuitive understanding of why we expect the systems we build to work, and my feeling is that probably a major reason for why it is difficult to translate that understanding into formal proofs is that there’s a mismatch between the way these intuitions are represented in our brain and the way the corresponding notions are represented in a formal proof system. In other words, it seems likely to me that when you build a cognitive architecture from scratch, you could build it to have mathematical “intuitions” about why a certain piece of computer code works that are fairly straightforward to translate into formally verifiable proofs. In fact, similarly to how I would expect an AI to directly manipulate representations of computer code rather than using images and verbal sounds like we humans do, I think it’s likely that an FAI will do much of its reasoning about why a piece of computer code works by directly manipulating representations of formal proofs.

That said, it also often seems to happen that we humans know by experience that a certain algorithm or mathematical trick tends to work on a lot of problems, but we don’t have a full explanation for why this is. I do expect future AIs to have to do reasoning of this type as well, and it does seem quite plausible that an AI might want to apply this type of reasoning to (say) machine learning algorithms that it uses for image processing, where a mistake can be recovered from — though likely not to the code it uses to check that future rewrites will still follow the same goal system! And I still expect the AI to prove theorems about its image processing algorithm, I just expect them to be things like “this algorithm will always complete in at most the following number of time steps” or “this algorithm will be perform correctly under the following assumptions, which are empirically known to be true in a very large number of cases.”

**Luke**: Thanks, Benja!

**Did you like this post?** You may enjoy our other Conversations posts, including: