June 2016 Newsletter
Research updates
General updates
News and links

Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on error-tolerant agent designs, “Safely interruptible agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.
Abstract:
Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions (harmful either for the agent or for the environment) and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button, which is an undesirable outcome.
This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that either some agents are already safely interruptible, like Q-learning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.
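To make the off-policy distinction in the abstract concrete, here is a minimal sketch of the standard textbook Q-learning and Sarsa updates. It is not the paper's formal construction: the two-action toy setup, the names `ACTIONS`, `fresh_table`, `q_table`, and `sarsa_table`, and the step-size and discount values are all assumptions chosen for illustration. The scenario imagines an operator forcing the action "stop" in the next state.

```python
from collections import defaultdict

ACTIONS = ("go", "stop")  # hypothetical action set for this toy example


def q_learning_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the best next action,
    regardless of which action is actually executed next."""
    target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken next."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def fresh_table():
    """Value table in which continuing ("go") from state 1 looks valuable."""
    Q = defaultdict(float)
    Q[(1, "go")] = 1.0
    return Q


# Assumed scenario: in state 1 the operator interrupts and forces "stop".
q_table = fresh_table()
sarsa_table = fresh_table()
q_learning_update(q_table, s=0, a="go", r=0.0, s_next=1)
sarsa_update(sarsa_table, s=0, a="go", r=0.0, s_next=1, a_next="stop")

print(q_table[(0, "go")], sarsa_table[(0, "go")])
```

In this sketch the Q-learning estimate for state 0 is unaffected by the forced action, while the Sarsa estimate absorbs the interruption into its value for the preceding step. This only illustrates why the off-policy property is relevant; the paper's definition of safe interruptibility and its modification of Sarsa are not reproduced here.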
I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our preexisting research agenda.
MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.
Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble present-day machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neural-network-inspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to present-day work in ML succeed in creating AGI?”
Much of this work will be aimed at improving our high-level theoretical understanding of task-directed AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in long-term and large-scale ways, task AI is limited to performing instructed tasks of limited scope, satisficing but not maximizing. Our hope is that investigating task AI from an ML perspective will help give information about both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.
To this end, we will begin by investigating eight relevant technical problems:
I’m happy to announce two new technical results related to the problem of logical uncertainty, perhaps our most significant results from the past year. In brief, these results split the problem of logical uncertainty into two distinct subproblems, each of which we can now solve in isolation. The remaining problem, in light of these results, is to find a unified set of methods that solve both at once.
The solutions for each subproblem are available in two new papers, based on work spearheaded by Scott Garrabrant: “Inductive coherence” and “Asymptotic convergence in online learning with unbounded delays.”
To give some background on the problem: Modern probability theory models reasoners’ empirical uncertainty, their uncertainty about the state of a physical environment, e.g., “What’s behind this door?” However, it can’t represent reasoners’ logical uncertainty, their uncertainty about statements like “this Turing machine halts” or “the twin prime conjecture has a proof that is less than a gigabyte long.”
Roughly speaking, if you give a classical probability distribution variables for statements that could be deduced in principle, then the axioms of probability theory force you to put probability either 0 or 1 on those statements, because you’re not allowed to assign positive probability to contradictions. In other words, modern probability theory assumes that all reasoners know all the consequences of all the things they know, even if deducing those consequences is intractable.
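The argument above can be written out in a few lines. The notation is an assumption introduced here for illustration: T is the set of things the reasoner knows, T ⊢ φ means φ can be deduced from T in principle, and P is a probability distribution over sentences that respects the usual axioms relative to T.

```latex
% Assumed notation: T is the reasoner's knowledge, \vdash is deducibility,
% and P is a probability distribution over sentences, coherent relative to T.
\[
  T \vdash \varphi \;\Longrightarrow\; P(\varphi) = 1
\]
\[
  P(\varphi) + P(\neg\varphi) = 1
\]
\[
  T \vdash \neg\varphi \;\Longrightarrow\; P(\neg\varphi) = 1 \;\Longrightarrow\; P(\varphi) = 0
\]
```

So any statement that is settled by T, whether or not the deduction is tractable, receives probability exactly 0 or 1, leaving no room for intermediate degrees of belief.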
We want a generalization of probability theory that allows us to model reasoners that have uncertainty about statements that they have not yet evaluated. Furthermore, we want to understand how to assign “reasonable” probabilities to claims that are too expensive to evaluate.
Imagine an agent considering whether to use quicksort or mergesort to sort a particular dataset. They might know that quicksort typically runs faster than mergesort, but that doesn’t necessarily apply to the current dataset. They could in principle figure out which one uses fewer resources on this dataset, by running both of them and comparing, but that would defeat the purpose. Intuitively, they have a fair bit of knowledge that bears on the claim “quicksort runs faster than mergesort on this dataset,” but modern probability theory can’t tell us which information they should use and how.
What does it mean for a reasoner to assign “reasonable probabilities” to claims that they haven’t computed, but could compute in principle? Without probability theory to guide us, we’re reduced to using intuition to identify properties that seem desirable, and then investigating which ones are possible. Intuitively, there are at least two properties we would want logically non-omniscient reasoners to exhibit:
1. They should be able to notice patterns in what is provable about claims, even before they can prove or disprove the claims themselves. For example, consider the claims “this Turing machine outputs an odd number” and “this Turing machine outputs an even number.” A good reasoner thinking about those claims should eventually recognize that they are mutually exclusive, and assign them probabilities that sum to at most 1, even before they can run the relevant Turing machine.
2. They should be able to notice patterns in sentence classes that are true with a certain frequency. For example, they should assign roughly 10% probability to “the 10^{100}th digit of pi is a 7” in the absence of any information about the digit, after observing (but not proving) that digits of pi tend to be uniformly distributed.
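The two properties above can be illustrated with a toy sketch. This is not the method of either paper: the rescaling rule in `rescale_exclusive`, the frequency estimate in `digit_frequency`, and all names below are assumptions made for illustration, and the sample is just the first 50 decimal digits of pi.

```python
# First 50 decimal digits of pi (after the leading 3), used as observations.
PI_DECIMALS = (
    "1415926535" "8979323846" "2643383279" "5028841971" "6939937510"
)


def rescale_exclusive(probabilities):
    """Property 1 (toy version): once claims are recognized as mutually
    exclusive, make their probabilities sum to at most 1."""
    total = sum(probabilities.values())
    if total <= 1.0:
        return dict(probabilities)
    return {claim: p / total for claim, p in probabilities.items()}


def digit_frequency(digits, digit):
    """Property 2 (toy version): estimate the probability of a digit from
    its observed frequency in the digits computed so far."""
    return digits.count(digit) / len(digits)


# Initial guesses that overcount, before noticing the claims exclude each other.
coherent = rescale_exclusive({"outputs odd": 0.7, "outputs even": 0.6})

# A frequency-based guess for an uncomputed digit being a 7.
estimate_for_7 = digit_frequency(PI_DECIMALS, "7")

print(coherent, estimate_for_7)
```

With only 50 observed digits the frequency estimate is rough, which matches the "roughly 10%" wording above. The hard part, as the next paragraph notes, is getting both behaviors from a single method rather than from two separate heuristics like these.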
MIRI’s work on logical uncertainty this past year can be very briefly summed up as “we figured out how to get these two properties individually, but found that it is difficult to get both at once.”
MIRI Research Fellow Andrew Critch has written a new paper on cooperation between software agents in the Prisoner’s Dilemma, available on arXiv: “Parametric bounded Löb’s theorem and robust cooperation of bounded agents.” The abstract reads:
Löb’s theorem and Gödel’s theorem make predictions about the behavior of systems capable of self-reference with unbounded computational resources with which to write and evaluate proofs. However, in the real world, systems capable of self-reference will have limited memory and processing speed, so in this paper we introduce an effective version of Löb’s theorem which is applicable given such bounded resources. These results have powerful implications for the game theory of bounded agents who are able to write proofs about themselves and one another, including the capacity to outperform classical Nash equilibria and correlated equilibria, attaining mutually cooperative program equilibrium in the Prisoner’s Dilemma. Previous cooperative program equilibria studied by Tennenholtz and Fortnow have depended on tests for program equality, a fragile condition, whereas “Löbian” cooperation is much more robust and agnostic of the opponent’s implementation.
Tennenholtz (2004) showed that cooperative equilibria exist in the Prisoner’s Dilemma between agents with transparent source code. This suggested that a number of results in classical game theory, where it is a commonplace that mutual defection is rational, might fail to generalize to settings where agents have strong guarantees about each other’s conditional behavior.
Tennenholtz’s version of program equilibrium, however, only established that rational cooperation was possible between agents with identical source code. Patrick LaVictoire and other researchers at MIRI supplied the additional result that more robust cooperation was possible between noncomputable agents, and that it is possible to efficiently determine the outcomes of such games. However, some readers objected to the infinitary nature of the methods (for example, the use of halting oracles) and worried that not all of the results would carry over to finite computations.
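The equality-based equilibrium described above, and the fragility the abstract mentions, can be sketched in a few lines. This is an illustrative toy under stated assumptions, not code from any of the cited papers: programs are represented as Python source strings, and the names `EQUALITY_BOT`, `DEFECT_BOT`, `run`, and `play` are hypothetical.

```python
# Each program is a source string defining act(my_source, opponent_source),
# which returns "C" (cooperate) or "D" (defect).
EQUALITY_BOT = '''
def act(my_source, opponent_source):
    # Cooperate only with an exact textual copy of this program.
    return "C" if opponent_source == my_source else "D"
'''

DEFECT_BOT = '''
def act(my_source, opponent_source):
    return "D"
'''


def run(source, opponent_source):
    """Execute one program, giving it its own source and its opponent's."""
    namespace = {}
    exec(source, namespace)
    return namespace["act"](source, opponent_source)


def play(source_a, source_b):
    """Return the pair of moves when two transparent programs meet."""
    return run(source_a, source_b), run(source_b, source_a)


# Same behavior as EQUALITY_BOT, but the text differs by one comment line.
RENAMED_BOT = EQUALITY_BOT + "# a harmless trailing comment\n"

print(play(EQUALITY_BOT, EQUALITY_BOT))  # identical source
print(play(EQUALITY_BOT, DEFECT_BOT))    # cannot be exploited
print(play(EQUALITY_BOT, RENAMED_BOT))   # behaviorally identical, text differs
```

The third match shows why a test for program equality is a fragile condition: two programs that would behave identically still defect against each other because their source text differs. Proof-based cooperation of the kind discussed here is meant to avoid depending on such textual matches.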
Critch’s report demonstrates that robust cooperative equilibria exist for bounded agents. In the process, Critch proves a new generalization of Löb’s theorem, and therefore of Gödel’s second incompleteness theorem. This parametric version of Löb’s theorem holds for proofs that can be written out in n or fewer characters, where the parameter n can be set to any number. For more background on the result’s significance, see LaVictoire’s “Introduction to Löb’s theorem in MIRI research.”
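For reference, the classical (unbounded) theorem that the report generalizes can be stated as follows. The notation is an assumption introduced here: T is a theory extending Peano Arithmetic, and □P abbreviates "P is provable in T." The exact form of the parametric bounded version is given in Critch's paper and is not reproduced here.

```latex
% Classical Löb's theorem (assumed notation: \Box P means "P is provable in T").
\[
  \text{If } T \vdash \Box P \rightarrow P, \text{ then } T \vdash P .
\]
% Special case P = \bot: since (\Box\bot \rightarrow \bot) is equivalent to
% \neg\Box\bot, this is Gödel's second incompleteness theorem.
\[
  \text{If } T \vdash \neg\Box\bot, \text{ then } T \vdash \bot .
\]
```

The second line is why a generalization of Löb's theorem is also a generalization of Gödel's second incompleteness theorem: a theory that proves its own consistency must be inconsistent.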
The new Löb result shows that bounded agents face obstacles to self-referential reasoning similar to those faced by unbounded agents, and can also reap some of the same benefits. Importantly, this lemma will likely allow us to discuss many other self-referential phenomena going forward using finitary examples rather than infinite ones.
I’m happy to announce that Malo Bourgon, formerly a program management analyst at MIRI, has taken on a new leadership role as our chief operating officer.
As MIRI’s second-in-command, Malo will be taking over a lot of the hands-on work of coordinating our day-to-day activities: supervising our ops team, planning events, managing our finances, and overseeing internal systems. He’ll also be assisting me in organizational strategy and outreach work.
Prior to joining MIRI, Malo studied electrical, software, and systems engineering at the University of Guelph in Ontario. His professional interests included climate change mitigation, and during his master’s, he worked on a project to reduce waste through online detection of inefficient electric motors. Malo started working for us shortly after completing his master’s in early 2012, which makes him MIRI’s longest-standing team member next to Eliezer Yudkowsky.