July 2016 Newsletter
Research updates
General updates
News and links

 Inspiration for these gyms came in part from Chris Olah and Dario Amodei in a conversation with Rafael. ↩
Research updates
General updates
News and links

Future of Humanity Institute Research Fellow Jan Leike and MIRI Research Fellows Jessica Taylor and Benya Fallenstein have just presented new results at UAI 2016 that resolve a longstanding open problem in game theory: “A formal solution to the grain of truth problem.”
Game theorists have techniques for specifying agents that eventually do well on iterated games against other agents, so long as their beliefs contain a “grain of truth” — nonzero prior probability assigned to the actual game they’re playing. Getting that grain of truth was previously an unsolved problem in multiplayer games, because agents can run into infinite regresses when they try to model agents that are modeling them in turn. This result shows how to break that loop: by means of reflective oracles.
In the process, Leike, Taylor, and Fallenstein provide a rigorous and general foundation for the study of multiagent dilemmas. This work provides a surprising and somewhat satisfying basis for approximate Nash equilibria in repeated games, folding a variety of problems in decision and game theory into a common framework.
The paper’s abstract reads:
A Bayesian agent acting in a multiagent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayesoptimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.
In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayesoptimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayesoptimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play εNash equilibria in arbitrary unknown computable multiagent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.
Traditionally, when modeling computer programs that model the properties of other programs (such as when modeling an agent reasoning about a game), the first program is assumed to have access to an oracle (such as a halting oracle) that can answer arbitrary questions about the second program. This works, but it doesn’t help with modeling agents that can reason about each other.
While a halting oracle can predict the behavior of any isolated Turing machine, it cannot predict the behavior of another Turing machine that has access to a halting oracle. If this were possible, the second machine could use its oracle to figure out what the first machineoracle pair thinks it will do, at which point it can do the opposite, setting up a liar paradox scenario. For analogous reasons, two agents with similar resources, operating in realworld environments without any halting oracles, cannot perfectly predict each other in full generality.
Game theorists know how to build formal models of asymmetric games between a weaker player and a stronger player, where the stronger player understands the weaker player’s strategy but not vice versa. For the reasons above, however, games between agents of similar strength have resisted full formalization. As a consequence of this, game theory has until now provided no method for designing agents that perform well on complex iterated games containing other agents of similar strength.
Research updates
General updates
News and links

Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on errortolerant agent designs, “Safely interruptible agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.
Abstract:
Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in realtime under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button — which is an undesirable outcome.
This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the offpolicy learning property to prove that either some agents are already safely interruptible, like Qlearning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.
Research updates
General updates
News and links

I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our preexisting research agenda.
MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.
Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble presentday machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neuralnetworkinspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to presentday work in ML succeed in creating AGI?”.
Much of this work will be aimed at improving our highlevel theoretical understanding of taskdirected AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in longterm and largescale ways, task AI is limited to performing instructed tasks of limited scope, satisficing but not maximizing. Our hope is that investigating task AI from an ML perspective will help give information about both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.
To this end, we will begin by investigating eight relevant technical problems:
I’m happy to announce two new technical results related to the problem of logical uncertainty, perhaps our most significant results from the past year. In brief, these results split the problem of logical uncertainty into two distinct subproblems, each of which we can now solve in isolation. The remaining problem, in light of these results, is to find a unified set of methods that solve both at once.
The solutions for each subproblem are available in two new papers, based on work spearheaded by Scott Garrabrant: “Inductive coherence”^{1} and “Asymptotic convergence in online learning with unbounded delays.”^{2}
To give some background on the problem: Modern probability theory models reasoners’ empirical uncertainty, their uncertainty about the state of a physical environment, e.g., “What’s behind this door?” However, it can’t represent reasoners’ logical uncertainty, their uncertainty about statements like “this Turing machine halts” or “the twin prime conjecture has a proof that is less than a gigabyte long.”^{3}
Roughly speaking, if you give a classical probability distribution variables for statements that could be deduced in principle, then the axioms of probability theory force you to put probability either 0 or 1 on those statements, because you’re not allowed to assign positive probability to contradictions. In other words, modern probability theory assumes that all reasoners know all the consequences of all the things they know, even if deducing those consequences is intractable.
We want a generalization of probability theory that allows us to model reasoners that have uncertainty about statements that they have not yet evaluated. Furthermore, we want to understand how to assign “reasonable” probabilities to claims that are too expensive to evaluate.
Imagine an agent considering whether to use quicksort or mergesort to sort a particular dataset. They might know that quicksort typically runs faster than mergesort, but that doesn’t necessarily apply to the current dataset. They could in principle figure out which one uses fewer resources on this dataset, by running both of them and comparing, but that would defeat the purpose. Intuitively, they have a fair bit of knowledge that bears on the claim “quicksort runs faster than mergesort on this dataset,” but modern probability theory can’t tell us which information they should use and how.^{4}
What does it mean for a reasoner to assign “reasonable probabilities” to claims that they haven’t computed, but could compute in principle? Without probability theory to guide us, we’re reduced to using intuition to identify properties that seem desirable, and then investigating which ones are possible. Intuitively, there are at least two properties we would want logically nonomniscient reasoners to exhibit:
1. They should be able to notice patterns in what is provable about claims, even before they can prove or disprove the claims themselves. For example, consider the claims “this Turing machine outputs an odd number” and “this Turing machine outputs an even number.” A good reasoner thinking about those claims should eventually recognize that they are mutually exclusive, and assign them probabilities that sum to at most 1, even before they can run the relevant Turing machine.
2. They should be able to notice patterns in sentence classes that are true with a certain frequency. For example, they should assign roughly 10% probability to “the 10^{100}th digit of pi is a 7” in lieu of any information about the digit, after observing (but not proving) that digits of pi tend to be uniformly distributed.
MIRI’s work on logical uncertainty this past year can be very briefly summed up as “we figured out how to get these two properties individually, but found that it is difficult to get both at once.” Read more »
Research updates
General updates
News and links
