New paper: “A formal solution to the grain of truth problem”

 |   |  Papers


Future of Humanity Institute Research Fellow Jan Leike and MIRI Research Fellows Jessica Taylor and Benya Fallenstein have just presented new results at UAI 2016 that resolve a longstanding open problem in game theory: “A formal solution to the grain of truth problem.”

Game theorists have techniques for specifying agents that eventually do well on iterated games against other agents, so long as their beliefs contain a “grain of truth” — nonzero prior probability assigned to the actual game they’re playing. Getting that grain of truth was previously an unsolved problem in multiplayer games, because agents can run into infinite regresses when they try to model agents that are modeling them in turn. This result shows how to break that loop: by means of reflective oracles.
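To make the "grain of truth" condition concrete, here is a minimal sketch of Bayesian updating over a small, finite class of candidate opponent policies. The policy names, the numbers, and the `posterior` helper are illustrative assumptions, not anything from the paper, which works with a far richer policy class.

```python
def posterior(prior, coop_prob, observations):
    """Bayes-update a prior over candidate opponent policies.

    prior:        dict mapping policy name -> prior probability
    coop_prob:    dict mapping policy name -> chance it plays "C" each round
    observations: sequence of "C"/"D" moves the opponent actually played
    """
    post = dict(prior)
    for move in observations:
        for name, p_c in coop_prob.items():
            post[name] *= p_c if move == "C" else 1.0 - p_c
        total = sum(post.values())
        post = {name: weight / total for name, weight in post.items()}
    return post

# Hypothetical policy class: each policy cooperates with a fixed probability.
policies = {"mostly_cooperate": 0.9, "coin_flip": 0.5, "mostly_defect": 0.1}
moves = ["C"] * 18 + ["D"] * 2  # data typical of a mostly cooperative opponent

# A tiny but nonzero prior on the right hypothesis: the grain of truth.
with_grain = posterior(
    {"mostly_cooperate": 0.01, "coin_flip": 0.495, "mostly_defect": 0.495},
    policies, moves)

# Zero prior on the right hypothesis: no amount of data can recover it.
without_grain = posterior(
    {"mostly_cooperate": 0.0, "coin_flip": 0.5, "mostly_defect": 0.5},
    policies, moves)

print(with_grain["mostly_cooperate"], without_grain["mostly_cooperate"])
```

With even 1% prior mass on the right hypothesis, the posterior concentrates on it after twenty rounds; with zero prior mass it stays at zero forever. The hard part in the multi-agent setting is that the "right hypothesis" includes agents that are running this same kind of inference on you.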

In the process, Leike, Taylor, and Fallenstein provide a rigorous and general foundation for the study of multi-agent dilemmas. This work provides a surprising and somewhat satisfying basis for approximate Nash equilibria in repeated games, folding a variety of problems in decision and game theory into a common framework.

The paper’s abstract reads:

A Bayesian agent acting in a multi-agent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.

In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play ε-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.
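For readers unfamiliar with Thompson sampling, the sketch below shows the idea in a much simpler setting than the paper's: a two-armed Bernoulli bandit with Beta posteriors. The function name, the payoff rates, and the seed are assumptions made for illustration; the paper applies the same "sample a hypothesis from the posterior, then act optimally for it" pattern to whole classes of environments.

```python
import random

def thompson_bandit(true_probs, steps, seed=0):
    """Run Thompson sampling on a Bernoulli bandit; return pulls per arm."""
    rng = random.Random(seed)
    n_arms = len(true_probs)
    wins = [1] * n_arms     # Beta(1, 1) prior: one pseudo-success per arm
    losses = [1] * n_arms   # ...and one pseudo-failure per arm
    pulls = [0] * n_arms
    for _ in range(steps):
        # Sample one plausible payoff rate per arm from its posterior.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(n_arms)]
        # Act optimally with respect to the sampled hypothesis.
        arm = max(range(n_arms), key=lambda a: samples[a])
        pulls[arm] += 1
        if rng.random() < true_probs[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

pulls = thompson_bandit([0.3, 0.7], steps=2000)
print(pulls)  # the better arm (index 1) should dominate
```

Because the agent occasionally acts on hypotheses it considers unlikely, it keeps exploring just enough to avoid locking in a wrong belief.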

Traditionally, when modeling computer programs that model the properties of other programs (such as when modeling an agent reasoning about a game), the first program is assumed to have access to an oracle (such as a halting oracle) that can answer arbitrary questions about the second program. This works, but it doesn’t help with modeling agents that can reason about each other.

While a halting oracle can predict the behavior of any isolated Turing machine, it cannot predict the behavior of another Turing machine that has access to a halting oracle. If this were possible, the second machine could use its oracle to figure out what the first machine-oracle pair thinks it will do, at which point it can do the opposite, setting up a liar paradox scenario. For analogous reasons, two agents with similar resources, operating in real-world environments without any halting oracles, cannot perfectly predict each other in full generality.
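The liar-paradox construction can be sketched in a few lines. This is a toy illustration, assuming agents are zero-argument Python functions and predictors are functions from agents to predicted actions; `make_contrarian` and the predictor names are hypothetical.

```python
def make_contrarian(predictor):
    """Build an agent that asks `predictor` what it will do, then does the opposite."""
    def contrarian():
        predicted = predictor(contrarian)
        return "B" if predicted == "A" else "A"
    return contrarian

def always_a(agent):
    return "A"

def always_b(agent):
    return "B"

# Any predictor that returns an answer is wrong about its own contrarian.
results = []
for predictor in (always_a, always_b):
    agent = make_contrarian(predictor)
    results.append((predictor.__name__, predictor(agent), agent()))

def simulate(agent):
    """Predict by running the agent, which is what a perfect model would do."""
    return agent()

# A predictor that simulates the agent never returns: the agent calls the
# predictor, which runs the agent, which calls the predictor, and so on.
try:
    make_contrarian(simulate)()
    regress = False
except RecursionError:
    regress = True

print(results, regress)
```

Each fixed predictor is contradicted by the agent built against it, and the simulating predictor falls into the infinite regress described above.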

Game theorists know how to build formal models of asymmetric games between a weaker player and a stronger player, where the stronger player understands the weaker player’s strategy but not vice versa. For the reasons above, however, games between agents of similar strength have resisted full formalization. As a consequence, game theory has until now provided no method for designing agents that perform well on complex iterated games containing other agents of similar strength.

Read more »

June 2016 Newsletter

 |   |  Newsletters

Research updates

General updates

News and links

New paper: “Safely interruptible agents”

 |   |  Papers

Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on error-tolerant agent designs, “Safely interruptible agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.


Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button — which is an undesirable outcome.

This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that either some agents are already safely interruptible, like Q-learning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
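The off-policy distinction the abstract relies on can be seen in the standard textbook update targets for the two algorithms. The sketch below is only an illustration of that distinction, with a made-up one-state Q-table and action names; it is not the paper's formal construction.

```python
def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstrap from the best next action,
    regardless of which action is actually executed next."""
    return reward + gamma * max(Q[next_state].values())

def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: bootstrap from the action actually executed next."""
    return reward + gamma * Q[next_state][next_action]

# Hypothetical values: the agent would like to "continue", but an operator's
# interruption can force "stop" instead.
Q = {"s1": {"continue": 10.0, "stop": 0.0}}
gamma = 0.5

q_target = q_learning_target(Q, 1.0, "s1", gamma)

sarsa_uninterrupted = sarsa_target(Q, 1.0, "s1", "continue", gamma)
sarsa_interrupted = sarsa_target(Q, 1.0, "s1", "stop", gamma)

print(q_target, sarsa_uninterrupted, sarsa_interrupted)
```

The Q-learning target never looks at the executed next action, so a forced "stop" leaves it unchanged. The Sarsa target drops when the interruption happens, which is the kind of signal that could teach an agent to avoid being interrupted unless the algorithm is modified.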

Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.

Read more »

May 2016 Newsletter

 |   |  Newsletters

Research updates

General updates

News and links

A new MIRI research program with a machine learning focus

 |   |  MIRI Strategy

I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our pre-existing research agenda.

MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.

Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble present-day machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neural-network-inspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to present-day work in ML succeed in creating AGI?”

Much of this work will be aimed at improving our high-level theoretical understanding of task-directed AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in long-term and large-scale ways, task AI is limited to performing instructed tasks of limited scope, satisficing but not maximizing. Our hope is that investigating task AI from an ML perspective will help give information about both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.

To this end, we will begin by investigating eight relevant technical problems:

Read more »

New papers dividing logical uncertainty into two subproblems

 |   |  Papers

I’m happy to announce two new technical results related to the problem of logical uncertainty, perhaps our most significant results from the past year. In brief, these results split the problem of logical uncertainty into two distinct subproblems, each of which we can now solve in isolation. The remaining problem, in light of these results, is to find a unified set of methods that solve both at once.

The solutions for each subproblem are available in two new papers, based on work spearheaded by Scott Garrabrant: “Inductive coherence”¹ and “Asymptotic convergence in online learning with unbounded delays.”²

To give some background on the problem: Modern probability theory models reasoners’ empirical uncertainty, their uncertainty about the state of a physical environment, e.g., “What’s behind this door?” However, it can’t represent reasoners’ logical uncertainty, their uncertainty about statements like “this Turing machine halts” or “the twin prime conjecture has a proof that is less than a gigabyte long.”³

Roughly speaking, if you give a classical probability distribution variables for statements that could be deduced in principle, then the axioms of probability theory force you to put probability either 0 or 1 on those statements, because you’re not allowed to assign positive probability to contradictions. In other words, modern probability theory assumes that all reasoners know all the consequences of all the things they know, even if deducing those consequences is intractable.
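The step from "no positive probability on contradictions" to "probability 0 or 1" can be spelled out as a short derivation. The notation here ($T$ for the reasoner's background theory, $\varphi$ for a sentence, $P$ for the probability assignment) is introduced for illustration only.

```latex
% Suppose the background theory T proves \varphi.
% Then T together with \neg\varphi is contradictory, so \neg\varphi gets probability 0:
T \vdash \varphi
  \;\Longrightarrow\; P(\neg\varphi) = 0
  \;\Longrightarrow\; P(\varphi) = 1 - P(\neg\varphi) = 1 .
% Symmetrically, if T refutes \varphi:
T \vdash \neg\varphi
  \;\Longrightarrow\; P(\varphi) = 0 .
```

So any sentence that is decidable from what the reasoner already knows is forced to an extreme value, no matter how expensive the deduction would be to carry out.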

We want a generalization of probability theory that allows us to model reasoners that have uncertainty about statements that they have not yet evaluated. Furthermore, we want to understand how to assign “reasonable” probabilities to claims that are too expensive to evaluate.

Imagine an agent considering whether to use quicksort or mergesort to sort a particular dataset. They might know that quicksort typically runs faster than mergesort, but that doesn’t necessarily apply to the current dataset. They could in principle figure out which one uses fewer resources on this dataset, by running both of them and comparing, but that would defeat the purpose. Intuitively, they have a fair bit of knowledge that bears on the claim “quicksort runs faster than mergesort on this dataset,” but modern probability theory can’t tell us which information they should use and how.⁴

What does it mean for a reasoner to assign “reasonable probabilities” to claims that they haven’t computed, but could compute in principle? Without probability theory to guide us, we’re reduced to using intuition to identify properties that seem desirable, and then investigating which ones are possible. Intuitively, there are at least two properties we would want logically non-omniscient reasoners to exhibit:

1. They should be able to notice patterns in what is provable about claims, even before they can prove or disprove the claims themselves. For example, consider the claims “this Turing machine outputs an odd number” and “this Turing machine outputs an even number.” A good reasoner thinking about those claims should eventually recognize that they are mutually exclusive, and assign them probabilities that sum to at most 1, even before they can run the relevant Turing machine.

2. They should be able to notice patterns in sentence classes that are true with a certain frequency. For example, they should assign roughly 10% probability to “the 10^100th digit of pi is a 7” in the absence of any information about the digit, after observing (but not proving) that digits of pi tend to be uniformly distributed.
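The second property rests on an empirical observation that is easy to reproduce. The sketch below generates digits of pi with Gibbons' unbounded spigot algorithm and measures how often a 7 appears among the first 1000 decimal places. It illustrates where the "roughly 10%" guess comes from; it is not a method from either paper.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi (3, 1, 4, 1, 5, ...) one at a time,
    using Gibbons' unbounded spigot algorithm with exact integer arithmetic."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is now certain: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: absorb one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )

# Skip the leading 3 and take the first 1000 digits after the decimal point.
digits = list(islice(pi_digits(), 1, 1001))
freq_of_7 = digits.count(7) / len(digits)
print(freq_of_7)
```

A reasoner who has seen frequencies like this for many digit positions, but cannot afford to compute the 10^100th digit, has good grounds for a probability near 0.1 even though the digit is fully determined.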

MIRI’s work on logical uncertainty this past year can be very briefly summed up as “we figured out how to get these two properties individually, but found that it is difficult to get both at once.” Read more »

  1. This work was originally titled “Uniform coherence”. This post has been updated to reflect the new terminology. 
  2. Garrabrant’s IAFF forum posts provide a record of how these results were originally developed, as a response to Ray Solomonoff’s theory of algorithmic probability. Concrete Failure of the Solomonoff Approach and The Entangled Benford Test lay groundwork for the “Asymptotic convergence…” problem, a limited early version of which was featured in the “Asymptotic logical uncertainty and the Benford test” report. Inductive coherence is defined in Uniform Coherence 2, and an example of an inductively coherent predictor is identified in The Modified Demski Prior is Uniformly Coherent.
  3. This type of uncertainty is called “logical uncertainty” mainly for historical reasons. I think of it like this: We care about agents’ ability to reason about software systems, e.g., “this program will halt.” Those claims can be expressed in sentences of logic. The question “what probability does the agent assign to this machine halting?” then becomes “what probability does this agent assign to this particular logical sentence?” The truth of these statements could be determined in principle, but the agent may not have the resources to compute the answers in practice. 
  4. For more background on logical uncertainty, see Gaifman’s “Concerning measures in first-order calculi,” Garber’s “Old evidence and logical omniscience in Bayesian confirmation theory,” Hutter, Lloyd, Ng, and Uther’s “Probabilities on sentences in an expressive logic,” and Aaronson’s “Why philosophers should care about computational complexity.” 

April 2016 Newsletter

 |   |  Newsletters

Research updates

General updates

News and links

New paper on bounded Löb and robust cooperation of bounded agents

 |   |  Papers

MIRI Research Fellow Andrew Critch has written a new paper on cooperation between software agents in the Prisoner’s Dilemma, available on arXiv: “Parametric bounded Löb’s theorem and robust cooperation of bounded agents.” The abstract reads:

Löb’s theorem and Gödel’s theorem make predictions about the behavior of systems capable of self-reference with unbounded computational resources with which to write and evaluate proofs. However, in the real world, systems capable of self-reference will have limited memory and processing speed, so in this paper we introduce an effective version of Löb’s theorem which is applicable given such bounded resources. These results have powerful implications for the game theory of bounded agents who are able to write proofs about themselves and one another, including the capacity to out-perform classical Nash equilibria and correlated equilibria, attaining mutually cooperative program equilibrium in the Prisoner’s Dilemma. Previous cooperative program equilibria studied by Tennenholtz and Fortnow have depended on tests for program equality, a fragile condition, whereas “Löbian” cooperation is much more robust and agnostic of the opponent’s implementation.

Tennenholtz (2004) showed that cooperative equilibria exist in the Prisoner’s Dilemma between agents with transparent source code. This suggested that a number of results in classical game theory, where it is a commonplace that mutual defection is rational, might fail to generalize to settings where agents have strong guarantees about each other’s conditional behavior.

Tennenholtz’s version of program equilibrium, however, only established that rational cooperation was possible between agents with identical source code. Patrick LaVictoire and other researchers at MIRI supplied the additional result that more robust cooperation was possible between non-computable agents, and that it is possible to efficiently determine the outcomes of such games. However, some readers objected to the infinitary nature of the methods (for example, the use of halting oracles) and worried that not all of the results would carry over to finite computations.
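To see why cooperation based on source-code equality is fragile, consider the toy sketch below. Programs are represented as Python source strings that define an `act(my_source, opponent_source)` function; that representation, the bot names, and the `load`/`play` helpers are all assumptions made for this illustration.

```python
CLIQUE_BOT = '''
def act(my_source, opponent_source):
    # Cooperate only with an exact textual copy of myself.
    return "C" if opponent_source == my_source else "D"
'''

# Behaves exactly like CLIQUE_BOT, but its text differs by one comment line.
CLIQUE_BOT_VARIANT = CLIQUE_BOT + "# harmless trailing comment\n"

DEFECT_BOT = '''
def act(my_source, opponent_source):
    return "D"
'''

def load(source):
    """Compile a program's source text and return its `act` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["act"]

def play(source_a, source_b):
    """One-shot Prisoner's Dilemma in which each program reads the other's source."""
    move_a = load(source_a)(source_a, source_b)
    move_b = load(source_b)(source_b, source_a)
    return move_a, move_b

print(play(CLIQUE_BOT, CLIQUE_BOT))          # exact copies cooperate
print(play(CLIQUE_BOT, DEFECT_BOT))          # and cannot be exploited
print(play(CLIQUE_BOT, CLIQUE_BOT_VARIANT))  # but one extra comment breaks it
```

Two exact copies reach mutual cooperation and a defector gains nothing, yet a behaviorally identical program with one extra comment is treated as a stranger. Proof-based cooperation aims to remove that dependence on the opponent's exact implementation.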

Critch’s report demonstrates that robust cooperative equilibria exist for bounded agents. In the process, Critch proves a new generalization of Löb’s theorem, and therefore of Gödel’s second incompleteness theorem. This parametric version of Löb’s theorem holds for proofs that can be written out in n or fewer characters, where the parameter n can be set to any number. For more background on the result’s significance, see LaVictoire’s “Introduction to Löb’s theorem in MIRI research.”
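For reference, the classical (unbounded) form of Löb's theorem, and the way Gödel's second incompleteness theorem falls out of it, can be written as follows. Here $T$ is a sufficiently strong theory such as Peano arithmetic and $\Box P$ abbreviates "P is provable in $T$". This is the standard statement, not Critch's parametric version, which restricts attention to proofs of bounded length.

```latex
% Löb's theorem: if T proves "provability of P implies P", then T proves P.
T \vdash (\Box P \rightarrow P) \;\Longrightarrow\; T \vdash P

% Gödel's second incompleteness theorem is the special case P = \bot:
% a theory that proves its own consistency, \neg\Box\bot, is inconsistent.
T \vdash (\Box \bot \rightarrow \bot) \;\Longrightarrow\; T \vdash \bot
```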

The new Löb result shows that bounded agents face obstacles to self-referential reasoning similar to those faced by unbounded agents, and can also reap some of the same benefits. Importantly, this lemma will likely allow us to discuss many other self-referential phenomena going forward using finitary examples rather than infinite ones.


