Aligning advanced AI with human interests

MIRI’s mission is to ensure that the creation of smarter-than-human intelligence has
a positive impact. We aim to make advanced intelligent systems behave as
we intend even in the absence of immediate human supervision.

Agent Foundations Technical Agenda
(High reliability focus)

Machine Learning Technical Agenda
(Error tolerance focus)

Highly Reliable
Agent Design

What does optimal reasoning
look like for resource-bounded agents in the physical world?

MIRI focuses on AI approaches that can be made transparent (e.g., precisely specified decision algorithms, not genetic algorithms), so that humans can understand why AI systems behave as they do. For safety purposes, a mathematical equation defining general intelligence is more desirable than an impressive but poorly-understood code kludge.

Much of our research is therefore aimed at putting theoretical foundations under AI robustness work. We consider settings where traditional decision and probability theory frequently break down: settings where computation is expensive, there is no sharp agent/environment boundary, multiple agents exist, or self-referential reasoning is admitted.

Logical Induction

eprint at arXiv:1609.03543 [cs.AI].

Logical induction

We present a computable algorithm that assigns probabilities to every logical statement in a given formal language, and refines those probabilities over time. We show that it satisfies a number of intuitive desiderata, including: (1) it learns to predict patterns of truth and falsehood in logical statements, often long before having the resources to evaluate the statements, so long as the patterns can be written down in polynomial time; (2) it learns to use appropriate statistical summaries to predict sequences of statements whose truth values appear pseudorandom; and (3) it learns to have accurate beliefs about its own current beliefs, in a manner that avoids the standard paradoxes of self-reference.

These properties and many others follow from a logical induction criterion, which is motivated by a series of stock trading analogies. Roughly speaking, each logical sentence φ is associated with a stock that is worth $1 per share if φ is true and nothing otherwise, and we interpret the belief-state of a logically uncertain reasoner as a set of market prices, where Pn(φ) = 50% means that on day n, shares of φ may be bought or sold from the reasoner for 50¢. The logical induction criterion says (very roughly) that there should not be any polynomial-time computable trading strategy with finite risk tolerance that earns unbounded profits in that market over time.

A Formal Solution to the Grain of Truth Problem

in Uncertainty In Artificial Intelligence: Proceedings of the Thirty-Second Conference (2016).

A formal solution to the grain of truth problemA Bayesian agent acting in a multi-agent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.

In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play ε-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.

Functional Decision Theory: A New Theory of Instrumental Rationality

eprint at arXiv:1710.05060 [cs.AI].

Functional decision theory: A new theory of instrumental rationalityThis paper describes and motivates a new decision theory known as functional decision theory (FDT), as distinct from causal decision theory and evidential decision theory. Functional decision theorists hold that the normative principle for action is to treat one’s decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?” Adhering to this principle delivers a number of benefits, including the ability to maximize wealth in an array of traditional decision-theoretic and game-theoretic problems where CDT and EDT perform poorly. Using one simple and coherent decision rule, functional decision theorists (for example) achieve more utility than CDT on Newcomb’s problem, more utility than EDT on the smoking lesion problem, and more utility than both in Parfit’s hitchhiker problem. In this paper, we define FDT, explore its prescriptions in a number of different decision problems, compare it to CDT and EDT, and give philosophical justifications for FDT as a normative theory of decision-making.

Proof-Producing Reflection for HOL

in Interactive Theorem Proving: 6th International Conference, ITP 2015, Nanjing, China, August 24-27, 2015, Proceedings.

Proof-producing reflection for HOLWe present a reflection principle of the form “If  ⌜𝜑⌝  is provable, then 𝜑” implemented in the HOL4 theorem prover, assuming the existence of a large cardinal. We use the large-cardinal assumption to construct a model of HOL within HOL, and show how to ensure  𝜑  has the same meaning both inside and outside of this model. Soundness of HOL implies that if  ⌜𝜑⌝  is provable, then it is true in this model, and hence 𝜑 holds. We additionally show how this reflection principle can be extended, assuming an infinite hierarchy of large cardinals, to implement model polymorphism, a technique designed for verifying systems with self-replacement functionality.

Error Tolerance
Value Learning

How can an advanced learning system be made to accept and
assist with online debugging
and adjustment of its goals?

Using training data to teach advanced AI systems what we value looks more promising than trying to code in everything we care about by hand. However, we know very little about how to discern when training data is unrepresentative of the agent’s future environment, or how to ensure that the agent not only learns about our values but accepts them as its own.

Additionally, rational agents pursuing some goal have an incentive to protect their goal-content. No matter what their current goal is, it will very likely be better served if the agent continues to promote it than if the agent changes goals. This suggests that it may be difficult to improve an agent’s alignment with human interests over time, particularly when the agent is smart enough to model and adapt to its programmers’ goals. Making value learning systems error-tolerant is likely to be necessary for safe online learning.

The Value Learning Problem

presented at the IJCAI 2016 Ethics for Artificial Intelligence workshop.

The Value Learning ProblemA superintelligent machine would not automatically act as intended: it will act as programmed, but the fit between human intentions and written code could be poor. We discuss methods by which a system could be constructed to learn what to value. We highlight open problems specific to inductive value learning (from labeled training data), and raise a number of questions about the construction of systems which model the preferences of their operators and act accordingly.


presented at the AAAI 2015 Ethics and Artificial Intelligence workshop.

CorrigibilityAs AI systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shut-down button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shut-down behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.


When will highly adaptive and general machine intelligence be invented, and under what circumstances?

In addition to our mathematical research, MIRI investigates important strategic questions. What can (and can’t) we predict about the future of AI? How can we improve our forecasting ability? Which interventions available today appear to be the most beneficial, given what little we do know?

The Ethics of Artificial Intelligence

in The Cambridge Handbook of Artificial Intelligence.

The ethics of artificial intelligenceThe possibility of creating thinking machines raises a host of ethical issues. These questions relate both to ensuring that such machines do not harm humans and other morally relevant beings, and to the moral status of the machines themselves. The first section discusses issues that may arise in the near future of AI. The second section outlines challenges for ensuring that AI operates safely as it approaches humans in its intelligence. The third section outlines how we might assess whether, and in what circumstances, AIs themselves have moral status. In the fourth section, we consider how AIs might differ from humans in certain basic respects relevant to our ethical assessment of them. The final section addresses the issues of creating AIs more intelligent than human, and ensuring that they use their advanced intelligence for good rather than ill.

Formalizing Convergent Instrumental Goals

presented at the AAAI 2016 AI, Ethics and Society workshop.

Formalizing convergent instrumental goalsOmohundro has argued that sufficiently advanced AI systems of any design would, by default, have incentives to pursue a number of instrumentally useful subgoals, such as acquiring more computing power and amassing many resources. Omohundro refers to these as “basic AI drives,” and he, along with Bostrom and others, has argued that this means great care must be taken when designing powerful autonomous systems, because even if they have harmless goals, the side effects of pursuing those goals may be quite harmful. These arguments, while intuitively compelling, are primarily philosophical. In this paper, we provide formal models that demonstrate Omohundro’s thesis, thereby putting mathematical weight behind those intuitive claims.

Intelligence Explosion Microeconomics

MIRI technical report 2013–1.

Intelligence Explosion MicroeconomicsI.J. Good suggested that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly surpassing human capability. How can we model and test this hypothesis?

We identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. Many phenomena have been claimed as evidence for various positions in this debate, from the observed course of hominid evolution to Moore’s Law to the competence over time of chess programs. This paper explores issues that arise when trying to interpret this evidence in light of Good’s hypothesis, and proposes that the next step in this research is to formalize return-on-investment curves, so that each position can formally state which models they hold to be falsified by historical observations.