Aligning advanced AI with human interests

MIRI’s mission is to ensure that the creation of smarter-than-human intelligence has
a positive impact. We aim to make advanced intelligent systems behave as
we intend even in the absence of immediate human supervision.

Agent Foundations Technical Agenda
(High reliability focus)

Machine Learning Technical Agenda
(Error tolerance focus)

Highly Reliable Agent Design

What does optimal reasoning look like for resource-bounded agents in the physical world?

MIRI focuses on AI approaches that can be made transparent (e.g., precisely specified decision algorithms, not genetic algorithms), so that humans can understand why AI systems behave as they do. For safety purposes, a mathematical equation defining general intelligence is more desirable than an impressive but poorly understood code kludge.

Much of our research is therefore aimed at putting theoretical foundations under AI robustness work. We consider settings where traditional decision and probability theory frequently break down: settings where computation is expensive, there is no sharp agent/environment boundary, multiple agents exist, or self-referential reasoning is admitted.


A Formal Solution to the Grain of Truth Problem

in Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Second Conference (2016).

A Bayesian agent acting in a multi-agent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.

In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play ε-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.
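
As a rough sketch of the setup (notation ours, not the paper’s): the agent holds a prior ξ over a class Π of policies for the other agents, and the problem is to make Π large enough that it also contains the Bayes-optimal responses to every such prior.

```latex
% Sketch of the grain of truth condition (notation ours):
% \xi is the agent's prior over a policy class \Pi; \pi^\ast is another agent's true policy.
\xi(\pi^\ast) > 0 \quad \text{(``the prior contains a grain of truth'')}
% Grain of truth problem: construct \Pi so that, for every lower semicomputable
% prior \xi over \Pi, the \xi-Bayes-optimal policies are themselves members of \Pi.
```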


Parametric Bounded Löb’s Theorem and Robust Cooperation of Bounded Agents

eprint at arXiv:1602.04184 [cs.GT].

Löb’s theorem and Gödel’s theorem make predictions about the behavior of systems capable of self-reference with unbounded computational resources with which to write and evaluate proofs. However, in the real world, systems capable of self-reference will have limited memory and processing speed, so in this paper we introduce an effective version of Löb’s theorem which is applicable given such bounded resources. These results have powerful implications for the game theory of bounded agents who are able to write proofs about themselves and one another, including the capacity to out-perform classical Nash equilibria and correlated equilibria, attaining mutually cooperative program equilibrium in the Prisoner’s Dilemma. Previous cooperative program equilibria studied by Tennenholtz and Fortnow have depended on tests for program equality, a fragile condition, whereas “Löbian” cooperation is much more robust and agnostic of the opponent’s implementation.
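
For reference, the classical (unbounded) theorem being adapted can be stated as follows; the paper’s parametric bounded version, roughly, replaces provability with provability within a length bound and tracks how those bounds grow.

```latex
% Classical L\"ob's theorem (unbounded proof resources):
\text{If } \; \mathrm{PA} \vdash \mathrm{Prov}_{\mathrm{PA}}(\ulcorner\varphi\urcorner) \rightarrow \varphi,
\;\text{ then } \; \mathrm{PA} \vdash \varphi.
% Rough shape of the bounded analogue (details in the paper): replace Prov_PA
% with provability by proofs of length at most k, and conclude that \varphi
% itself has a proof of suitably bounded length for sufficiently large k.
```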


Proof-Producing Reflection for HOL

in Interactive Theorem Proving: 6th International Conference, ITP 2015, Nanjing, China, August 24-27, 2015, Proceedings.

We present a reflection principle of the form “If ⌜𝜑⌝ is provable, then 𝜑” implemented in the HOL4 theorem prover, assuming the existence of a large cardinal. We use the large-cardinal assumption to construct a model of HOL within HOL, and show how to ensure 𝜑 has the same meaning both inside and outside of this model. Soundness of HOL implies that if ⌜𝜑⌝ is provable, then it is true in this model, and hence 𝜑 holds. We additionally show how this reflection principle can be extended, assuming an infinite hierarchy of large cardinals, to implement model polymorphism, a technique designed for verifying systems with self-replacement functionality.
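
Schematically, the reflection principle has the following shape (with the model construction that justifies it left implicit):

```latex
% Reflection principle implemented in HOL4 (schematic form):
\vdash_{\mathrm{HOL}} \mathrm{Provable}(\ulcorner\varphi\urcorner) \;\Rightarrow\; \varphi
% Justified by building, from a large-cardinal assumption, a model of HOL inside
% HOL in which \varphi means the same thing as \ulcorner\varphi\urcorner; soundness
% then carries provability of \ulcorner\varphi\urcorner over to the truth of \varphi.
```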

Error Tolerance and Value Learning

How can an advanced learning system be made to accept and assist with online debugging and adjustment of its goals?

Using training data to teach advanced AI systems what we value looks more promising than trying to code in everything we care about by hand. However, we know very little about how to discern when training data is unrepresentative of the agent’s future environment, or how to ensure that the agent not only learns about our values but accepts them as its own.

Additionally, rational agents pursuing some goal have an incentive to protect their goal-content. No matter what their current goal is, it will very likely be better served if the agent continues to promote it than if the agent changes goals. This suggests that it may be difficult to improve an agent’s alignment with human interests over time, particularly when the agent is smart enough to model and adapt to its programmers’ goals. Making value learning systems error-tolerant is likely to be necessary for safe online learning.
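
A toy way to state the incentive (our illustration, not a result from the papers below): an agent that currently maximizes a utility function U evaluates possible futures by U’s own lights, so the following inequality typically holds.

```latex
% Goal-content integrity, toy version (illustrative only):
\mathbb{E}\bigl[\,U \mid \text{keep goal } U\,\bigr] \;\ge\; \mathbb{E}\bigl[\,U \mid \text{switch to goal } U'\,\bigr]
% Typically true for alternative goals U', since a U-maximizer usually serves U
% better than a U'-maximizer would; unforced goal changes therefore get resisted.
```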


The Value Learning Problem

presented at the IJCAI 2016 Ethics for Artificial Intelligence workshop.

A superintelligent machine would not automatically act as intended: it would act as programmed, but the fit between human intentions and written code could be poor. We discuss methods by which a system could be constructed to learn what to value. We highlight open problems specific to inductive value learning (from labeled training data), and raise a number of questions about the construction of systems which model the preferences of their operators and act accordingly.
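
A minimal caricature of inductive value learning (ours, not the paper’s; every name and feature here is invented for illustration) fits a model to operator-labeled examples of valued versus disvalued outcomes, and the difficulties noted above surface immediately as distribution shift and as the gap between predicting labels and adopting them as goals:

```python
# Toy caricature of inductive value learning; illustration only, not a method
# from the paper. The "operator" labels outcomes as valued (1) or not (0).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def operator_label(x):
    # Stand-in for the operator's real values: only "moderate" outcomes are valued.
    return (np.abs(x[:, 0]) < 1.0).astype(int)

# Training outcomes cluster in the regime the operator actually saw and labeled.
X_train = rng.normal(loc=0.0, scale=0.5, size=(1000, 2))
value_model = LogisticRegression().fit(X_train, operator_label(X_train))

# Deployment outcomes sit far outside the labeled regime.
X_deploy = rng.normal(loc=4.0, scale=0.5, size=(1000, 2))

print("train accuracy :", value_model.score(X_train, operator_label(X_train)))
print("deploy accuracy:", value_model.score(X_deploy, operator_label(X_deploy)))
# Whatever the learned model predicts out here is an artifact of extrapolation,
# not of operator feedback; and a planner that optimizes the learned score,
# rather than adopting the operator's intent, will steer toward such regimes.
```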


Corrigibility

presented at the AAAI 2015 Ethics and Artificial Intelligence workshop.

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shut-down button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shut-down behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
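
Schematically, the utility functions analyzed have roughly the following shape (simplified from the paper’s setup):

```latex
% Simplified shape of the shutdown proposals analyzed in the paper:
U(\text{outcome}) \;=\;
\begin{cases}
  U_N(\text{outcome}) & \text{if the shutdown button is never pressed,}\\
  U_S(\text{outcome}) & \text{if the shutdown button is pressed.}
\end{cases}
% Difficulty: whenever the agent expects one branch to score higher than the other,
% it gains an incentive to cause or prevent the button press. The desiderata ask
% for indifference to the button, faithful shutdown once it is pressed, and
% propagation of that behavior to any subsystems or successors the agent creates.
```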

Forecasting

When will highly adaptive and general machine intelligence be invented, and under what circumstances?

In addition to our mathematical research, MIRI investigates important strategic questions. What can (and can’t) we predict about the future of AI? How can we improve our forecasting ability? Which interventions available today appear to be the most beneficial, given what little we do know?


The Ethics of Artificial Intelligence

in The Cambridge Handbook of Artificial Intelligence.

The possibility of creating thinking machines raises a host of ethical issues. These questions relate both to ensuring that such machines do not harm humans and other morally relevant beings, and to the moral status of the machines themselves. The first section discusses issues that may arise in the near future of AI. The second section outlines challenges for ensuring that AI operates safely as it approaches humans in its intelligence. The third section outlines how we might assess whether, and in what circumstances, AIs themselves have moral status. In the fourth section, we consider how AIs might differ from humans in certain basic respects relevant to our ethical assessment of them. The final section addresses the issues of creating AIs more intelligent than human, and ensuring that they use their advanced intelligence for good rather than ill.


Formalizing Convergent Instrumental Goals

presented at the AAAI 2016 AI, Ethics and Society workshop.

Omohundro has argued that sufficiently advanced AI systems of any design would, by default, have incentives to pursue a number of instrumentally useful subgoals, such as acquiring more computing power and amassing many resources. Omohundro refers to these as “basic AI drives,” and he, along with Bostrom and others, has argued that this means great care must be taken when designing powerful autonomous systems, because even if they have harmless goals, the side effects of pursuing those goals may be quite harmful. These arguments, while intuitively compelling, are primarily philosophical. In this paper, we provide formal models that demonstrate Omohundro’s thesis, thereby putting mathematical weight behind those intuitive claims.
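
The flavor of such a model can be conveyed with a toy optimizer (our caricature, not the paper’s formalism): the agent’s terminal goal mentions only one region, yet resources collected anywhere raise the utility attainable there, so the optimal plan appropriates everything.

```python
# Toy illustration of convergent instrumental resource acquisition.
# This is a caricature for intuition, not the formal model in the paper.
from itertools import product

REGIONS = ["home", "r1", "r2", "r3"]            # the goal only mentions "home"
RESOURCES = {"home": 1.0, "r1": 2.0, "r2": 0.5, "r3": 3.0}
HARVEST_COST = 0.1                               # small cost per region harvested

def utility(plan):
    """Utility is realized only at 'home', but resources harvested in any region
    can be converted into value there (minus the harvesting cost)."""
    harvested = sum(RESOURCES[r] for r in REGIONS if plan[r])
    cost = HARVEST_COST * sum(plan.values())
    return harvested - cost

best_plan, best_utility = None, float("-inf")
for choices in product([0, 1], repeat=len(REGIONS)):
    plan = dict(zip(REGIONS, choices))
    if utility(plan) > best_utility:
        best_plan, best_utility = plan, utility(plan)

print(best_plan, best_utility)
# The optimal plan harvests every region, even though the terminal goal only
# mentions 'home': resources are instrumentally useful whatever the goal is.
```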


Intelligence Explosion Microeconomics

MIRI technical report 2013–1.

I.J. Good suggested that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly surpassing human capability. How can we model and test this hypothesis?

We identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. Many phenomena have been claimed as evidence for various positions in this debate, from the observed course of hominid evolution to Moore’s Law to the competence over time of chess programs. This paper explores issues that arise when trying to interpret this evidence in light of Good’s hypothesis, and proposes that the next step in this research is to formalize return-on-investment curves, so that each position can formally state which models it holds to be falsified by historical observations.
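
One toy family of return-on-reinvestment curves (our illustration; the report argues for far more careful curve-fitting than this) treats capability I as being reinvested directly into its own growth:

```latex
% Toy reinvestment model (illustrative only): capability I reinvested in itself.
\frac{dI}{dt} = c\, I^{\,k}
% k < 1: diminishing returns, polynomial growth in capability;
% k = 1: linearly compounding returns, exponential growth;
% k > 1: super-linear returns, I(t) = \bigl(I_0^{\,1-k} - c\,(k-1)\,t\bigr)^{-1/(k-1)}
%        diverges in finite time, the "intelligence explosion" regime.
```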