A Guide to MIRI’s Research
by Nate Soares
In order to ensure that the development of smarterthanhuman artificial intelligence has a positive impact, we must meet three formidable challenges: First, we must design smarterthanhuman systems to be highly reliable, such that we can justify confidence that the system will fulfill the specified goals or preferences. Second, the designs must be errortolerant, so that the systems are amenable to online modification and correction in the face of inevitable human error. Third, the system must learn beneficial goals or preferences, so that it has a positive impact on humanity.
MIRI’s current research program focuses on understanding how to meet these challenges in principle: there are aspects of reliable reasoning that we do not yet understand even in theory; there are questions of bounded rationality that we could not yet solve even using unlimited computing resources. Our study focuses on finding any solutions to these problems, practical or not, as a first step. As such, our modern research looks much more like pure mathematics than software engineering or practical machine learning.
This guide briefly overviews our research priorities, and provides resources that will help you get to the cutting edge on each subject area. This guide is not intended to justify these research topics; for further motivation of our approach, refer to our technical agenda and supporting papers.
How to use this guide
Perhaps the shortest path to being hired as a MIRI researcher is to study the materials below, then attend the nearest MIRIx workshop two or three times, then attend a MIRI workshop or two and show an ability to contribute at the cutting edge. The same path (read these materials, then work your way through some workshops) will also help if you want to research these topics at some other institution.
You can learn most of the requisite material by simply reading the textbooks and papers below. However, with all of the material in this guide, please do not grind away for the sake of grinding away. If you already know the material, skip ahead. If one of the active research areas fails to capture your interest, switch to a different one. If you don’t like one of the recommended textbooks, find a better one or skip it entirely. The goal is to get yourself to the front lines with a solid understanding of what our research says. Hopefully, this guide can help you achieve that goal, but don’t let it hinder you!
The basics
It’s important to have some fluency with elementary mathematical concepts before jumping directly into our active research topics. All of our research areas are wellserved by a basic understanding of computation, logic, and probability theory. Below are some resources to get you started.
You don’t need to read the books in this section in the order listed. Pick up whatever is interesting, and don’t hesitate to skip back and forth between the research areas and the basics as necessary.

It’s also important to understand the concept of VNM rationality, which I recommend learning from the Wikipedia article but which can also be picked up from the original book. Von Neumann and Morgenstern showed that any agent obeying a few simple consistency axioms acts with preferences characterizable by a utility function. While some expect that we may ultimately need to abandon VNM rationality in order to construct reliable intelligent agents, the VNM framework remains the most expressive framework we have for characterizing the behavior of arbitrarily powerful agents. (For example, see the orthogonality thesis and the instrumental convergence thesis from Bostrom’s “The Superintelligent Will.”) The concept of VNM rationality is used throughout all our active research areas.
Realistic worldmodels
Formalizing beneficial goals does you no good if your smarterthanhuman system is unreliable. There are aspects of good reasoning that we don’t yet understand, even in principle. It is likely possible to gain insight by building practical systems that use algorithms which seem to work, even if the reasons why they work are not yet wellunderstood: often, theoretical understanding follows in the wake of practical application. However, we consider this approach imprudent when designing systems that have the potential to become superintelligent: we will be safer if we have a theory of general intelligence on hand before attempting to create practical superintelligent systems.
For this reason, many of our active research topics focus on parts of general intelligence that we do not yet understand how to solve, even in principle. For example, consider the following problem:
I have a computer program, known as the “universe.” One function in the universe is undefined. Your job is to provide me with a computer program of the appropriate type to complete my universe program. Then, I’ll run my universe program. My goal is to score your agent according to how well it learns what the original universe program is.
How could I do this? Solomonoff’s theory of inductive inference sheds some light on a theoretical solution: it describes a method for making ideal predictions from observations, but only in the case where the predictor lives outside the environment. Solomonoff induction has led to many useful tools for thinking about inductive inference (including Kolmogorov complexity, the universal prior, and AIXI), but the problem becomes decidedly more difficult in the case where the agent is a subprocess of the universe, computed by the universe.
In the case where the agent is embedded inside the environment, the induction problem gets murky: what counts as “learning the universe program”? Against what distribution over environments should the agent be scored? What constitutes ideal induction in the case where the boundary between “agent” and “environment” becomes blurry? These are questions of “naturalized induction.”
Soares’ “Formalizing two problems of realistic worldmodels” further motivates problems of naturalized induction as relevant to the construction of a theory of general intelligence.
Altair’s “An Intuitive Explanation of Solomonoff Induction” explains Solomonoff’s theory of inductive inference, which is important background knowledge when it comes to understanding open problems of naturalized induction.
Bensinger’s “Naturalized induction” (series) explores questions of naturalized induction in more detail.
Solving problems of naturalized induction requires gaining a better understanding of realistic worldmodels: What is the set of “possible realities”? What sort of priors about the environment would an ideal agent use? Answers to these questions must not only allow good reasoning, they must allow for the specification of human goals in terms of those worldmodels.
For example, in Solomonoff induction (and in Hutter’s AIXI), Turing machines are used to model the environment. Pretend that the only thing we value is diamonds (carbon atoms covalently bound to four other carbon atoms). Now, say I give you a Turing machine. Can you tell me how much diamond is within?
In order to design an agent that pursues goals specified in terms of its world models, the agent must have some way of identifying the ontology of our goals (carbon atoms) inside its world models (Turing machines). This “ontology identification” problem is discussed in “Formalizing Two Problems of Realistic World Models” (linked above), and was first introduced by De Blanc:
De Blanc’s “Ontological crises in artificial agents’ value systems” asks how one might make an agent’s goals robust to changes in ontology. If the agent starts with an atomic model of physics (where carbon atoms are ontologically basic) then this may not be hard. But what happens when the agent builds a nuclear model of physics (where atoms are constructed from neutrons and protons)? If the “carbon recognizer” was hardcoded, the agent might fail to identify any carbon in this new worldmodel, and may start acting strangely (in search of hidden “true carbon”). How could the agent be designed so that it can successfully identify “fourproton atoms” with “carbon atoms” in response to this ontological crisis?
Legg and Hutter’s “Universal Intelligence: A Definition of Machine Intelligence” describes AIXI, a universally intelligent agent in settings where the agent is separate from the environment, and a “scoring metric” used to rate the intelligence of various agent programs in this setting. Hutter’s AIXI and Legg’s scoring metric are very similar in spirit to what we are looking for in response to problems of naturalized induction and ontology identification. The two differences are that AIXI lives in a universe where agent and environment are separated whereas naturalized induction requires a solution where the agent is embedded within the environment, and AIXI maximizes rewards specified in terms of observations whereas we desire a solution that optimizes rewards specified in terms of the outside world.
You can learn more about AIXI in Hutter’s book Universal Artificial Intelligence, although reading Legg’s paper (linked above) is likely sufficient for our purposes.
Decision theory
Say I give you the following: (1) a computer program describing a universe; (2) a computer program describing an agent; (3) a set of actions available to the agent; (4) a set of preferences specified over the history of states that the universe has been in. I task you with identifying the best action available to the agent, with respect to those preferences. For example, your inputs might be:
def Universe(): outcomes = {Lo, Med, Hi} actions = {One, Two, Three} def Agent(): worldmodel = {Lo: One, Hi: Two, Med: Three} return worldmodel[Hi] territory = {Lo: One, Med: Two, Hi: Three} return territory[Agent()]
def Agent(): worldmodel = {Lo: One, Hi: Two, Med: Three} return worldmodel[Hi]
actions = {One, Two, Three}
Hi > Med > Lo
(Notice how the agent is embedded in the environment.) This is another question that we don’t know how to answer, even in principle. It may seem easy: just iterate over each action, figure out which outcome the agent would get if it took that action, and then pick the action that leads to the best outcome. But as a matter of fact, in this thought experiment, the agent is a deterministic subprocess of a deterministic computer program: there is exactly one action that the agent is going to output, and asking what “would happen” if a deterministic part of a deterministic program did something that it doesn’t do is illdefined.
In order to evaluate what “would happen” if the agent took a different action, a “counterfactual environment” (where the agent does something that it doesn’t) must be constructed. Satisfactory theories of counterfactual reasoning do not yet exist. We don’t yet understand how to identify the best action available to an agent embedded within its environment, even in theory, even given full knowledge of the universe and our preferences and given unlimited computing power.
Solving this problem will require a better understanding of counterfactual reasoning; this is the domain of decision theory.
Decision TheoryPeterson’s textbook explain the field of normative decision theory in broad strokes. For a quicker survey, with a stronger focus on Newcomblike problems, see Muehlhauser’s “Decision theory FAQ.” Game TheoryMany open problems in decision theory involve multiagent settings. I have heard good things about Tadelis’ textbook, but have not read it myself. You also may have luck with Scott Alexander’s “Introduction to game theory” on LessWrong. 
Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that selfmodifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:
Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.
If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:
Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.
Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of noncausal connections between decision algorithms.
Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.
Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.
MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.
Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.
Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.
However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:
Slepnev’s “An example of selffulfilling spurious proofs in UDT” explains how UDT can achieve suboptimal results due to spurious proofs.
BensonTilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proofsearch order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.
In order to study multiagent settings, Patrick LaVictoire has developed a modal agents framework, which has also allowed us to use provability logic to make some novel progress in the field of decision theory:
Barasz et al.’s “Robust cooperation in the Prisoner’s Dilemma” allows us to consider agents which decide whether or not to cooperate with each other based only upon what they can prove about each other’s behavior. This prevents infinite regress; in fact, the behavior of two agents which act only according to what they can prove about the behavior of the other can be determined in quadratic time using results from provability logic.
UDT was developed by Wei Dai and Vladimir Slepnev, among others. Dai’s “Towards a new decision theory” introduced the idea, and Slepnev’s “A model of UDT with a halting oracle” provided an early first formalization. Slepnev also described a strange problem with UDT wherein it seems as if agents are rewarded for having less intelligence, in “Agent simulates predictor”.
These blog posts are of historical interest, but nearly all of their content is in ”Toward idealized decision theory”, above.
Logical uncertainty
Imagine a black box, with one input chute and two output chutes. A ball can be put into the input chute, and it will come out of one of the two output chutes. Inside the black box is a Rube Goldberg machine which takes the ball from the input chute to one of the output chutes.
A perfect probabilistic reasoner who doesn’t know which Rube Goldberg machine is in the box doesn’t know how the box will behave, but if they could figure out which machine is inside the box, then they would know which chute would take the ball. This reasoner is environmentally uncertain.
A realistic reasoner might know which machine is in the box, and might know exactly how the machine works, but may lack the deductive capability to figure out where the machine will drop the ball. This reasoner is logically uncertain.
Probability theory assumes logical omniscience; it assumes that reasoners know all consequences of the things they know. In reality, bounded reasoners are not logically omniscient: we can know precisely which machine the box implements and precisely how the machine works, and just not have the time to deduce where the ball comes out. We reason under logical uncertainty.
A formal theory of reasoning under logical uncertainty does not yet exist. Gaining this understanding is extremely important when it comes to constructing a highly reliable generally intelligent system: whenever an agent reasons about the behavior of complex systems, computer programs, or other agents, it must operate under at least a little logical uncertainty.
To understand the state of the art, a solid understanding of probability theory is a must; consider augmenting the first few chapters of Jaynes with Feller, chapters 1, 5, 6, and 9, and then study the following papers:
Soares & Fallenstein’s “Questions of reasoning under logical uncertainty” provides a general introduction, explaining the field of logical uncertainty and motivating its relevance to MIRI’s research program.
Gaifman’s “Concerning measures in firstorder calculi” looked at this problem many years ago. Gaifman has largely focused on a relevant subproblem, which is the assignment of probabilities to different models of a formal system (assuming that once the model is known, all consequences of that model are known). We are now attempting to expand this approach to a more complete notion of logical uncertainty (where a reasoner can know what the model is but not know the implications of that model), but work by Gaifman is still useful to gain a historical context and an understanding of the difficulties surrounding logical uncertainty.
Hutter et al.’s “Probabilities on sentences in an expressive logic” largely looks at the problem of logical uncertainty assuming access to infinite computing power (and many levels of halting oracles). Understanding Hutter’s approach (and what can be done with infinite computing power) helps flesh out our understanding of where the difficult questions lie.
Demski’s “Logical prior probability” provides an computably approximable logical prior. Following Demski, our work largely focuses on the creation of an approximable prior probability distribution over logical sentences, as the act of refining and approximating a logical prior is very similar to the act of reasoning under logical uncertainty in general.
Christiano’s “Nonomniscience, probabilistic inference, and metamathematics” largely follows this approach. This paper provides some early practical considerations about the generation of logical priors, and highlights a few open problems.
For more historical work on this problem, see Gaifman’s “Probabilities over rich languages…” and “Reasoning with limited resources and assigning probabilities to arithmetic statements.”
Vingean reflection
Much of what makes the AI problem unique is that a sufficiently advanced system will be able to do higherquality science and engineering than its human programmers. Many of the possible hazards and benefits of an advanced system stem from its potential to bootstrap itself to higher levels of capability, possibly leading to an intelligence explosion.
If an agent achieves superintelligence via recursive selfimprovement, then the impact of the resulting system depends entirely upon the ability of the initial system to reason reliably about agents that are more intelligent than itself. What sort of reasoning methods could a system use in order to justify extremely high confidence in the behavior of a yet more intelligent system? We refer to this sort of reasoning as “Vingean reflection”, after Vernor Vinge (1993), who noted that it is not possible in general to precisely predict the behavior of agents which are more intelligent than the reasoner.
A reasoner performing Vingean reflection must necessarily reason abstractly about the more intelligent agent. This will almost certainly require some form of highconfidence logically uncertain reasoning, but in lieu of a working theory of logical uncertainty, reasoning about proofs (using formal logic) is the best available formalism for studying abstract reasoning. As such, a modern study of Vingean reflection requires a background in formal logic:
FirstOrder LogicMIRI’s existing toy models for studying selfmodifying agents are largely based on this logic. Understanding the nuances of firstorder logic is crucial for using the tools we have developed for studying formal systems capable of something approaching confidence in similar systems. 
We study Vingean reflection by constructing toy models of agents which are able to gain some form of confidence in highly similar systems. To get to the cutting edge, read the following papers:
Fallenstein & Soares’ “Vingean reflection: Reliable reasoning for selfimproving agents” introduces the field of Vingean reflection, and motivates its connection to MIRI’s research program.
Yudkowsky’s “The procrastination paradox” goes into more detail on the need for satisfactory solutions to walk a fine line between the Löbian obstacle (a problem stemming from too little “selftrust”) and unsoundness that come from too much selftrust.
Christiano et al.’s “Definability of truth in probabilistic logic” describes an early attempt to create a formal system that can reason about itself while avoiding paradoxes of selfreference. It succeeds, but has ultimately been shown to be unsound. My walkthrough for this paper may help put it into a bit more context.
Fallenstein & Soares’ “Problems of selfreference in selfimproving spacetime embedded intelligence” describes our simple suggesterverifier model for studying agents that produce slightly improved versions of themselves, or ’tile’ themselves. The paper demonstrates a toy scenario in which sound agents can successfully tile to (e.g., gain high confidence in) other similar agents.
Yudkowsky & Herreshoff’s “Tiling agents for selfmodifying AI” is an older, choppier introduction to Vingean reflection which may be easier to work through using my walkthrough.
If you’re excited about this research topic, there are a number of other relevant tech reports. Unfortunately, most of them don’t explain their motivations well, and have not yet been put into their greater context.
Fallenstein’s “Procrastination in probabilistic logic” illustrates how Christiano et al.’s probabilistic reasoning system is unsound and vulnerable to the procrastination paradox. Yudkowsky’s “Distributions allowing tiling…” takes some early steps towards probabilistic tiling settings.
Fallenstein’s “Decreasing mathematical strength…” describes one unsatisfactory property of Parametric Polymorphism, a partial solution to the Löbian obstacle. Soares’ “Fallenstein’s monster” describes a hackish formal system which avoids the above problem. It also showcases a mechanism for restricting an agent’s goal predicate which can also be used by Parametric Polymorphism to create a less restrictive version of PP than the one explored in the tiling agents paper. Fallenstein’s “An infinitely descending sequence of sound theories…” describes a more elegant partial solution to the Löbian obstacle, which is now among our favored partial solutions.
An understanding of recursive ordinals provides a useful context from which to understand these results, and can be gained by reading Franzén’s “Transfinite progressions: A second look at completeness.”
Corrigibility
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
This field of research is basically brandnew, so all it takes in order to get up to speed is to read a paper or two:
Soares et al.’s “Corrigibility” introduces the field at large, along with a few open problems.
Armstrong’s “Proper value learning through indifference” discusses one potential approach for making agents indifferent between which utility function they maximize, which is a small step towards agents that allow themselves to be modified.
Our current work on corrigibility focuses mainly on a small subproblem known as the “shutdown problem”: how do you construct an agent that shuts down upon the press of a shutdown button, and which does not have incentives to cause or prevent the pressing of the button? Within that subproblem, we currently focus on the utility indifference problem: how could you construct an agent which allows you to switch which utility function it maximizes, without giving it incentives to affect whether the switch occurs? Even if we had a satisfactory solution to the utility indifference problem, this would not yield a satisfactory solution to the shutdown problem, as it still seems difficult to adequately specify “shutdown behavior” in a manner that is immune to perverse instantiation. Stuart Armstrong has written several blog posts about the specification of “reduced impact” AGIs:
These first attempts are not yet a full solution, but they should get you up to speed on our current understanding of the problem.
Early work in corrigibility can be found on the web forum Less Wrong. Most of the relevant results are captured in the above papers. One of the more interesting of these is “Cake or Death”, an example of the “motivated value selection” problem. In this example, an agent with uncertainty about its utility function benefits from avoiding information that reduces its uncertainty.
Armstrong’s “The mathematics of reduced impact: help needed” lists initial ideas for specifying reducedimpact agents, and his “Reduced impact in practice: randomly sampling the future” sketches a simple method for assessing whether a future has been impacted.
Armstrong’s “Utility indifference” outlines the original utility indifference idea, and is largely interesting for historical reasons. It is subsumed by the “Proper value learning through indifference” paper linked above.
Value learning
Since our own understanding of our values is fuzzy and incomplete, perhaps the most promising approach for loading values into a powerful AI is to specify a criterion for the agent to learn our values incrementally. But this presents a number of interesting problems:
Say you construct a training set containing many outcomes filled with happy humans (labeled “good”) and other outcomes filled with sad humans (labeled “bad”). The simplest generalization, from this data, might be that humans really like humanshaped smilingthings: this agent may then try to build many tiny animatronic happylooking people.
Value learning must be an online process: the system must be able to identify ambiguities and raise queries about these ambiguities to the user. It must not only identify cases that it doesn’t know how to classify (such as cases where it cannot tell whether a face looks happy or sad), but also identify dimensions along which the training data gives no information (such as when your training data never shows outcomes filled with humanshaped automatons that look happy, labeled as worthless).
Of course, ambiguity identification alone isn’t enough: you don’t want a system that spends the first three weeks asking for clarification on whether humans are still worthwhile when they are at different elevations, or when the wind is blowing, before finally (after the operations have stopped paying attention) asking whether it’s important that the humanshaped things be acting of their own will.
In order for an agent to reliably learn our intentions, the agent must be constructing and refining a model of its operator and using that model to inform its queries and alter its preferences. To learn more about these problems and others, see the following:
Soares’ “The value learning problem” provides a general overview of a few open problems related to value learning.
Dewey’s “Learning what to value” further discusses the difficulty of value learning.
The orthogonality thesis argues that value learning will not be solved by default.
MacAskill’s “Normative Uncertainty” provides a framework for discussing normative uncertainty. Be warned, the full work, while containing many insights, is very long. You can get away with skimming parts and/or skipping around some, especially if you’re more excited about other areas of active research.
One approach to resolving normative uncertainty is Bostrom & Ord’s “parliamentary model,” which suggests that value learning is somewhat equivalent to a voter aggregation problem, and that many value learning systems can be modeled as parliamentary voting systems (where the voters are possible utility functions).
Owen CottonBarratt’s “Geometric reasons for normalising…” discusses the normalization of utility functions; this is relevant to toy models of reasoning under moral uncertainty.
Fallenstein & Stiennon’s “Loudness” discusses a concern with aggregating utility functions stemming from the fact that the preferences encoded by utility functions are preserved under positive affine transformation (e.g. as the utility function is scaled or shifted). This implies that special care is required in order to normalize the set of possible functions.
Other tools
Mastery in any subject can be a very powerful tool, especially in the realm of mathematics, where seemingly disjoint topics are actually deeply connected. Many fields of mathematics have the property that if you understand them very very well, then that understanding is useful no matter where you go. With that in mind, while the subjects listed below are not necessary in order to understand MIRI’s active research, an understanding of each of these subjects constitutes an additional tool in the mathematical toolbox that will often prove quite useful when doing new research.
Discrete MathMost math studies either continuous or discrete structures. Many people find discrete mathematics more intuitive, and a solid understanding of discrete mathematics will help you gain a quick handle on the discrete versions of many other mathematical tools, such as group theory, topology, and information theory. Type TheorySet theory commonly serves as the foundation for modern mathematics, but it’s not the only available candidate. Type theory can also serve as a foundation for mathematics, and in many cases, type theory is a better fit for the problems at hand. Type theory also bridges much of the theoretical gap between computer programs and mathematical proofs, and is therefore often relevant to certain types of AI research. Program VerificationProgram verification techniques allow programmers to become confident that a specific program will actually act according to some specification. (It is, of course, still difficult to validate that the specification describes the intended behavior.) While MIRI’s work is not currently concerned with verifying realworld programs, it is quite useful to understand what modern program verification techniques can and cannot do. 
Understanding the mission
Why do this kind of research in the first place? (The first book below is the most important.)
