A Guide to MIRI’s Research
by Nate Soares
Friendly AI theory currently isn’t about implementation, it’s about figuring out how to ask the right questions. Even if we had unlimited finite computing resources and a solid understanding of general intelligence, we still wouldn’t know how to specify a system that would reliably have a positive impact during and after an intelligence explosion. Such is the state of our ignorance.
For now, MIRI’s research program aims to develop solutions that assume access to unbounded finite computing power, not because unbounded solutions are feasible, but in the hope that these solutions will help us understand which questions need to be answered in order to lay the groundwork for the eventual specification of a Friendly AI. Hence, our current research is primarily in mathematics (as opposed to software engineering or machine learning, as many expect).
This guide outlines the topics that one can study to become able to contribute to one or more of MIRI’s active research areas.
How to use this guide
Perhaps the shortest path to being hired as a MIRI researcher is to study the materials below, then attend the nearest MIRIx workshop a few times, then attend a MIRI workshop or two and show an ability to contribute at the cutting edge. The same path (read these materials, then work your way through some workshops) will also help if you want to research these topics at some other institution.
You can learn most of the requisite material by simply reading all of the textbooks and papers below. However, with all of the material in this guide, please do not grind away for the sake of grinding away. If you already know the material, skip ahead. If one of the active research areas fails to capture your interest, switch to a different one. If you don’t like one of the recommended textbooks, find a better one or skip it entirely. The goal is to get yourself to the front lines with a solid understanding of what our research says. Hopefully, this guide can help you achieve that goal, but don’t let it hinder you!
Finally, a note on content: this guide discusses a number of FAI research subfields. The goal is to give an overview of those subfields, rather than to motivate them, for readers who are already extending us charity. We’re hard at work producing a number of documents discussing why these particular subfields are important, but in the meantime, please understand that this guide is neither able nor intended to provide strong motivation for these particular problems.
It’s important to have some basic mathematical understanding before jumping directly into the active research topics. All of our research areas are well-served by a basic understanding of computation, logic, and probability theory. Below are some introductory resources to get you started.
You don’t need to go through this section chronologically. Pick up whatever is interesting, and don’t hesitate to skip back and forth between the research areas and the basics as necessary.
It’s also very important to understand the concept of VNM rationality, which I recommend learning from the Wikipedia article but which can also be picked up from the original book. Von Neumann and Morgenstern showed that any agent obeying a few simple consistency axioms acts with preferences characterizable by a utility function. While many expect that we may ultimately need to abandon VNM rationality in order to construct Friendly agents, the VNM framework remains the most expressive framework we have for characterizing the behavior of sufficiently powerful agents. (For example, see the orthogonality thesis and the instrumental convergence thesis from Bostrom’s "The Superintelligent Will.") The concept of VNM rationality is used throughout all our active research areas.
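For reference, von Neumann and Morgenstern's result can be stated compactly. The notation below ($\preceq$ for the agent's preference over lotteries $L$ and $M$, and $u$ for the utility function) is chosen here for illustration and does not come from this guide.

```latex
% If a preference relation \preceq over lotteries satisfies completeness,
% transitivity, continuity, and independence, then some utility function u
% represents it through expected utility:
\[
L \preceq M \iff \mathbb{E}_{L}[u] \le \mathbb{E}_{M}[u],
\]
% and u is unique up to positive affine transformation:
\[
u'(x) = a\,u(x) + b, \qquad a > 0 .
\]
```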
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
This field of research is basically brand-new, so all it takes in order to get up to speed is to read a paper or two:
Soares et al.’s "Corrigibility" introduces the field at large, along with a few open problems.
Armstrong’s "Proper value learning through indifference" discusses one potential approach for making agents indifferent between which utility function they maximize, which is a small step towards agents that allow themselves to be modified.
Some early work in corrigibility was done in discussions on the web forum LessWrong. Most of the relevant results are captured in the above papers. However, additional historical work in this area can be read in the following blog posts:
- Cake or Death outlines an example of the “motivated value selection” problem. In this example, an agent with uncertainty about its utility function benefits from avoiding information that reduces its uncertainty.
- Utility indifference outlines the original utility indifference idea. It is largely interesting for historical reasons, and is subsumed by Armstrong’s Utility Indifference paper linked above.
Our current work on corrigibility focuses mainly on a small subproblem known as the “shutdown problem”: how do you construct an agent that shuts down upon the press of a shutdown button, and which does not have incentives to cause or prevent the pressing of the button? Within that subproblem, we currently focus on the utility indifference problem: how could you construct an agent which allows you to switch which utility function it maximizes, without giving it incentives to affect whether the switch occurs? Even if we had a satisfactory solution to the utility indifference problem, this would not yield a satisfactory solution to the shutdown problem, as it still seems difficult to adequately specify “shutdown behavior” in a manner that is immune to perverse instantiation.
Stuart Armstrong has written a short series of blog posts about the specification of “reduced impact” AGIs:
- The mathematics of reduced impact: help needed
- Domesticating reduced impact AIs
- Reduced impact AI: no back channels
- Reduced impact in practice: randomly sampling the future
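To make the shutdown-button incentive from the corrigibility discussion concrete, here is a minimal toy sketch. Every action name and number is invented for illustration, and the compensating term is a simplified stand-in for an indifference-style correction, not Armstrong's actual construction.

```python
# Toy model: every number and action name here is a made-up illustration.
# Each action maps to (probability the button is pressed, U_N payoff if not pressed).
ACTIONS = {
    "work": (0.5, 10.0),            # leaves the button alone
    "disable_button": (0.0, 9.0),   # pays a small cost to block the button
}
U_SHUTDOWN = 0.0  # payoff under the shutdown utility function once pressed


def naive_value(action):
    """Expected utility when a press simply swaps U_N for U_S."""
    p_press, u_normal = ACTIONS[action]
    return (1 - p_press) * u_normal + p_press * U_SHUTDOWN


def indifferent_value(action):
    """Same, but the pressed branch receives a compensating term so that
    both branches are worth the same to the agent (a simplified,
    hypothetical indifference-style correction)."""
    p_press, u_normal = ACTIONS[action]
    compensation = u_normal - U_SHUTDOWN
    return (1 - p_press) * u_normal + p_press * (U_SHUTDOWN + compensation)


naive_choice = max(ACTIONS, key=naive_value)
indifferent_choice = max(ACTIONS, key=indifferent_value)
print(naive_choice, indifferent_choice)
```

In this toy, the naive agent pays to disable the button, while the compensated agent gains nothing by tampering with it. Nothing here addresses the harder question of how to specify the shutdown behavior itself.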
An agent undergoing an intelligence explosion may need to execute many self-modifications to its core algorithms and agent architecture, one after the next. Even if the agent at the beginning of this process functioned exactly as planned, if it made a single crucial mistake in choosing one of these rewrites, the end result might be very far from the intended one.
In order to prevent this, we expect that a Friendly system would need some way to limit itself to executing self-modifications only after it has gained extremely high confidence that the resulting system would still be Friendly. Self-confidence of this form, done naively, runs afoul of mathematical problems of self-reference, and it currently seems that formal systems which can gain high self-confidence must walk a fine line between self-trust and unsoundness.
(What do we mean by “high confidence”, “self-confidence”, “self-trust”, and “formal systems”? We don’t quite know yet. Part of the problem is figuring out how to formalize these intuitive concepts in a way that avoids Gödelian pitfalls.)
The study of tiling agents is the study of agents which are able to self-modify in a highly reliable way, specifically via the study of formal systems that can gain some form of confidence in similar systems.
Enderton, A Mathematical Introduction to Logic
MIRI’s existing toy models for studying tiling agents are largely based on first order logic. Understanding this logic and its nuances is crucial in order to understand the existing tools we have developed for studying formal systems capable of something approaching confidence in similar systems.
Yudkowsky & Herreshoff, "Tiling Agents for Self-Modifying AI"
This paper introduces the field of tiling agents, and some of the toy formalisms and partial solutions that MIRI has developed thus far. The paper is a little choppy, but my walkthrough might help make it go down easier.
Yudkowsky, "The procrastination paradox"
The Löbian obstacle (a problem stemming from too little “self trust”) described in the tiling agents paper turns out to be only half the problem: many solutions to the Löbian obstacle run afoul of unsoundnesses that come from too much self-trust. Satisfactory solutions will need to walk a fine line between these two problems.
Christiano et al., "Definability of Truth in Probabilistic Logic"
This describes an early attempt to create a formal system that can reason about itself while avoiding paradoxes of self-reference. It succeeds, but has ultimately been shown to be unsound. My walkthrough for this paper may help put it into a bit more context.
Fallenstein & Soares, "Problems of self-reference in self-improving space-time embedded intelligence"
This describes our simple suggester-verifier model used for studying tiling agents, and demonstrates a toy scenario in which sound agents can successfully tile to (i.e., gain high confidence in) other similar agents.
If you’re really excited about this research topic, there are a number of other relevant tech reports. Unfortunately, most of them don’t explain their motivations well, and have not yet been put into the greater context.
Fallenstein’s "Procrastination in Probabilistic Logic" illustrates how Christiano et al.’s probabilistic reasoning system is unsound and vulnerable to the procrastination paradox.
Fallenstein’s "Decreasing mathematical strength…" describes one unsatisfactory property of Parametric Polymorphism, a partial solution to the Löbian obstacle.
Soares’ "Fallenstein’s monster" describes a hackish formal system which avoids the above problem. It also showcases a mechanism for restricting an agent’s goal predicate which can also be used by Parametric Polymorphism to create a less restrictive version of PP than the one explored in the tiling agents paper.
Fallenstein’s "An infinitely descending sequence of sound theories…" describes a more elegant partial solution to the Löbian obstacle, which is now among our favored partial solutions.
Yudkowsky’s "Distributions allowing tiling…" takes some early steps towards probabilistic tiling settings.
An understanding of recursive ordinals provides a useful context from which to understand these results, and can be gained by reading Franzén’s "Transfinite progressions: a second look at completeness."
Imagine a black box, with one input chute and two output chutes. A ball can be put into the input chute, and it will come out of one of the two output chutes. Inside the black box is a Rube Goldberg machine which takes the ball from the input chute to one of the output chutes. A perfect probabilistic reasoner can be uncertain about which output chute will take the ball, but only insofar as they are uncertain about which machine is inside the black box: it is assumed that if they knew the machine (and how it worked) then they would know which chute the ball would come out of. It is assumed that probabilistic reasoners are logically omniscient, that they know all logical consequences of the things they know.
In reality, we are not logically omniscient: we can know precisely which machine the box implements and precisely how the machine works, and just not have the time to deduce where the ball comes out. We reason under logical uncertainty. A formal understanding of reasoning under logical uncertainty does not yet exist, but seems necessary in the construction of a safe artificial intelligence. (Self-modification involves reasoning about the unknown output of two known programs; it seems difficult to gain confidence in any reasoning system intended to do this sort of reasoning under logical uncertainty before gaining a formal understanding of idealized reasoning under logical uncertainty.)
Unfortunately, the field of logical uncertainty is not yet well-understood, and I am not aware of good textbooks introducing the material. A solid understanding of probability theory is a must; consider augmenting the first few chapters of Jaynes with Feller, chapters 1, 5, 6, and 9.
An overview of the subject can be gained by reading the following papers.
Gaifman, "Concerning measures in first-order calculi." Gaifman started looking at this problem many years ago, and has largely focused on a relevant subproblem, which is the assignment of probabilities to different models of a formal system (assuming that once the model is known, all consequences of that model are known). We are now attempting to expand this approach to a more complete notion of logical uncertainty (where a reasoner can know what the model is but not know the implications of that model), but work by Gaifman is still useful to gain a historical context and an understanding of the difficulties surrounding logical uncertainty. See also
Gaifman & Snir, "Probabilities over rich languages…"
Hutter et al., "Probabilities on sentences in an expressive logic" largely looks at the problem of logical uncertainty assuming access to infinite computing power (and many levels of halting oracles). Again, we take a slightly different approach, asking how an idealized reasoner should handle logical uncertainty given unlimited but finite amounts of computing power. Nevertheless, understanding Hutter’s approach (and what can be done with infinite computing power) helps flesh out one’s understanding of where the difficult questions lie.
Demski, "Logical prior probability" provides an approximately computable logical prior. Following Demski, our work largely focuses on the creation of a prior probability distribution over logical sentences, in the hopes that understanding the creation of logical priors will lead us to a better understanding of how they could be updated, and from there a better understanding of logical uncertainty more generally.
Christiano, "Non-omniscience, probabilistic inference, and metamathematics" largely follows this approach. This paper provides some early practical considerations about the generation of logical priors, and highlights a few open problems.
We do not yet understand a decision algorithm which would, given access to unlimited finite computing power and an arbitrarily accurate world-model, always take the best available action. Intuitively, specifying such an algorithm (with respect to some VNM-rational set of preferences) may seem easy: simply loop through available actions and evaluate the expected utility achieved by taking that action, and then choose the action that yields the highest expected utility. In practice, however, this is quite difficult: the algorithm is in fact going to choose only one of the available actions, and evaluating what “would have” happened if the agent had instead taken a different action that it “could have” taken requires a formalization of “would” and “could”: what does it mean to say that a deterministic algorithm “could have had” a different output, and how is this circumstance (which runs counter to the laws of logic and/or physics) evaluated?
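The "easy-seeming" loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `outcome_model`, `toy_model`, and the payoff numbers are hypothetical, and the sketch sidesteps the real difficulty by letting `outcome_model` simply assert what "would" happen under each action.

```python
def expected_utility(action, outcome_model, utility):
    """Probability-weighted utility of the outcomes predicted for `action`."""
    return sum(p * utility(o) for o, p in outcome_model(action).items())


def naive_choose(actions, outcome_model, utility):
    """Loop over the actions and return the one with highest expected utility."""
    return max(actions, key=lambda a: expected_utility(a, outcome_model, utility))


# Hypothetical toy problem. The hard part is hidden inside `toy_model`, which
# takes for granted a well-defined answer to "what would happen if I did X?".
def toy_model(action):
    table = {
        "safe": {"small_prize": 1.0},
        "risky": {"big_prize": 0.3, "nothing": 0.7},
    }
    return table[action]


toy_utility = {"small_prize": 5.0, "big_prize": 10.0, "nothing": 0.0}.get
choice = naive_choose(["safe", "risky"], toy_model, toy_utility)
print(choice)
```

The loop itself is trivial. What remains unspecified is where a trustworthy `outcome_model` comes from when the agent is a deterministic algorithm that will in fact take only one of the actions.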
Solving this problem will require a better understanding of counterfactual reasoning; this is the domain of decision theory. Modern decision theories do not provide satisfactory methods for counterfactual reasoning, and are insufficient for use in a superintelligence. Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they fail systematically on certain classes of problems) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not in fact fix all of their flaws): see my talk "Why ain’t you rich?"
We are currently in the process of writing up an introduction to decision theory as an FAI problem. In the interim, I suggest the following resources in order to understand MIRI’s decision theory research:
Hintze’s "Problem class dominance in predictive dilemmas" contrasts four different normative decision theories: CDT, EDT, TDT, and UDT, and argues that UDT dominates the others on a certain class of decision problems.
Several posts by Yudkowsky and Soares explain why causal counterfactual reasoning is not sufficient for use in an intelligent agent: "Newcomb’s problem and the regret of rationality," "Causal decision theory is unsatisfactory," "An introduction to Newcomblike problems," "Newcomblike problems are the norm."
Alternative decision theories have been developed which are by no means solutions, but which constitute progress. The most promising of these is Updateless Decision Theory, developed by Wei Dai and Vladimir Slepnev among others:
Dai’s "Towards a New Decision Theory" introduces UDT.
Slepnev’s "A model of UDT with a halting oracle" provides an early first formalization.
Fallenstein’s alternative formalization provides a probabilistic formalization.
UDT has a number of problems of its own. Unfortunately, satisfactory write-ups detailing these problems do not yet exist. Two of the open problems have been outlined in blog posts by Vladimir Slepnev:
"An example of self-fulfilling spurious proofs in UDT" explains how UDT can achieve sub-optimal results due to spurious proofs.
"Agent simulates predictor" describes a strange problem wherein it seems as if agents are rewarded for having less intelligence.
A somewhat unsatisfactory solution is discussed in the following write-up by Tsvi Benson-Tilsen:
"UDT with known search order" contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.
In order to study multi-agent settings, Patrick LaVictoire has developed a modal agents framework, which has also allowed us to make some novel progress in the field of decision theory. To understand this, you’ll first need to understand provability logic:
In logical toy models of agents reflecting upon systems similar to themselves, the central question is what the parent system can prove about the child system. Our Tiling Agent research makes heavy use of provability logic, which can elegantly express these problems.
This should be sufficient to help you understand the modal agents framework:
Barasz et al.’s "Robust cooperation in the Prisoner’s dilemma": roughly, this allows us to consider agents which decide whether or not to cooperate with each other based only upon what they can prove about each other’s behavior. This prevents infinite regress, and in fact, the behavior of two agents which act only according to what they can prove about the behavior of the other can be determined in quadratic time using results from provability logic.
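As a rough illustration of how such behavior can be computed without infinite regress, the sketch below evaluates very simple agents on a finite linear Kripke frame, where "provably φ" holds at a world exactly when φ holds at every lower world. It assumes agents whose action depends only on whether the opponent provably cooperates with them; the function names and the `depth` parameter are hypothetical choices for this sketch, not the paper's own code.

```python
def fairbot(opponent_history):
    """Cooperate iff the opponent cooperates at every lower world,
    i.e. iff 'opponent cooperates with me' is provable in this model."""
    return all(opponent_history)


def defectbot(opponent_history):
    """Never cooperate, whatever can be proved."""
    return False


def cooperatebot(opponent_history):
    """Always cooperate, whatever can be proved."""
    return True


def play(agent_a, agent_b, depth=10):
    """Evaluate both agents world by world; True means 'cooperate'.

    For these simple agents the truth values stabilize after a few worlds,
    so the values at the last world are taken as the actual actions.
    """
    a_history, b_history = [], []
    for _ in range(depth):
        a_now = agent_a(b_history)  # sees only b's values at lower worlds
        b_now = agent_b(a_history)  # sees only a's values at lower worlds
        a_history.append(a_now)
        b_history.append(b_now)
    return a_history[-1], b_history[-1]


print(play(fairbot, fairbot))
print(play(fairbot, defectbot))
```

In this sketch two copies of `fairbot` end up cooperating with each other, while `fairbot` defects against `defectbot`, so it is not exploited.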
Many open problems in decision theory involve multi-agent settings, and in order to contribute to cutting-edge research it is also important to understand game theory. I have heard good things about the following textbook, but have not read it myself:
Tadelis’ Game Theory: An Introduction.
You also may have luck with Yvain’s Game Theory sequence on LessWrong.
Perhaps the most promising approach for loading values into a powerful AI is to specify a criterion for learning what to value. While this problem dominates the public mindspace with regard to Friendly AI problems (if you could build an FAI, what would you have it do?), we actually find that it is somewhat less approachable than many other important problems (how do you build something stable, how do you verify its decision-making behavior, etc.). That said, a number of papers on value learning exist, and can be used to understand the current state of value learning:
Dewey’s "Learning what to value" discusses the difficulty of the problem.
The orthogonality thesis further motivates why the problem will not be solved by default.
One approach to value learning is Bostrom & Ord’s "parliamentary model," which suggests that value learning is somewhat equivalent to a voter aggregation problem, and that many value learning systems can be modeled as parliamentary voting systems (where the voters are possible utility functions).
MacAskill’s "Normative Uncertainty" provides a framework for discussing normative uncertainty. Be warned, the full work, while containing many insights, is very long. You can get away with skimming parts and/or skipping around some, especially if you’re more excited about other areas of active research.
Fallenstein & Stiennon’s "Loudness" discusses a concern with aggregating utility functions stemming from the fact that the preferences encoded by utility functions are preserved under positive affine transformation (e.g. as the utility function is scaled or shifted). This implies that special care is required in order to normalize the set of possible utility functions.
Owen Cotton-Barratt’s "Geometric reasons for normalising…" discusses the normalization of utility functions.
De Blanc’s "Ontological crises in artificial agents’ value systems" discusses a separate problem in the space of value learning: how are values retained as the system’s model of reality changes drastically? It seems likely that explicit resolution mechanisms will be required, but it is not yet clear how to have an agent learn values in a manner that is robust to ontological shifts.
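The concern about aggregating utility functions raised in the "Loudness" entry above can be seen in a small example. The sketch below uses invented options and numbers: rescaling one utility function by a positive affine transformation leaves that function's own preferences unchanged, yet flips the result of a naive sum-based aggregate.

```python
OPTIONS = ["a", "b", "c"]

# Two hypothetical utility functions over the same three options.
u1 = {"a": 1.0, "b": 0.0, "c": 0.6}
u2 = {"a": 0.0, "b": 1.0, "c": 0.7}


def rescale(u, scale, shift):
    """Apply a positive affine transformation: x -> scale * u(x) + shift."""
    assert scale > 0
    return {x: scale * v + shift for x, v in u.items()}


def favorite(u):
    """The option a single utility function ranks highest."""
    return max(OPTIONS, key=u.get)


def naive_aggregate(utility_functions):
    """Pick the option with the highest unnormalized sum of utilities."""
    return max(OPTIONS, key=lambda x: sum(u[x] for u in utility_functions))


loud_u1 = rescale(u1, 10.0, 3.0)  # same preferences as u1, just "louder"

print(favorite(u1), favorite(loud_u1))
print(naive_aggregate([u1, u2]), naive_aggregate([loud_u1, u2]))
```

Here `u1` and `loud_u1` agree on every comparison, but the unnormalized sum switches from the compromise option to the favorite of the rescaled function. This is why some normalization step is needed before aggregating.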
How should an agent treat itself as if it is a part of the world? How should it learn as if it (and its sensors and its memory) are embedded in the environment, rather than sitting outside the environment? How can an agent make choices when its world-model stops modeling its own action as a fundamentally basic causal node and starts modeling it as a deterministic process resulting from a collection of transistors following physics? Many narrow AI systems assume an agent/environment separation, and we still have some confusion surrounding the nature of learners and actors that treat themselves as part of their environment.
We’ve been referring to this as the problem of “naturalized induction.” While there has been little research done in this space, here is some reading that can help you better understand the problems:
Bensinger, "Naturalized induction" (series)
Mastery in any subject can be a very powerful tool, especially in the realm of mathematics, where seemingly disjoint topics are actually deeply connected. Many fields of mathematics have the property that if you understand them very very well, then that understanding is useful no matter where you go. With that in mind, while the subjects listed below are not necessary in order to understand MIRI’s active research, an understanding of each of these subjects constitutes an additional tool in the mathematical toolbox that will often prove quite useful when doing new research.
Most math studies either continuous or discrete structures. Many people find discrete mathematics more intuitive, and a solid understanding of discrete mathematics will help you gain a quick handle on the discrete versions of many other mathematical tools such as group theory, topology, and information theory.
Set theory commonly serves as the foundation for modern mathematics, but it’s not the only available foundation. Type theory can also serve as a foundation for mathematics, and in many cases, type theory is a better fit for the problems at hand. Type theory also bridges much of the theoretical gap between computer programs and mathematical proofs, and is therefore often relevant to certain types of AI research.
Program verification techniques allow programmers to become confident that a specific program will actually act according to some specification. (It is, of course, still difficult to validate that the specification describes the intended behavior.) While MIRI’s work is not currently concerned with verifying real-world programs, it is quite useful to understand what modern program verification techniques can and cannot do.
Understanding the mission
Why do this kind of research in the first place? (The first book below is the most important.)