A Guide to MIRI’s Research
by Nate Soares
Update June 2022: As noted in the 2019 update below, this research guide has only been lightly updated since 2015. We’re also currently doing less hiring (though not zero hiring), and are not currently running AIRCS workshops (though we may run more in the future).
If you’re interested in contributing to work on the alignment problem, we recommend starting with the Alignment Research Field Guide, How To Get Into Independent Research On Alignment/Agency, and the resources on the Late 2021 MIRI Conversations page.
If you have additional questions about how to get involved, we recommend contacting Buck Shlegeris of Redwood Research or posting on LessWrong.
Update March 2019: This research guide has been only lightly updated since 2015. Our new recommendation for people who want to work on the AI alignment problem is:
- If you have a computer science or software engineering background: Apply to attend our new workshops on AI risk and to work as an engineer at MIRI. For this purpose, you don’t need any prior familiarity with our research.
- If you aren’t sure whether you’d be a good fit for an AI risk workshop, or for an engineer position, shoot us an email and we can talk about whether it makes sense.
- You can find out more about our engineering program in our 2018 strategy update.
- If you’d like to learn more about the problems we’re working on (regardless of your answer to the above): See “Embedded Agency” for an introduction to our agent foundations research, and see our Alignment Research Field Guide for general recommendations on how to get started in AI safety.
- After checking out those two resources, you can use the links and references in “Embedded Agency” and on this page to learn more about the topics you want to drill down on. If you want a particular problem set to focus on, we suggest Scott Garrabrant’s “Fixed Point Exercises.” As Scott notes:
Sometimes people ask me what math they should study in order to get into agent foundations. My first answer is that I have found the introductory class in every subfield to be helpful, but I have found the later classes to be much less helpful. My second answer is to learn enough math to understand all fixed point theorems.
- If you want people to collaborate and discuss with, we suggest starting or joining a MIRIx group, posting on LessWrong, applying for our AI Risk for Computer Scientists workshops, or otherwise letting us know you’re out there.
If humans are to develop smarter-than-human artificial intelligence that has a positive impact, we must meet three formidable challenges. First, we must design smarter-than-human systems that are highly reliable, so that we can justify confidence that the system will fulfill the specified goals or preferences. Second, the designs must be error-tolerant, so that the systems are amenable to online modification and correction in the face of inevitable human error. Third, the system must actually learn beneficial goals or preferences.
MIRI’s current research program focuses on understanding how to meet these challenges in principle. There are aspects of reliable reasoning that we do not yet understand even in theory; there are questions of bounded rationality that we could not yet solve even in simplified settings. Our study focuses on finding solutions in simplified settings, as a first step. As such, our modern research looks much more like pure mathematics than software engineering or practical machine learning.
This guide briefly overviews our research priorities, and provides resources that will help you get to the cutting edge on each subject area. This guide is not intended to justify these research topics; for further motivation of our approach, refer to the article “MIRI’s Approach”, or to our technical agenda and supporting papers.
Note (Sep 2016): This research guide is based around our Agent Foundations agenda. As of 2016, we also have a machine learning focused agenda. Refer to that document for more information about research directions that we think are promising, and which are not covered by this guide.
How to Use This Guide
This guide is intended for aspiring researchers who are not yet well-versed in the relevant subject areas. If you are already an AI professional or a seasoned mathematician, consider skipping to our existing publications instead. (Our technical agenda is a fine starting point.) This guide is geared towards students who are wondering what to study if they want to become MIRI researchers in the future, and toward professionals in other fields who want to get up to speed on our work.
Researchers generally end up joining our team by one of two paths. The first is to attend a MIRI workshop and build a relationship with us in person. You can use this form to apply to attend a research workshop. Be warned that there is often quite a bit of time between workshops, and that they have limited capacity.
The second path is to make some progress on our research agenda independently and let us know about your results. You can use our online form to apply for assistance or input on your work, but the fastest way to start contributing is to read posts on the Intelligent Agent Foundations Forum (IAFF), note the open problems people are working on, and solve one. You can then post your results as a link on the forum.
(Update March 2019: LessWrong and the AI Alignment Forum are now our go-to venues for public discussion of AI alignment problems, superseding IAFF. See the top of this post for other updates to the suggestions in this section.)
The primary purpose of the research forum is for researchers who are already on the same page to discuss unpolished partial results. As such, posts on the forum can be quite opaque. This research guide can help you get up to speed on the open problems being discussed on the IAFF. It can also help you develop the skills necessary to qualify for a workshop, or find ways to work on open problems in AI alignment at other institutions.
This guide begins with recommendations for basic subjects that it’s important to understand before attempting this style of research, such as probability theory. After that, it’s broken into a series of topic areas, with links to papers that will catch you up to the state of the art in that area.
This is not a linear guide: if you want to become a MIRI researcher, I recommend first making sure that you understand the basics, and then choosing one topic that interests you and going into depth in that area. Once you understand one topic well, you’ll be ready to try contributing in that topic area on the IAFF.
With all of the material in this guide, please do not grind away for the sake of grinding away. If you already know the material, skip ahead. If one of the active research areas fails to capture your interest, switch to a different one. If you don’t like one of the recommended textbooks, find a better one or skip it entirely. This guide should serve as a tool for figuring out where you can contribute, not as an obstacle to that goal.
The Basics
It’s important to have some fluency with elementary mathematical concepts before jumping directly into our active research topics. All of our research areas are well-served by a basic understanding of computation, logic, and probability theory. Below are some resources to get you started.
You don’t need to read the books in this section in the order listed. Pick up whatever is interesting, and don’t hesitate to skip back and forth between the research areas and the basics as necessary.
- Statistics: Fluency with statistical modeling will be helpful for contributing to our “Alignment for Advanced Machine Learning” research agenda. Some prior familiarity with probabilistic reasoning is a good idea here.
- Machine Learning: To develop a practical familiarity with machine learning, we highly recommend Andrew Ng’s Coursera course (lecture notes here). For a more theoretical introduction to ML, try Understanding Machine Learning.
It’s also important to understand the concept of VNM rationality, which I recommend learning from the Wikipedia article but which can also be picked up from the original book. Von Neumann and Morgenstern showed that any agent obeying a few simple consistency axioms acts with preferences characterizable by a utility function. While some expect that we may ultimately need to abandon VNM rationality in order to construct reliable intelligent agents, the VNM framework remains the most expressive framework we have for characterizing the behavior of arbitrarily powerful agents. (For example, see the orthogonality thesis and the instrumental convergence thesis from Bostrom’s “The Superintelligent Will.”) The concept of VNM rationality is used throughout all our active research areas.
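For reference, the core of the VNM result can be stated compactly (this is the standard statement, not anything specific to MIRI’s work): if an agent’s preferences over lotteries satisfy completeness, transitivity, continuity, and independence, then there is a utility function u such that, for all lotteries A and B,

    $$A \succeq B \;\Longleftrightarrow\; \mathbb{E}[u(A)] \ge \mathbb{E}[u(B)],$$

and u is unique up to positive affine transformation (replacing u with au + b for a > 0 changes nothing). The uniqueness clause matters later, for example in the discussion of utility normalization under “Value Learning” below.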
Realistic World-Models
Formalizing beneficial goals does you no good if your smarter-than-human system is unreliable. There are aspects of good reasoning that we don’t yet understand, even in principle. It is likely possible to gain insight by building practical systems that use algorithms which seem to work, even if the reasons why they work are not yet well-understood: often, theoretical understanding follows in the wake of practical application. However, we consider this approach imprudent when designing systems that have the potential to become superintelligent: we will be safer if we have a theory of general intelligence on hand before attempting to create practical superintelligent systems.
For this reason, many of our active research topics focus on parts of general intelligence that we do not yet understand how to solve, even in principle. For example, consider the following problem:
I have a computer program, known as the “universe.” One function in the universe is undefined. Your job is to provide me with a computer program of the appropriate type to complete my universe program. Then, I’ll run my universe program. My goal is to score your agent according to how well it learns what the original universe program is.
How could I do this? Solomonoff’s theory of inductive inference sheds some light on a theoretical solution: it describes a method for making ideal predictions from observations, but only in the case where the predictor lives outside the environment. Solomonoff induction has led to many useful tools for thinking about inductive inference (including Kolmogorov complexity, the universal prior, and AIXI), but the problem becomes decidedly more difficult in the case where the agent is a subprocess of the universe, computed by the universe.
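For concreteness, the universal prior at the heart of Solomonoff induction can be written, in one standard formulation (glossing over details about exactly which programs are counted), as

    $$M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-|p|},$$

where U is a universal prefix Turing machine and |p| is the length of program p: hypotheses are weighted by the shortness of the programs that produce them, and predictions about how x continues are made in proportion to this weight. Altair’s article below unpacks this in detail; the point for our purposes is that the whole construction presumes a predictor standing outside the process it is predicting.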
In the case where the agent is embedded inside the environment, the induction problem gets murky: what counts as “learning the universe program”? Against what distribution over environments should the agent be scored? What constitutes ideal induction in the case where the boundary between “agent” and “environment” becomes blurry? These are questions of “naturalized induction.”
Soares’ “Formalizing two problems of realistic world-models” further motivates problems of naturalized induction as relevant to the construction of a theory of general intelligence.
Altair’s “An Intuitive Explanation of Solomonoff Induction” explains Solomonoff’s theory of inductive inference, which is important background knowledge when it comes to understanding open problems of naturalized induction.
Bensinger’s “Naturalized induction” (series) explores questions of naturalized induction in more detail.
Solving problems of naturalized induction requires gaining a better understanding of realistic world-models: What is the set of “possible realities”? What sort of priors about the environment would an ideal agent use? Answers to these questions must not only allow good reasoning, they must allow for the specification of human goals in terms of those world-models.
For example, in Solomonoff induction (and in Hutter’s AIXI), Turing machines are used to model the environment. Pretend that the only thing we value is diamonds (carbon atoms covalently bound to four other carbon atoms). Now, say I give you a Turing machine. Can you tell me how much diamond is within?
In order to design an agent that pursues goals specified in terms of its world models, the agent must have some way of identifying the ontology of our goals (carbon atoms) inside its world models (Turing machines). This “ontology identification” problem is discussed in “Formalizing Two Problems of Realistic World Models” (linked above), and was first introduced by De Blanc:
De Blanc’s “Ontological crises in artificial agents’ value systems” asks how one might make an agent’s goals robust to changes in ontology. If the agent starts with an atomic model of physics (where carbon atoms are ontologically basic) then this may not be hard. But what happens when the agent builds a nuclear model of physics (where atoms are constructed from neutrons and protons)? If the “carbon recognizer” was hard-coded, the agent might fail to identify any carbon in this new world-model, and may start acting strangely (in search of hidden “true carbon”). How could the agent be designed so that it can successfully identify “six-proton atoms” with “carbon atoms” in response to this ontological crisis?
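To make this failure mode concrete, here is a minimal toy sketch (purely illustrative; the encodings and function names are invented for this guide and appear in no paper):

    # A hard-coded "carbon recognizer" written against an atomic ontology,
    # applied to a world-model that has since been refined to a nucleon-level
    # ontology. The old recognizer silently finds no "true carbon" at all.

    def count_carbon_atomic(world):
        # world: a list of atoms, each labelled by element name
        return sum(1 for atom in world if atom == "carbon")

    def count_carbon_nuclear(world):
        # world: a list of nuclei, each a dict of particle counts
        return sum(1 for nucleus in world if nucleus.get("protons") == 6)

    atomic_world = ["carbon", "oxygen", "carbon"]
    nuclear_world = [{"protons": 6, "neutrons": 6}, {"protons": 8, "neutrons": 8}]

    print(count_carbon_atomic(atomic_world))    # 2
    print(count_carbon_atomic(nuclear_world))   # 0 -- the hard-coded recognizer fails
    print(count_carbon_nuclear(nuclear_world))  # 1 -- what ontology identification should recover

The hard part is not writing the second recognizer by hand; it is specifying an agent that would make the analogous update itself, for ontological shifts its programmers did not anticipate.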
Legg and Hutter’s “Universal Intelligence: A Definition of Machine Intelligence” describes AIXI, a universally intelligent agent in settings where the agent is separate from the environment, and a “scoring metric” used to rate the intelligence of various agent programs in this setting. Hutter’s AIXI and Legg’s scoring metric are very similar in spirit to what we are looking for in response to problems of naturalized induction and ontology identification. The two differences are that AIXI lives in a universe where agent and environment are separated whereas naturalized induction requires a solution where the agent is embedded within the environment, and AIXI maximizes rewards specified in terms of observations whereas we desire a solution that optimizes rewards specified in terms of the outside world.
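Legg and Hutter’s scoring metric has a compact form: the “universal intelligence” of a policy π is its expected performance summed over all computable environments, weighted by simplicity,

    $$\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu},$$

where E is the class of computable environments, K(μ) is the Kolmogorov complexity of environment μ, and V^π_μ is the expected total reward the policy π obtains in μ. AIXI can be viewed as a policy maximizing this measure; note again that both the environment class and the reward signal live on the far side of an agent/environment boundary.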
You can learn more about AIXI in Hutter’s book Universal Artificial Intelligence, although reading Legg’s paper (linked above) is likely sufficient for our purposes.
Decision Theory
Say I give you the following: (1) a computer program describing a universe; (2) a computer program describing an agent; (3) a set of actions available to the agent; (4) a set of preferences specified over the history of states that the universe has been in. I task you with identifying the best action available to the agent, with respect to those preferences. For example, your inputs might be:
    def Universe():
        territory = {One: Lo, Two: Med, Three: Hi}
        return territory[Agent()]

    def Agent():
        worldmodel = {Lo: One, Hi: Two, Med: Three}
        return worldmodel[Hi]

    actions = {One, Two, Three}
    outcomes = {Lo, Med, Hi}

    Hi > Med > Lo
(Notice how the agent is embedded in the environment.) This is another question that we don’t know how to answer, even in principle. It may seem easy: just iterate over each action, figure out which outcome the agent would get if it took that action, and then pick the action that leads to the best outcome. But as a matter of fact, in this thought experiment, the agent is a deterministic subprocess of a deterministic computer program: there is exactly one action that the agent is going to output, and asking what “would happen” if a deterministic part of a deterministic program did something that it doesn’t do is ill-defined.
In order to evaluate what “would happen” if the agent took a different action, a “counterfactual environment” (where the agent does something that it doesn’t) must be constructed. Satisfactory theories of counterfactual reasoning do not yet exist. We don’t yet understand how to identify the best action available to an agent embedded within its environment, even in theory, even given full knowledge of the universe and our preferences and given unlimited computing power.
Solving this problem will require a better understanding of counterfactual reasoning; this is the domain of decision theory.
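As a concrete illustration of why this is subtle, here is a minimal sketch (illustrative only, not a proposed solution) of the naive approach applied to the toy problem above. It “evaluates” each action by substituting a constant agent into a parameterized copy of the universe and rerunning it; the substitution step is exactly the counterfactual reasoning a satisfactory decision theory would need to justify, since the real Agent() is a deterministic subprocess that only ever outputs one action.

    Lo, Med, Hi = "Lo", "Med", "Hi"
    One, Two, Three = "One", "Two", "Three"
    actions = [One, Two, Three]
    rank = {Lo: 0, Med: 1, Hi: 2}  # encodes the preference Hi > Med > Lo

    def Universe(agent):
        # Same universe as above, but parameterized by the agent so that
        # hypothetical constant agents can be swapped in.
        territory = {One: Lo, Two: Med, Three: Hi}
        return territory[agent()]

    def naive_best_action():
        # "What would happen if the agent output a?" is answered here by brute
        # substitution -- the step whose justification is at issue.
        scores = {a: rank[Universe(lambda a=a: a)] for a in actions}
        return max(scores, key=scores.get)

    print(naive_best_action())  # Three, which leads to Hi

This happens to give a sensible answer here, but naive substitution breaks down in settings where the environment contains predictors or copies of the agent, which is where the decision-theoretic puzzles below come from.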
- Decision Theory: Peterson’s textbook explains the field of normative decision theory in broad strokes. For a quicker survey, with a stronger focus on Newcomblike problems, see Muehlhauser’s “Decision theory FAQ.”
- Game Theory: Many open problems in decision theory involve multi-agent settings. I have heard good things about Tadelis’ textbook, but have not read it myself. You also may have luck with Scott Alexander’s “Introduction to game theory” on LessWrong.
Existing methods of counterfactual reasoning turn out to be unsatisfactory both in the short term (in the sense that they systematically achieve poor outcomes on some problems where good outcomes are possible) and in the long term (in the sense that self-modifying agents reasoning using bad counterfactuals would, according to those broken counterfactuals, decide that they should not fix all of their flaws). My talk “Why ain’t you rich?” briefly touches upon both these points. To learn more, I suggest the following resources:
Soares & Fallenstein’s “Toward idealized decision theory” serves as a general overview, and further motivates problems of decision theory as relevant to MIRI’s research program. The paper discusses the shortcomings of two modern decision theories, and discusses a few new insights in decision theory that point toward new methods for performing counterfactual reasoning.
If “Toward idealized decision theory” moves too quickly, this series of blog posts may be a better place to start:
Yudkowsky’s “The true Prisoner’s Dilemma” explains why cooperation isn’t automatically the ‘right’ or ‘good’ option.
Soares’ “Causal decision theory is unsatisfactory” uses the Prisoner’s Dilemma to illustrate the importance of non-causal connections between decision algorithms.
Yudkowsky’s “Newcomb’s problem and regret of rationality” argues for focusing on decision theories that ‘win,’ not just on ones that seem intuitively reasonable. Soares’ “Introduction to Newcomblike problems” covers similar ground.
Soares’ “Newcomblike problems are the norm” notes that human agents probabilistically model one another’s decision criteria on a routine basis.
MIRI’s research has led to the development of “Updateless Decision Theory” (UDT), a new decision theory which addresses many of the shortcomings discussed above.
Hintze’s “Problem class dominance in predictive dilemmas” summarizes UDT’s dominance over other known decision theories, including Timeless Decision Theory (TDT), another theory that dominates CDT and EDT.
Fallenstein’s “A model of UDT with a concrete prior over logical statements” provides a probabilistic formalization.
However, UDT is by no means a solution, and has a number of shortcomings of its own, discussed in the following places:
Slepnev’s “An example of self-fulfilling spurious proofs in UDT” explains how UDT can achieve sub-optimal results due to spurious proofs.
Benson-Tilsen’s “UDT with known search order” is a somewhat unsatisfactory solution. It contains a formalization of UDT with known proof-search order and demonstrates the necessity of using a technique known as “playing chicken with the universe” in order to avoid spurious proofs.
In order to study multi-agent settings, Patrick LaVictoire has developed a modal agents framework, which has also allowed us to use provability logic to make some novel progress in the field of decision theory:
Barasz et al.’s “Robust cooperation in the Prisoner’s Dilemma” allows us to consider agents which decide whether or not to cooperate with each other based only upon what they can prove about each other’s behavior. This prevents infinite regress; in fact, the behavior of two agents which act only according to what they can prove about the behavior of the other can be determined in quadratic time using results from provability logic.
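The paper’s simplest example, stated roughly, is the agent FairBot, which cooperates with an opponent X exactly when it can prove (in some fixed formal system, say Peano Arithmetic) that X cooperates back:

    $$\mathrm{FairBot}(X) = C \;\Longleftrightarrow\; \mathrm{PA} \vdash \ulcorner X(\mathrm{FairBot}) = C \urcorner.$$

By Löb’s theorem, two FairBots playing each other cooperate, even though naive intuition suggests an unresolvable regress of “I’ll cooperate if you’ll cooperate if I’ll cooperate…”.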
UDT was developed by Wei Dai and Vladimir Slepnev, among others. Dai’s “Towards a new decision theory” introduced the idea, and Slepnev’s “A model of UDT with a halting oracle” provided an early first formalization. Slepnev also described a strange problem with UDT wherein it seems as if agents are rewarded for having less intelligence, in “Agent simulates predictor”.
These blog posts are of historical interest, but nearly all of their content is in ”Toward idealized decision theory”, above.
Logical Uncertainty
Imagine a black box, with one input chute and two output chutes. A ball can be put into the input chute, and it will come out of one of the two output chutes. Inside the black box is a Rube Goldberg machine which takes the ball from the input chute to one of the output chutes.
A perfect probabilistic reasoner who doesn’t know which Rube Goldberg machine is in the box doesn’t know how the box will behave, but if they could figure out which machine is inside the box, then they would know which chute would take the ball. This reasoner is environmentally uncertain.
A realistic reasoner might know which machine is in the box, and might know exactly how the machine works, but may lack the deductive capability to figure out where the machine will drop the ball. This reasoner is logically uncertain.
Probability theory assumes logical omniscience; it assumes that reasoners know all consequences of the things they know. In reality, bounded reasoners are not logically omniscient: we can know precisely which machine the box implements and precisely how the machine works, and just not have the time to deduce where the ball comes out. We reason under logical uncertainty.
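The logical omniscience assumption can be made precise: the coherence conditions of standard probability theory force logical consequence to be respected, e.g.

    $$\varphi \vdash \psi \;\Rightarrow\; P(\psi) \ge P(\varphi), \qquad P(\top) = 1, \qquad P(\bot) = 0.$$

A bounded reasoner who has not yet carried out the relevant deduction cannot satisfy these constraints exactly; saying precisely what it should do instead is the problem of logical uncertainty.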
A formal theory of reasoning under logical uncertainty does not yet exist. Gaining this understanding is extremely important when it comes to constructing a highly reliable generally intelligent system: whenever an agent reasons about the behavior of complex systems, computer programs, or other agents, it must operate under at least a little logical uncertainty.
To understand the state of the art, a solid understanding of probability theory is a must; consider augmenting the first few chapters of Jaynes with Feller, chapters 1, 5, 6, and 9, and then study the following papers:
Soares & Fallenstein’s “Questions of reasoning under logical uncertainty” provides a general introduction, explaining the field of logical uncertainty and motivating its relevance to MIRI’s research program.
Gaifman’s “Concerning measures in first-order calculi” looked at this problem many years ago. Gaifman has largely focused on a relevant subproblem, which is the assignment of probabilities to different models of a formal system (assuming that once the model is known, all consequences of that model are known). We are now attempting to expand this approach to a more complete notion of logical uncertainty (where a reasoner can know what the model is but not know the implications of that model), but work by Gaifman is still useful to gain a historical context and an understanding of the difficulties surrounding logical uncertainty.
Hutter et al.’s “Probabilities on sentences in an expressive logic” largely looks at the problem of logical uncertainty assuming access to infinite computing power (and many levels of halting oracles). Understanding Hutter’s approach (and what can be done with infinite computing power) helps flesh out our understanding of where the difficult questions lie.
Demski’s “Logical prior probability” provides a computably approximable logical prior. Following Demski, our work largely focuses on the creation of an approximable prior probability distribution over logical sentences, as the act of refining and approximating a logical prior is very similar to the act of reasoning under logical uncertainty in general.
Christiano’s “Non-omniscience, probabilistic inference, and metamathematics” largely follows this approach. This paper provides some early practical considerations about the generation of logical priors, and highlights a few open problems.
For more historical work on this problem, see Gaifman’s “Probabilities over rich languages…” and “Reasoning with limited resources and assigning probabilities to arithmetic statements.”
Vingean Reflection
Much of what makes the AI problem unique is that a sufficiently advanced system will be able to do higher-quality science and engineering than its human programmers. Many of the possible hazards and benefits of an advanced system stem from its potential to bootstrap itself to higher levels of capability, possibly leading to an intelligence explosion.
If an agent achieves superintelligence via recursive self-improvement, then the impact of the resulting system depends entirely upon the ability of the initial system to reason reliably about agents that are more intelligent than itself. What sort of reasoning methods could a system use in order to justify extremely high confidence in the behavior of a yet more intelligent system? We refer to this sort of reasoning as “Vingean reflection”, after Vernor Vinge (1993), who noted that it is not possible in general to precisely predict the behavior of agents which are more intelligent than the reasoner.
A reasoner performing Vingean reflection must necessarily reason abstractly about the more intelligent agent. This will almost certainly require some form of high-confidence logically uncertain reasoning, but in lieu of a working theory of logical uncertainty, reasoning about proofs (using formal logic) is the best available formalism for studying abstract reasoning. As such, a modern study of Vingean reflection requires a background in formal logic:
- First-Order Logic: MIRI’s existing toy models for studying self-modifying agents are largely based on this logic. Understanding the nuances of first-order logic is crucial for using the tools we have developed for studying formal systems capable of something approaching confidence in similar systems.
We study Vingean reflection by constructing toy models of agents which are able to gain some form of confidence in highly similar systems. To get to the cutting edge, read the following papers:
Fallenstein & Soares’ “Vingean reflection: Reliable reasoning for self-improving agents” introduces the field of Vingean reflection, and motivates its connection to MIRI’s research program.
Yudkowsky’s “The procrastination paradox” goes into more detail on the need for satisfactory solutions to walk a fine line between the Löbian obstacle (a problem stemming from too little “self-trust”) and the unsoundness that comes from too much self-trust.
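For reference, the Löbian obstacle stems from Löb’s theorem: for any sentence φ, if a theory T proves that a T-proof of φ would imply φ, then T already proves φ,

    $$T \vdash (\Box_T \varphi \rightarrow \varphi) \;\Longrightarrow\; T \vdash \varphi.$$

So a consistent theory cannot prove the soundness schema “□_T φ → φ” for all φ (that would make it prove everything), and an agent reasoning in T cannot naively trust the conclusions of a successor that reasons in T or anything stronger.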
Christiano et al.’s “Definability of truth in probabilistic logic” describes an early attempt to create a formal system that can reason about itself while avoiding paradoxes of self-reference. It avoids the classic paradoxes, but the resulting system has since been shown to be unsound (see “Procrastination in probabilistic logic” below). My walkthrough for this paper may help put it into a bit more context.
Fallenstein & Soares’ “Problems of self-reference in self-improving space-time embedded intelligence” describes our simple suggester-verifier model for studying agents that produce slightly improved versions of themselves, or ’tile’ themselves. The paper demonstrates a toy scenario in which sound agents can successfully tile to (e.g., gain high confidence in) other similar agents.
Yudkowsky & Herreshoff’s “Tiling agents for self-modifying AI” is an older, choppier introduction to Vingean reflection which may be easier to work through using my walkthrough.
If you’re excited about this research topic, there are a number of other relevant tech reports. Unfortunately, most of them don’t explain their motivations well, and have not yet been put into their greater context.
Fallenstein’s “Procrastination in probabilistic logic” illustrates how Christiano et al.’s probabilistic reasoning system is unsound and vulnerable to the procrastination paradox. Yudkowsky’s “Distributions allowing tiling…” takes some early steps towards probabilistic tiling settings.
Fallenstein’s “Decreasing mathematical strength…” describes one unsatisfactory property of Parametric Polymorphism, a partial solution to the Löbian obstacle. Soares’ “Fallenstein’s monster” describes a hackish formal system which avoids the above problem. It also showcases a mechanism for restricting an agent’s goal predicate which can also be used by Parametric Polymorphism to create a less restrictive version of PP than the one explored in the tiling agents paper. Fallenstein’s “An infinitely descending sequence of sound theories…” describes a more elegant partial solution to the Löbian obstacle, which is now among our favored partial solutions.
An understanding of recursive ordinals provides a useful context from which to understand these results, and can be gained by reading Franzén’s “Transfinite progressions: A second look at completeness.”
Corrigibility
As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences.
This field of research is basically brand-new, so all it takes in order to get up to speed is to read a paper or two:
Soares et al.’s “Corrigibility” introduces the field at large, along with a few open problems.
Armstrong’s “Proper value learning through indifference” discusses one potential approach for making agents indifferent between which utility function they maximize, which is a small step towards agents that allow themselves to be modified.
Our current work on corrigibility focuses mainly on a small subproblem known as the “shutdown problem”: how do you construct an agent that shuts down upon the press of a shutdown button, and which does not have incentives to cause or prevent the pressing of the button? Within that subproblem, we currently focus on the utility indifference problem: how could you construct an agent which allows you to switch which utility function it maximizes, without giving it incentives to affect whether the switch occurs? Even if we had a satisfactory solution to the utility indifference problem, this would not yield a satisfactory solution to the shutdown problem, as it still seems difficult to adequately specify “shutdown behavior” in a manner that is immune to perverse instantiation. Stuart Armstrong has written several blog posts about the specification of “reduced impact” AGIs:
Armstrong’s “The mathematics of reduced impact: help needed” lists initial ideas for specifying reduced-impact agents, and his “Reduced impact in practice: randomly sampling the future” sketches a simple method for assessing whether a future has been impacted.
These first attempts are not yet a full solution, but they should get you up to speed on our current understanding of the problem.
Early work in corrigibility can be found on the web forum LessWrong. Most of the relevant results are captured in the above papers. One of the more interesting examples is “Cake or Death,” an illustration of the “motivated value selection” problem: an agent with uncertainty about its utility function can benefit from avoiding information that would reduce that uncertainty.
Armstrong’s “Utility indifference” outlines the original utility indifference idea, and is largely interesting for historical reasons. It is subsumed by the “Proper value learning through indifference” paper linked above.
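To fix intuitions, here is one common way the indifference idea is sketched; the details differ across the proposals above, so treat this only as a schematic:

    $$U(o) \;=\; \begin{cases} U_N(o) & \text{if the shutdown button is not pressed in } o, \\ U_S(o) + \theta & \text{if the button is pressed in } o, \end{cases}$$

where U_N is the “normal” utility function, U_S is the shutdown utility function, and θ is a compensation term chosen so that, at the moment of a potential press, the agent’s expected utility is the same whether or not the button gets pressed. The intent is that the agent then has no instrumental incentive to cause or prevent the press; the open problems concern making this precise and well-behaved, and specifying U_S itself.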
Value Learning
Since our own understanding of our values is fuzzy and incomplete, perhaps the most promising approach for loading values into a powerful AI is to specify a criterion for the agent to learn our values incrementally. But this presents a number of interesting problems:
Say you construct a training set containing many outcomes filled with happy humans (labeled “good”) and other outcomes filled with sad humans (labeled “bad”). The simplest generalization from this data might be that humans really like human-shaped smiling things: an agent that generalizes this way may then try to build many tiny animatronic happy-looking people.
Value learning must be an online process: the system must be able to identify ambiguities and raise queries about these ambiguities to the user. It must not only identify cases that it doesn’t know how to classify (such as cases where it cannot tell whether a face looks happy or sad), but also identify dimensions along which the training data gives no information (such as when your training data never shows outcomes filled with human-shaped automatons that look happy, labeled as worthless).
Of course, ambiguity identification alone isn’t enough: you don’t want a system that spends the first three weeks asking for clarification on whether humans are still worthwhile when they are at different elevations, or when the wind is blowing, before finally (after the operators have stopped paying attention) asking whether it’s important that the human-shaped things be acting of their own will.
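As a cartoon of what ambiguity identification might look like, here is a minimal active-learning-style sketch (the feature encoding, confidence score, and threshold are all invented for this illustration; it is not a MIRI proposal):

    # Hypothetical feature encoding for an outcome: (looks_happy, is_actually_a_human)
    labeled = {(1, 1): "good", (0, 1): "bad"}      # happy humans good, sad humans bad
    candidates = [(1, 1), (0, 1), (1, 0), (0, 0)]  # (1, 0) is a happy-looking automaton

    def predict(x):
        # Nearest-neighbour guess with a crude confidence score: confidence drops
        # as the candidate moves away from everything in the training set.
        best = min(labeled, key=lambda y: sum(abs(a - b) for a, b in zip(x, y)))
        distance = sum(abs(a - b) for a, b in zip(x, best))
        return labeled[best], 1.0 / (1.0 + distance)

    for x in candidates:
        label, confidence = predict(x)
        if confidence < 1.0:
            # Naive generalization would quietly label the happy automaton "good";
            # an ambiguity-identifying learner asks the operator instead.
            print(f"query the operator about {x} (guess: {label}, confidence {confidence:.2f})")
        else:
            print(f"{x}: {label}")

The happy-looking automaton is exactly the case the training data never covered, so it gets flagged for a query rather than silently inheriting the label “good.” The hard part, as the preceding paragraph notes, is asking the queries that matter rather than every query the system can generate.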
In order for an agent to reliably learn our intentions, the agent must be constructing and refining a model of its operator and using that model to inform its queries and alter its preferences. To learn more about these problems and others, see the following:
Soares’ “The value learning problem” provides a general overview of a few open problems related to value learning.
Dewey’s “Learning what to value” further discusses the difficulty of value learning.
The orthogonality thesis argues that value learning will not be solved by default.
MacAskill’s “Normative Uncertainty” provides a framework for discussing normative uncertainty. Be warned, the full work, while containing many insights, is very long. You can get away with skimming parts and/or skipping around some, especially if you’re more excited about other areas of active research.
One approach to resolving normative uncertainty is Bostrom & Ord’s “parliamentary model,” which suggests that value learning is somewhat equivalent to a voter aggregation problem, and that many value learning systems can be modeled as parliamentary voting systems (where the voters are possible utility functions).
Owen Cotton-Barratt’s “Geometric reasons for normalising…” discusses the normalization of utility functions; this is relevant to toy models of reasoning under moral uncertainty.
Fallenstein & Stiennon’s “Loudness” discusses a concern with aggregating utility functions stemming from the fact that the preferences encoded by utility functions are preserved under positive affine transformation (e.g. as the utility function is scaled or shifted). This implies that special care is required in order to normalize the set of possible functions.
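The underlying issue is easy to state: a VNM utility function is only defined up to positive affine transformation, so for any a > 0 and any b,

    $$u'(x) \;=\; a\,u(x) + b$$

represents exactly the same preferences as u. Any aggregation rule of the form Σ_i w_i u_i therefore depends on an arbitrary choice of scale for each u_i, which is why normalization schemes, and worries about some utility functions being “louder” than others, arise.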
Other Tools
Mastery in any subject can be a very powerful tool, especially in the realm of mathematics, where seemingly disjoint topics are actually deeply connected. Many fields of mathematics have the property that if you understand them very very well, then that understanding is useful no matter where you go. With that in mind, while the subjects listed below are not necessary in order to understand MIRI’s active research, an understanding of each of these subjects constitutes an additional tool in the mathematical toolbox that will often prove quite useful when doing new research.
- Discrete Math: Textbook available online. Most math studies either continuous or discrete structures. Many people find discrete mathematics more intuitive, and a solid understanding of discrete mathematics will help you gain a quick handle on the discrete versions of many other mathematical tools, such as group theory, topology, and information theory.
- Type Theory: Set theory commonly serves as the foundation for modern mathematics, but it’s not the only available candidate. Type theory can also serve as a foundation for mathematics, and in many cases, type theory is a better fit for the problems at hand. Type theory also bridges much of the theoretical gap between computer programs and mathematical proofs, and is therefore often relevant to certain types of AI research.
- Program Verification: Program verification techniques allow programmers to become confident that a specific program will actually act according to some specification. (It is, of course, still difficult to validate that the specification describes the intended behavior.) While MIRI’s work is not currently concerned with verifying real-world programs, it is quite useful to understand what modern program verification techniques can and cannot do.
Understanding the Mission
Why do this kind of research in the first place?