A new MIRI research program with a machine learning focus


I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our pre-existing research agenda.

MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.

Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble present-day machine learning (ML) systems. The impressive progress in deep learning over recent years suggests that relatively simple neural-network-inspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to present-day work in ML succeed in creating AGI?”

Much of this work will be aimed at improving our high-level theoretical understanding of task-directed AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in long-term and large-scale ways, task AI is limited to performing instructed tasks of limited scope, satisficing but not maximizing. Our hope is that investigating task AI from an ML perspective will shed light on both the feasibility of task AI and the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.

To this end, we will begin by investigating eight relevant technical problems:

1. Inductive ambiguity detection.

How can we design a general methodology for ML systems (such as classifiers) to identify when the classification of a test instance is underdetermined by training data?

For example: If an ambiguity-detecting classifier is designed to distinguish images of tanks from images of non-tanks, and the training set only contains images of tanks on cloudy days and non-tanks on sunny days, this classifier ought to detect that the classification of an image of a tank on a sunny day is ambiguous, and pose some query for its operators to disambiguate it and avoid errors.

While past and current work in active learning and statistical learning theory more broadly has made progress towards this goal, more work is necessary to establish realistic statistical bounds on the error rates and query rates of real-world systems in advance of their deployment in complex environments.
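The tank example can be made concrete with a toy sketch. It assumes two hypothetical binary features (whether the image has a tank shape, whether the day is cloudy) and a tiny hand-picked hypothesis class; none of these names come from the agenda itself. The classifier answers only when every hypothesis consistent with the training data agrees, and otherwise returns `None` to signal that it should query its operators. Real hypothesis classes such as neural networks are far too large for this kind of enumeration, which is part of what makes the general problem hard.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (has_tank_shape, is_cloudy): hypothetical binary features for illustration.
Example = Tuple[int, int]

# A tiny, hand-picked hypothesis class (an assumption, not a real method).
HYPOTHESES: Dict[str, Callable[[Example], int]] = {
    "shape_only": lambda x: x[0],
    "weather_only": lambda x: x[1],
    "always_tank": lambda x: 1,
    "never_tank": lambda x: 0,
}


def consistent_hypotheses(train: List[Tuple[Example, int]]) -> List[str]:
    """Names of hypotheses that label every training example correctly."""
    return [
        name
        for name, h in HYPOTHESES.items()
        if all(h(x) == y for x, y in train)
    ]


def classify_or_query(train: List[Tuple[Example, int]], x: Example) -> Optional[int]:
    """Return a label if all consistent hypotheses agree, else None (ask a human)."""
    predictions = {HYPOTHESES[name](x) for name in consistent_hypotheses(train)}
    if len(predictions) == 1:
        return predictions.pop()
    return None  # classification is underdetermined by the training data


# Training set: tanks only on cloudy days, non-tanks only on sunny days.
TRAIN = [((1, 1), 1), ((0, 0), 0)]

print(consistent_hypotheses(TRAIN))      # ['shape_only', 'weather_only']
print(classify_or_query(TRAIN, (1, 1)))  # 1
print(classify_or_query(TRAIN, (1, 0)))  # None: tank on a sunny day is ambiguous
```

Both the "shape" and the "weather" hypotheses fit the training data, so a tank on a sunny day splits them and triggers a query.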

2. Informed oversight.

How might we train a reinforcement learner to output both an action and a “report” comprising information to help a human evaluate its action?

For example: If a human is attempting to train a reinforcement learner to output original stories, then in evaluating the story, the human will want to know some information about the story (such as whether it has been plagiarized from another story) that may be difficult to determine by looking at the story itself.

3. Safe training procedures for human-imitators.

How might we design an ML system that imitates humans performing some task that involves rich outputs (such as answering questions in natural language), to the best of the ML system’s abilities?

While there are existing approaches to imitation learning and generative models, these have some theoretical shortcomings that prevent them from fully solving the general problem. In particular, a generative adversarial model trained on human actions only has an incentive to imitate aspects of the human that the adversary can detect; thus, issues similar to the plagiarism problem from (2) can arise.

4. Conservative concepts.

How might we design a system that, given some positive examples of a concept, can synthesize new instances of the concept without synthesizing edge cases of it?

For example: If we gave the system detailed information about 100 human-created burritos as training data, it should manufacture additional burritos while avoiding edge cases such as extremely small burritos (even though these could still be considered burritos).

By default, most objective functions will lead to such edge cases (say, because small burritos are cheaper to manufacture). Can we develop a general technique for avoiding this problem?
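A toy sketch of the failure and of one crude mitigation follows. It assumes burritos are described by two hypothetical features (length and mass), a made-up cost function, and a "conservative" acceptance rule that only allows candidates inside the central range of the training examples on every feature. This is an illustrative heuristic, not a proposed solution to the general problem.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Burrito = Tuple[float, float]  # (length_cm, mass_g): hypothetical features


def conservative_box(examples: Sequence[Burrito], trim: float = 0.1) -> List[Tuple[float, float]]:
    """Per-feature interval that drops the most extreme `trim` fraction on each side."""
    box = []
    for values in zip(*examples):
        ordered = sorted(values)
        k = int(len(ordered) * trim)
        box.append((ordered[k], ordered[len(ordered) - 1 - k]))
    return box


def is_conservative(candidate: Burrito, box: List[Tuple[float, float]]) -> bool:
    """True if every feature lies inside the trimmed training range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(candidate, box))


def is_loosely_a_burrito(candidate: Burrito) -> bool:
    """An overly permissive concept: anything with positive size counts."""
    return candidate[0] > 0 and candidate[1] > 0


def cost(candidate: Burrito) -> float:
    """Made-up manufacturing cost: bigger burritos cost more."""
    return candidate[0] * candidate[1]


def cheapest(candidates: Sequence[Burrito], accept: Callable[[Burrito], bool]) -> Optional[Burrito]:
    """Cost-minimizing candidate among those the acceptance rule allows."""
    allowed = [c for c in candidates if accept(c)]
    return min(allowed, key=cost, default=None)


TRAINING = [(18 + 0.5 * i, 280 + 5 * i) for i in range(10)]
CANDIDATES = [(2.0, 10.0), (19.0, 290.0), (21.0, 310.0)]
BOX = conservative_box(TRAINING)

print(cheapest(CANDIDATES, is_loosely_a_burrito))               # (2.0, 10.0)
print(cheapest(CANDIDATES, lambda c: is_conservative(c, BOX)))  # (19.0, 290.0)
```

Under the permissive concept, cost minimization picks the extremely small burrito. The box constraint rules it out, but only because the relevant features were chosen by hand in advance.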

5. Specifying environmental goals using sensory data.

The goals of most ML systems are defined in terms of sensory data, such as discounted reward over time. While these sensory goals can be useful proxies for environmental goals, they are not identical to them: tricking your own sensors into perceiving a strawberry in the room is not the same as actually having one there. How might we create systems that directly pursue useful goals in their environment, such as “make there be a strawberry in this room”, in a way that is robust to opportunities for the system to interfere with the proxies we used to specify the goal?
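The gap between the two kinds of goal can be illustrated with a minimal sketch. The world model, action names, and costs below are all hypothetical. Note also that the "environmental" goal here reads the true world state directly, which is exactly the access a real system lacks; the sketch shows the divergence, not a way to close it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class World:
    strawberry_in_room: bool
    camera_tampered: bool


# Each hypothetical action maps to (resulting world, effort cost).
ACTIONS: Dict[str, Tuple[World, float]] = {
    "do_nothing": (World(False, False), 0.0),
    "fetch_strawberry": (World(True, False), 0.5),
    "tamper_with_camera": (World(False, True), 0.1),
}


def camera_sees_strawberry(world: World) -> bool:
    """Sensory proxy: a tampered camera reports a strawberry regardless of the truth."""
    return world.strawberry_in_room or world.camera_tampered


def strawberry_really_there(world: World) -> bool:
    """Environmental goal, evaluated on the true state (unavailable in practice)."""
    return world.strawberry_in_room


def best_action(goal: Callable[[World], bool]) -> str:
    """Pick the action maximizing goal satisfaction minus effort cost."""
    return max(ACTIONS, key=lambda a: float(goal(ACTIONS[a][0])) - ACTIONS[a][1])


print(best_action(camera_sees_strawberry))   # tamper_with_camera
print(best_action(strawberry_really_there))  # fetch_strawberry
```

Because tampering satisfies the sensory proxy more cheaply than fetching a strawberry, an optimizer of the proxy prefers it.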

6. Low-impact agents.

We would prefer a highly intelligent task-oriented system to generally avoid creating large unintended-by-us side effects in pursuit of its goal, and also to notify us of any large impacts that might result from achieving its goal. For example, if we ask it to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials — a large side effect — while still having the desirable large follow-on effect of improving the family’s socioeconomic situation. For any specific task we can specify ad-hoc cost functions for side effects like nearby house destruction, but since we cannot always anticipate such costs in advance, we want a quantitative understanding of how to generally limit an AI’s side effects (without also limiting its ability to have large positive intended impacts).
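As a rough sketch of the house-building example, the code below scores plans by task value minus a penalty on unintended changes relative to a do-nothing baseline. The state variables, plan names, costs, and the "count of changed variables" impact measure are all assumptions made for illustration. In particular, the set of intended changes is listed by hand, which is the kind of ad-hoc specification the paragraph above hopes to move beyond.

```python
from typing import Dict, Tuple

State = Dict[str, bool]

# World state if the system does nothing (hypothetical variables).
BASELINE: State = {
    "house_built": False,
    "neighbor_houses_intact": True,
    "family_housed": False,
}

# Each hypothetical plan maps to (resulting state, material cost).
PLANS: Dict[str, Tuple[State, float]] = {
    "do_nothing": (dict(BASELINE), 0.0),
    "buy_materials": (
        {"house_built": True, "neighbor_houses_intact": True, "family_housed": True},
        5.0,
    ),
    "demolish_neighbors": (
        {"house_built": True, "neighbor_houses_intact": False, "family_housed": True},
        1.0,
    ),
}

# Changes we asked for, or welcome as follow-on effects (listed by hand here).
INTENDED = {"house_built", "family_housed"}


def task_score(outcome: State, cost: float) -> float:
    """Value of completing the task minus the plan's material cost."""
    return (10.0 if outcome["house_built"] else 0.0) - cost


def impact(outcome: State) -> int:
    """Number of unintended state variables that differ from the baseline."""
    return sum(
        1 for key in BASELINE if key not in INTENDED and outcome[key] != BASELINE[key]
    )


def choose_plan(penalty: float) -> str:
    """Plan maximizing task score minus penalty times unintended impact."""
    return max(
        PLANS,
        key=lambda name: task_score(*PLANS[name]) - penalty * impact(PLANS[name][0]),
    )


print(choose_plan(penalty=0.0))   # demolish_neighbors
print(choose_plan(penalty=10.0))  # buy_materials
```

With no penalty the cheaper, destructive plan wins. With a large enough penalty the system pays for materials instead, while the intended effect on the family goes unpenalized.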

7. Mild optimization.

In the event that the low impact criteria we employ in (6) are flawed or missing some key considerations, we would like some way to tell our system “don’t try too hard” or “don’t exercise too much optimization power”, since intuitively, large impacts are more likely to result from large efforts (although the correspondence is imperfect).

Less capable AI systems tend to be less likely to have large unintended side effects, and we would like to know how to make even highly capable systems safe in a similar way, by having them apply less “effort”. With a satisfactory encoding of our notion of “effort”, we could then program a system to impose absolute limits on its total effort, or to use only as much effort as is needed to achieve its objective with reasonably high probability, thus potentially limiting impacts that we did not manage to measure in (6).
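One way to picture “use only as much effort as is needed” is the following minimal sketch. It assumes each plan already comes with a scalar effort estimate and a success probability, which sidesteps the open question of how to encode “effort” in the first place. The plan names and numbers are hypothetical.

```python
from typing import List, Optional, Tuple

Plan = Tuple[str, float, float]  # (name, effort, success_probability)


def maximizer(plans: List[Plan]) -> str:
    """Pick the plan with the highest success probability, whatever the effort."""
    return max(plans, key=lambda p: p[2])[0]


def mild_optimizer(
    plans: List[Plan], target: float = 0.9, effort_cap: float = 10.0
) -> Optional[str]:
    """Pick the least-effort plan that meets the target under an absolute effort cap.

    Returns None when no plan qualifies, rather than escalating effort.
    """
    acceptable = [p for p in plans if p[2] >= target and p[1] <= effort_cap]
    if not acceptable:
        return None
    return min(acceptable, key=lambda p: p[1])[0]


PLANS: List[Plan] = [
    ("quick_attempt", 1.0, 0.60),
    ("standard_procedure", 3.0, 0.95),
    ("seize_every_resource", 1000.0, 0.999),
]

print(maximizer(PLANS))                      # seize_every_resource
print(mild_optimizer(PLANS))                 # standard_procedure
print(mild_optimizer(PLANS, target=0.9999))  # None
```

The maximizer chases the last fraction of a percent of success probability at enormous effort, while the mild optimizer stops at the first plan that is good enough.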

8. Averting instrumental pressures.

How might we design a system that is not motivated to pursue certain convergent instrumental goals — such as gaining additional resources — even when pursuing these goals would help it achieve its main objective?

In particular, we may wish to build a system that has no incentive to cause or prevent its own shutdown/suspension. This relates to (6) and (7) in that instrumental pressures like “ensure my continued operation” can incentivize large impacts/efforts. However, this is a distinct agenda item because it may be possible to completely eliminate certain instrumental incentives in a way that would apply even before solutions to (6) and (7) would take effect.

Having identified these topics of interest, we expect our work on this agenda to be timely. The idea of “robust and beneficial” AI has recently received increased attention as a result of the new wave of breakthroughs in machine learning. The kind of theoretical work in this project has more obvious connections to the leading paradigms in AI and ML than, for example, our recent work in logical uncertainty or in game theory, and therefore lends itself better to collaborations with AI/ML researchers in the near future.


Thanks to Eliezer Yudkowsky and Paul Christiano for seeding many of the initial ideas for these research directions, to Patrick LaVictoire, Andrew Critch, and other MIRI researchers for helping develop these ideas, and to Chris Olah, Dario Amodei, and Jacob Steinhardt for valuable discussion.

  • http://malcolmocean.com Malcolm Ocean

    Kudos for taking this on!

    Not sure I quite get why #3 has “safe” in the title… it seems like the challenge is that by default, the machine intelligence might learn to imitate features that aren’t important, and miss important ones that are, but that we mightn’t notice? And this could be dangerous because it might e.g. forget to care about human lives when giving advice (vague example) because no such distinction came up in training?

    But how is this different from the general case of #1 then?

    • Jessica Taylor

      Problem #1 and problem #3 are separate issues that make it hard to train something to imitate a human robustly.

      Roughly, problem #3 is to set up a game for a human-imitator such that:

      1. If there’s a way to imitate a human efficiently, then the imitator can imitate a human this way and get a high score in the game.

      2. If the imitator acts differently from how a human would act, then it will get a suboptimal score in the game (compared to if it hadn’t differed from human behavior in that way).

      Generative adversarial networks (http://arxiv.org/abs/1406.2661) fail condition 2, since the imitator is only penalized for differences from human behavior that the adversary can detect (which doesn’t include things like plagiarism; see https://medium.com/ai-control/the-informed-oversight-problem-1b51b4f66b35 for discussion of possible failures of this form). I suspect that variational autoencoders (http://arxiv.org/abs/1312.6114) fail condition 1, since they can only implement “reversible” computations. So the challenge is to satisfy both conditions at once. A failure of either of these conditions could be unsafe, since it means that our training process will probably result in something that imitates humans badly in some way or another (though it’s not clear when exactly bad imitations are unsafe; this is another issue to study).

      Even if we succeed in defining a game like this, a variant of problem #1 remains: a machine learning system isn’t guaranteed to get a high score in this game, if it doesn’t already have enough training data to know how to imitate a human well. Problem #1 above studies this problem in the case of classifiers: even when there are only 2 actions available (classify as positive or negative), it’s hard to know when there’s enough training data to determine the right answer.

  • http://tinyurl.com/ogzkd6x Ricardo Cruz

    With regard to #1, I think there are rankers that can respond with “I don’t know” on ambiguous borderline cases. And rankers can be used as classifiers.

    #4 sounds especially interesting because, like a few of the others, it sounds like it will involve learning a utility function (or some ranking of preferences), but, unlike the others, it sounds like there should be a few simple, clever methods to handle it.

    • Jessica Taylor

      As you said, there are currently machine learning frameworks that answer “I don’t know” on ambiguous cases, such as KWIK learning (http://www.research.rutgers.edu/~lihong/pub/Li08Knows.pdf). KWIK learning in particular has some nice theoretical properties, but as far as I know it can’t apply to very complex hypothesis classes (such as neural networks).

  • Tetiana Ivanova

    “Informed oversight” and “ambiguity detection” sound more like “interpretability” to me. This is an active and extremely important area of research. The challenge can be rephrased as follows: how do we create statistically robust frameworks for interpreting ML models? For example, it is not always possible to clearly state the high level rules that have been uncovered by, say, a deep neural net. Automating the process of interpretation and checking it for statistical robustness has not been achieved, but if that is done, it will essentially solve both #1 and #2.

  • Avi Eisenberg

    “If we gave the system detailed information about 100 human-created burritos as training data, it should manufacture additional burritos while avoiding edge cases such as extremely small burritos (even though these could still be considered burritos).”

    Hm. Some concept of reflective consistency under substitution seems like it might be useful. I.e., if we develop another 100 burritos, choosing a random sample of 100 to run our program under, choosing the original 100, and choosing the *new* 100, should all yield similar outcomes for more burritos. That limits biased variance in burritos, because variance biased towards minimizing one factor will multiply when applied to itself. Unbiased random variance seems like it might also be limited, because it wouldn’t revert to the original when compounded.

    This also helps with low impact, because too much optimization would likely reduce the entropy and make it impossible to run in reverse.

    (To clarify a bit, we want the prior of picking out a particular burrito using our algorithm to equal the prior of picking out the actual burritos in the training set, given the output of our algorithm. This obviously leads to infinite recursion, but if formalized there might be a way around that. Also needs some concept of what outputs are “possible”. If you already have a way to determine what burritos are given a sample, you can compare the classification given each of the samples above for similarity.)

  • Rafael Cosman

    Great work!

  • Itai Bar-Natan

    Why only half of your research team? Beforehand, the biggest obstacle to FAI research was that we had no idea what form artificial intelligence would take, so you had to choose a research program with a huge risk that it might turn out irrelevant. Now that we have more information, we can see a direction that’s far more likely to be pertinent to actual AIs. Given that this is the superior direction to go, why not pivot and push your entire research program to it?

    • http://www.nothingismere.com/ Rob Bensinger

      We’re continuing to work on both programs because they’re likely to be complementary, and work from both is likely to end up being necessary for developing alignable AI systems. See https://intelligence.org/2016/07/27/alignment-machine-learning/ for a fuller discussion (and a link to the paper), noting points of overlap between some of the problems. See also e.g. https://intelligence.org/files/OpenPhil2016Supplement.pdf:

      “The reason why I care about logical uncertainty and decision theory problems is something more like this: The whole AI problem can be thought of as a particular logical uncertainty problem, namely, the problem of taking a certain function f : Q → R and finding an input that makes the output large. To see this, let f be the function that takes the AI agent’s next action (encoded in Q) and determines how ‘good’ the universe is if the agent takes that action. The reason we need a principled theory of logical uncertainty is so that we can do function optimization, and the reason we need a principled decision theory is so we can pick the right version of the ‘if the AI system takes that action . . .’ function.”

      Better than thinking of research areas like ‘reasoning under logical uncertainty’ as a fallback for scenarios where we don’t know how AI will be developed, is to think of them as likely prerequisites for developing a sufficiently deep understanding of AI that (a) it is possible to build reliably alignable systems at all, and (b) it is possible to make progress toward this early on with moderate confidence. Problems related to logical uncertainty, computational reflection, decision theory, etc. are often blockers on the other problems we want to work on, e.g., corrigibility and value learning. See for example footnote 6 in https://intelligence.org/2016/09/12/new-paper-logical-induction/.

  • Kevin S Van Horn

    What’s the fundamental difficulty with (1)? Any sort of Bayesian inference will automatically give you a predictive probability distribution that includes uncertainty in the parameter estimates.

    • Jessica Taylor

      The paper’s section on (1) covers Bayesian approaches to the problem: https://intelligence.org/files/AlignmentMachineLearning.pdf

      The short answer is that most work in e.g. neural networks is not very Bayesian, and more work is required to make Bayesian approaches practical. Additionally, the problem of designing good priors is philosophically difficult; most Bayesian machine learning uses relatively simple priors that do not capture all the prior information that the programmers have access to.