What is it? A talk by Eliezer Yudkowsky given at Stanford University on May 5, 2016, for the Symbolic Systems Distinguished Speaker series.

Talk: Full video.

Transcript: Full (including Q&A), partial (including select slides).

Slides without transitions: High-quality, low-quality.

Slides with transitions: High-quality, low-quality.

Abstract: If we can build sufficiently advanced machine intelligences, what goals should we point them at? The frontier open problems on this subject are less, “A robot may not injure a human, nor through inaction allow a human to come to harm,” and more, “If you could formally specify the preferences of an arbitrarily smart and powerful agent, could you get it to safely move one strawberry onto a plate?” This talk will discuss some of the open technical problems in AI alignment, the probable difficulties that make those problems hard, and the bigger picture into which they fit; as well as what it’s like to work in this relatively new field.


Notes, references, and resources for learning more are collected here.


Agents and their utility functions

The best general introductions to the topic of smarter-than-human artificial intelligence are plausibly Nick Bostrom’s Superintelligence and Stuart Armstrong’s Smarter Than Us. For a shorter explanation, see my recent guest post on EconLog.

A fuller version of Stuart Russell’s quotation (from Edge.org):


There have been many unconvincing arguments [for worrying about AI disasters] — especially those involving blunt applications of Moore’s law or the spontaneous emergence of consciousness and evil intent. Many of the contributors to this conversation seem to be responding to those arguments and ignoring the more substantial arguments proposed by Omohundro, Bostrom, and others.

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

  1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
  2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
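
Russell's point about unconstrained variables is easy to see in a toy optimization problem. The sketch below is purely illustrative (the numbers, the bounds, and the choice of a linear program are arbitrary, not from the talk): the objective scores only two of six coupled variables, and a standard solver pushes everything the objective doesn't mention to the edge of its allowed range.

```python
# Toy model: the objective scores only 2 of 6 world-variables, but a shared
# resource couples all of them. The optimum piles everything onto the scored
# variables and drives the ignored ones to the boundary of their range.
import numpy as np
from scipy.optimize import linprog

n = 6
c = np.zeros(n)
c[0], c[1] = -2.0, -1.0            # linprog minimizes, so negate: maximize 2*x0 + x1
A_eq = np.ones((1, n))             # one shared resource: x0 + ... + x5 = 10
b_eq = [10.0]
bounds = [(0.0, 10.0)] * n         # each variable may range over [0, 10]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
print(res.x)                       # roughly [10, 0, 0, 0, 0, 0]
# Everything the objective doesn't mention is pushed to its lower bound. If
# x[2] had been something we cared about, the "optimal" plan is a disaster.
```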


Isaac Asimov introduced the Three Laws of Robotics in the 1942 short story “Runaround.”

The Peter Norvig / Stuart Russell quotation is from Artificial Intelligence: A Modern Approach, the top undergraduate textbook in AI.

The arguments I give for having a utility function are standard, and can be found in e.g. Poole and Mackworth’s Artificial Intelligence: Foundations of Computational Agents. I write about normative rationality at greater length in Rationality: From AI to Zombies (e.g., in The Allais Paradox and Zut Allais!).
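
As a quick worked version of one of those standard arguments: the common Allais preference pattern cannot be produced by any expected utility maximizer. The check below uses the classic dollar amounts, chosen for illustration only.

```python
# Classic Allais gambles (u is any utility function with u($0) = 0):
#   1A: $1M for sure                 1B: 10% $5M, 89% $1M, 1% $0
#   2A: 11% $1M, 89% $0              2B: 10% $5M, 90% $0
# The common pattern (prefer 1A over 1B and also 2B over 2A) requires
#   0.11*u(1M) > 0.10*u(5M)   and   0.10*u(5M) > 0.11*u(1M),
# which no assignment of utilities can satisfy.
import random

def pattern_is_expected_utility(u_1m: float, u_5m: float) -> bool:
    prefers_1a = u_1m > 0.10 * u_5m + 0.89 * u_1m      # EU(1A) > EU(1B)
    prefers_2b = 0.10 * u_5m > 0.11 * u_1m             # EU(2B) > EU(2A)
    return prefers_1a and prefers_2b

assert not any(pattern_is_expected_utility(random.uniform(0, 1e3),
                                           random.uniform(0, 1e3))
               for _ in range(100_000))
print("No utility function reproduces the common Allais preferences.")
```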


Some AI alignment subproblems

My discussion of low-impact agents borrows from a forthcoming research proposal by Taylor et al.: “Value Alignment for Advanced Machine Learning Systems.” For an overview, see Low Impact on Arbital.

The suspend problem is discussed (under the name “shutdown problem”) in Soares et al.’s “Corrigibility.” The stable policy proposal comes from Taylor’s Maximizing a Quantity While Ignoring Effect Through Some Channel.
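
A toy calculation (an illustrative sketch, not the paper's formalism; every number here is made up) shows the basic incentive at work:

```python
# A naive expected-utility maximizer compares policies by task utility alone,
# so "quietly disable the suspend button" beats "leave it alone".
TASK_UTILITY = 10.0       # utility if the agent runs to completion
SUSPENDED_UTILITY = 0.0   # task left unfinished
P_PRESS = 0.3             # chance the operators try to suspend the agent

def expected_utility(disable_button: bool) -> float:
    if disable_button:
        return TASK_UTILITY                       # suspension can't happen
    return (1 - P_PRESS) * TASK_UTILITY + P_PRESS * SUSPENDED_UTILITY

print(expected_utility(disable_button=True))      # 10.0
print(expected_utility(disable_button=False))     #  7.0, so "disable" wins
# Patching this by paying the agent a large bonus for shutting down on request
# just reverses the incentive (now it wants the button pressed); specifying a
# utility function that leaves the agent genuinely indifferent is the hard part.
```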

Edgar Allan Poe’s argument against the possibility of machine chess is from the 1836 essay “Maelzel’s Chess-Player.”

Fallenstein and Soares’ “Vingean Reflection” is currently the most up-to-date overview of work in goal stability. Other papers cited:


Why expect difficulty?

For more about orthogonal final goals and convergent instrumental strategies, see Bostrom’s “The Superintelligent Will” (also reproduced in Superintelligence). Benson-Tilsen and Soares’ “Formalizing Convergent Instrumental Goals” provides a toy model.
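
The flavor of that toy model can be conveyed with a much cruder sketch (the mapping from resources to reachable outcomes below is invented purely for illustration): whatever final goal the agent is handed, extra resources can only expand the set of outcomes it can reach.

```python
# Sample many random final goals (utility functions over outcomes); with more
# resources the agent can reach a superset of outcomes, so its best achievable
# outcome is never worse, whatever goal it happens to have.
import random

random.seed(0)
OUTCOMES = list(range(50))

def best_achievable(utility, resources):
    reachable = OUTCOMES[: 10 * resources]        # more resources, more options
    return max(utility[o] for o in reachable)

goals = [{o: random.random() for o in OUTCOMES} for _ in range(1000)]
helped_or_neutral = sum(
    best_achievable(u, resources=4) >= best_achievable(u, resources=1)
    for u in goals
)
print(helped_or_neutral / len(goals))             # 1.0
```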

The smile maximizer is based on a proposal by Bill Hibbard. This example and Jürgen Schmidhuber’s compressibility proposal are discussed more fully in Soares’ “The Value Learning Problem.” See also the Arbital pages on Edge Instantiation, Context Disaster, and Nearest Unblocked Strategy.

See the MIRI FAQ and GiveWell’s report on potential risks from advanced AI for quick explanations of why AI is likely to be able to surpass human cognitive capabilities, among other topics. Bensinger’s When AI Accelerates AI notes general reasons to expect capability speedup, while “Intelligence Explosion Microeconomics” delves into the specific question of whether self-modifying AI is likely to result in accelerating AI progress.

Muehlhauser notes the analogy between computer security and AI alignment research in AI Risk and the Security Mindset.


Where we are now

MIRI’s technical research agenda summarizes many of the field’s core open problems.

For more on conservatism, see the Arbital post Conservative Concept Boundary and Taylor’s Conservative Classifiers. Also on Arbital: introductions to mild optimization and act-based agents.
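
One proposal in the mild-optimization family is quantilizing. The sketch below is a simplified rendering with a uniform base distribution (Taylor's quantilizer proposal is more careful about the choice of base distribution): instead of taking the single highest-scoring action, sample from the top q fraction of actions, which bounds how hard the objective gets optimized.

```python
# Sample uniformly from the top-q fraction of actions by estimated utility,
# instead of always taking the argmax.
import random

def quantilize(actions, utility, q=0.1):
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return random.choice(ranked[:cutoff])

actions = list(range(1000))
print(quantilize(actions, utility=lambda a: a, q=0.05))   # an action in the top 5%
```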

Papers cited in the slides:

Email contact@intelligence.org if you have any questions, and see intelligence.org/get-involved for information about opportunities to collaborate on AI alignment projects.