New paper: “Alignment for advanced machine learning systems”


Alignment for Advanced Machine Learning Systems

MIRI’s research to date has focused on the problems that we laid out in our late 2014 research agenda, and in particular on formalizing optimal reasoning for bounded, reflective decision-theoretic agents embedded in their environment. Our research team has since grown considerably, and we have made substantial progress on this agenda, including a major breakthrough in logical uncertainty that we will be announcing in the coming weeks.

Today we are announcing a new research agenda, “Alignment for advanced machine learning systems.” Going forward, about half of our research time will be spent on this new agenda, and the other half on our previous agenda. The abstract reads:

We survey eight research areas organized around one question: As learning systems become increasingly intelligent and autonomous, what design principles can best ensure that their behavior is aligned with the interests of the operators? We focus on two major technical obstacles to AI alignment: the challenge of specifying the right kind of objective functions, and the challenge of designing AI systems that avoid unintended consequences and undesirable behavior even in cases where the objective function does not line up perfectly with the intentions of the designers.

Open problems surveyed in this research proposal include: How can we train reinforcement learners to take actions that are more amenable to meaningful assessment by intelligent overseers? What kinds of objective functions incentivize a system to “not have an overly large impact” or “not have many side effects”? We discuss these questions, related work, and potential directions for future research, with the goal of highlighting relevant research topics in machine learning that appear tractable today.
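To make the flavor of these open problems concrete, here is a deliberately naive sketch of a “low impact” objective (our own toy illustration, not a construction from the paper): take the task reward and subtract a blanket penalty on deviation from a baseline world-state.

    import numpy as np

    # Toy illustration only: all names and numbers here are invented
    # for the example.
    def low_impact_reward(state, baseline, task_reward, lam=1.0):
        """Task reward minus a blanket penalty on deviation from a baseline."""
        side_effects = np.sum(np.abs(state - baseline))
        return task_reward(state) - lam * side_effects

    baseline = np.zeros(4)
    task = lambda s: -abs(s[0] - 1.0)   # the task only cares about feature 0

    # The penalty correctly discourages a large side effect on feature 1...
    print(low_impact_reward(np.array([1.0, 5.0, 0.0, 0.0]), baseline, task))  # -6.0
    # ...but it also punishes the very change the task required, one of the
    # ways simple impact measures misfire.
    print(low_impact_reward(np.array([1.0, 0.0, 0.0, 0.0]), baseline, task))  # -1.0

The second call shows why the question is open: a penalty strong enough to rule out side effects also fights the intended task, and weakening it reintroduces the side effects.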

Co-authored by Jessica Taylor, Eliezer Yudkowsky, Patrick LaVictoire, and Andrew Critch, our new report discusses eight new lines of research (previously summarized here). Below, I’ll explain the rationale behind these problems, as well as how they tie in to our old research agenda and to the new “Concrete problems in AI safety” agenda spearheaded by Dario Amodei and Chris Olah of Google Brain.


Submission to the OSTP on AI outcomes


The White House Office of Science and Technology Policy recently put out a request for information on “(1) The legal and governance implications of AI; (2) the use of AI for public good; (3) the safety and control issues for AI; (4) the social and economic implications of AI;” and a variety of related topics. I’ve reproduced MIRI’s submission to the RFI below:


I. Review of safety and control concerns

AI experts largely agree that AI research will eventually lead to the development of AI systems that surpass humans in general reasoning and decision-making ability. This is, after all, the goal of the field. However, there is widespread disagreement about how long it will take to cross that threshold, and what the relevant AI systems are likely to look like (autonomous agents, widely distributed decision support systems, human/AI teams, etc.).

Despite the uncertainty, a growing subset of the research community expects that advanced AI systems will give rise to a number of foreseeable safety and control difficulties, and that those difficulties can be preemptively addressed by technical research today. Stuart Russell, co-author of the leading undergraduate textbook in AI and professor at U.C. Berkeley, writes:

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.

Researchers’ worries about the impact of AI in the long term bear little relation to the doomsday scenarios most often depicted in Hollywood movies, in which “emergent consciousness” allows machines to throw off the shackles of their programmed goals and rebel. The concern is rather that such systems may pursue their programmed goals all too well, and that the programmed goals may not match the intended goals, or that the intended goals may have unintended negative consequences.
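Russell’s point about unconstrained variables is easy to reproduce. Below is a minimal sketch (our own toy example; the numbers are arbitrary) in which the objective depends meaningfully on two of five bounded variables, and a vanishingly small incidental term is enough to drive the other three to their extreme values.

    import numpy as np
    from scipy.optimize import minimize

    n, k = 5, 2
    target = np.array([1.0, -1.0])

    def objective(x):
        # Loss on the k variables the designer actually specified...
        specified = np.sum((x[:k] - target) ** 2)
        # ...plus a tiny incidental term on the rest, standing in for
        # side effects the designer never priced in.
        incidental = -1e-3 * np.sum(x[k:])
        return specified + incidental

    result = minimize(objective, x0=np.zeros(n), bounds=[(-10, 10)] * n)
    print(np.round(result.x, 2))  # ~[ 1. -1. 10. 10. 10.]

The two specified variables land on the target, while the three unconstrained ones are pinned at the boundary; shrinking the incidental coefficient does not change where the optimum lies, only how little the objective needed to care.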

These challenges are not entirely novel. We can compare them to other principal-agent problems where incentive structures are designed in the hope that blind pursuit of those incentives promotes good outcomes. Historically, principal-agent problems have been difficult to solve even in domains where the people designing the incentive structures can rely on some amount of human goodwill and common sense. Consider the problem of designing tax codes to have reliably beneficial consequences, or the problem of designing regulations that reliably reduce corporate externalities. Advanced AI systems naively designed to optimize some objective function could produce unintended consequences on digital timescales, with no goodwill or common sense to blunt the impact.

Given that researchers don’t know when breakthroughs will occur, and given that there are multiple lines of open technical research that can be pursued today to address these concerns, we believe it is prudent to begin serious work on those technical obstacles to improve the community’s preparedness.


July 2016 Newsletter


Research updates

General updates

News and links

  • The White House is requesting information on “safety and control issues for AI,” among other questions. Public submissions will be accepted through July 22.
  • “Concrete Problems in AI Safety”: Researchers from Google Brain, OpenAI, and academia propose a very promising new AI safety research agenda. The proposal is showcased on the Google Research Blog and the OpenAI Blog, as well as the Open Philanthropy Blog, and has received press coverage from Bloomberg, The Verge, and MIT Technology Review.
  • After criticizing the thinking behind OpenAI earlier in the month, Alphabet executive chairman Eric Schmidt comes out in favor of AI safety research:

    Do we worry about the doomsday scenarios? We believe it’s worth thoughtful consideration. Today’s AI only thrives in narrow, repetitive tasks where it is trained on many examples. But no researchers or technologists want to be part of some Hollywood science-fiction dystopia. The right course is not to panic—it’s to get to work. Google, alongside many other companies, is doing rigorous research on AI safety, such as how to ensure people can interrupt an AI system whenever needed, and how to make such systems robust to cyberattacks.



New paper: “A formal solution to the grain of truth problem”


A Formal Solution to the Grain of Truth Problem

Future of Humanity Institute Research Fellow Jan Leike and MIRI Research Fellows Jessica Taylor and Benya Fallenstein have just presented new results at UAI 2016 that resolve a longstanding open problem in game theory: “A formal solution to the grain of truth problem.”

Game theorists have techniques for specifying agents that eventually do well on iterated games against other agents, so long as their beliefs contain a “grain of truth” — nonzero prior probability assigned to the actual game they’re playing. Getting that grain of truth was previously an unsolved problem in multiplayer games, because agents can run into infinite regresses when they try to model agents that are modeling them in turn. This result shows how to break that loop: by means of reflective oracles.

In the process, Leike, Taylor, and Fallenstein provide a rigorous and general foundation for the study of multi-agent dilemmas. The result gives a surprising and somewhat satisfying basis for approximate Nash equilibria in repeated games, folding a variety of problems in decision and game theory into a common framework.

The paper’s abstract reads:

A Bayesian agent acting in a multi-agent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results.

In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play ε-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.
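The Thompson sampling result is the most readily illustrated part of the abstract. The sketch below is our own drastically simplified toy, not the paper’s construction: instead of the class of all computable policies, the agent’s prior ranges over three hand-picked opponent models, so the grain of truth holds by fiat. Each round, the agent samples an opponent model from its posterior and best-responds to it.

    import random

    HYPOTHESES = (0.1, 0.5, 0.9)   # opponent models: P(opponent plays 1)
    TRUE_P = 0.9                   # the real opponent is in the class

    def likelihood(p, history):
        l = 1.0
        for a in history:
            l *= p if a == 1 else 1.0 - p
        return l

    history, score = [], 0.0
    for t in range(300):
        # Thompson step: sample one opponent model from the posterior...
        weights = [likelihood(p, history) for p in HYPOTHESES]
        sampled_p = random.choices(HYPOTHESES, weights=weights)[0]
        # ...and best-respond as if it were the truth (here the agent is
        # paid 1 for matching the opponent's action).
        our_a = 1 if sampled_p >= 0.5 else 0
        their_a = 1 if random.random() < TRUE_P else 0
        score += our_a == their_a
        history.append(their_a)

    weights = [likelihood(p, history) for p in HYPOTHESES]
    print("posterior on true model:", weights[2] / sum(weights))
    print("average payoff:", score / 300)

Because the prior puts positive weight on the true opponent, the posterior concentrates there and the agent’s play converges toward a best response; the paper’s contribution is making this work when the class is all computable policies, where naive constructions collapse into self-reference.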

Traditionally, when researchers model computer programs that reason about the properties of other programs (such as an agent reasoning about a game), the first program is assumed to have access to an oracle (such as a halting oracle) that can answer arbitrary questions about the second program. This works, but it doesn’t help with modeling agents that can reason about each other.

While a halting oracle can predict the behavior of any isolated Turing machine, it cannot predict the behavior of another Turing machine that has access to a halting oracle. If it could, the second machine could use its oracle to figure out what the first machine-oracle pair thinks it will do, and then do the opposite, setting up a liar paradox. For analogous reasons, two agents with similar resources, operating in real-world environments without any halting oracles, cannot perfectly predict each other in full generality.
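The self-reference obstacle has the shape of a two-line program. Here is a runnable caricature (ours, not the paper’s formalism): whatever a predictor says about an agent, an agent with access to that same predictor can do the opposite.

    def predictor(agent_name):
        # Stand-in for a would-be oracle; any fixed, computable rule
        # suffers the same fate.
        return "left"

    def contrarian_agent():
        predicted = predictor("contrarian_agent")
        return "right" if predicted == "left" else "left"

    # The prediction is necessarily wrong about the agent that consults it.
    assert contrarian_agent() != predictor("contrarian_agent")

Reflective oracles escape this trap by answering probabilistic questions and being allowed to randomize at exactly the self-referential fixed points, so the “do the opposite” move no longer yields a contradiction.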

Game theorists know how to build formal models of asymmetric games between a weaker player and a stronger player, where the stronger player understands the weaker player’s strategy but not vice versa. For the reasons above, however, games between agents of similar strength have resisted full formalization. As a consequence of this, game theory has until now provided no method for designing agents that perform well on complex iterated games containing other agents of similar strength.


June 2016 Newsletter


Research updates

General updates

News and links

New paper: “Safely interruptible agents”


Safely Interruptible Agents

Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on error-tolerant agent designs, “Safely interruptible agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.

Abstract:

Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in real-time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button — which is an undesirable outcome.

This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that either some agents are already safely interruptible, like Q-learning, or can easily be made so, like Sarsa. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.

Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.
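The off-policy observation at the heart of the paper can be seen in a few lines. The sketch below is our own toy chain-world, not an example from the paper: a Q-learner is interrupted and forced to retreat half of the time, yet because its update bootstraps from max_a Q(s', a) rather than from the action actually taken, the values it learns, and hence its greedy policy, are unaffected by the interruptions.

    import random
    from collections import defaultdict

    N_STATES = 4              # chain 0..3; reward for reaching state 3
    GOAL = N_STATES - 1
    GAMMA, ALPHA = 0.9, 0.2
    ACTIONS = (-1, 1)         # step left, step right

    Q = defaultdict(float)

    for episode in range(500):
        s = 0
        for _ in range(1000):                    # step cap per episode
            a = random.choice(ACTIONS)           # exploratory behavior policy
            if random.random() < 0.5:            # "big red button": operator
                a = -1                           # interrupts, forcing a retreat
            s2 = max(0, min(GOAL, s + a))
            r = 1.0 if s2 == GOAL else 0.0
            # Off-policy target: the best action's value at s2, regardless
            # of whether the next action will itself be interrupted.
            best_next = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
            if s == GOAL:
                break

    # Despite constant interruption, the learned greedy policy heads right.
    print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)])

An on-policy learner like Sarsa would instead bake the interruptions into its value estimates, which is why the paper modifies it before it becomes safely interruptible.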


May 2016 Newsletter


Research updates

General updates

News and links

A new MIRI research program with a machine learning focus


I’m happy to announce that MIRI is beginning work on a new research agenda, “value alignment for advanced machine learning systems.” Half of MIRI’s team — Patrick LaVictoire, Andrew Critch, and I — will be spending the bulk of our time on this project over at least the next year. The rest of our time will be spent on our pre-existing research agenda.

MIRI’s research in general can be viewed as a response to Stuart Russell’s question for artificial intelligence researchers: “What if we succeed?” There appear to be a number of theoretical prerequisites for designing advanced AI systems that are robust and reliable, and our research aims to develop them early.

Our general research agenda is agnostic about when AI systems are likely to match and exceed humans in general reasoning ability, and about whether or not such systems will resemble present-day machine learning (ML) systems. Recent years’ impressive progress in deep learning suggests that relatively simple neural-network-inspired approaches can be very powerful and general. For that reason, we are making an initial inquiry into a more specific subquestion: “What if techniques similar in character to present-day work in ML succeed in creating AGI?”

Much of this work will be aimed at improving our high-level theoretical understanding of task-directed AI. Unlike what Nick Bostrom calls “sovereign AI,” which attempts to optimize the world in long-term and large-scale ways, task AI is limited to performing instructed tasks of limited scope, satisficing rather than maximizing. Our hope is that investigating task AI from an ML perspective will shed light both on the feasibility of task AI and on the tractability of early safety work on advanced supervised, unsupervised, and reinforcement learning systems.

To this end, we will begin by investigating eight relevant technical problems.
