New paper: “Risks from learned optimization”


Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant have a new paper out: “Risks from learned optimization in advanced machine learning systems.”

The paper’s abstract:

We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this paper.

We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.

The critical distinction presented in the paper is between what an AI system is optimized to do (its base objective) and what it actually ends up optimizing for (its mesa-objective), if it optimizes for anything at all. The authors are interested in when ML models will end up optimizing for something, as well as how the objective an ML model ends up optimizing for compares to the objective it was selected to achieve.

The distinction between the objective a system is selected to achieve and the objective it actually optimizes for isn’t new. Eliezer Yudkowsky has previously raised similar concerns in his discussion of optimization daemons, and Paul Christiano has discussed such concerns in “What failure looks like.”

The paper’s contents have also been released this week as a sequence on the AI Alignment Forum, cross-posted to LessWrong. As the authors note there:

We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we plan to present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems.

Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer to as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning.

 


 

June 2019 Newsletter


2018 in review


Our primary focus at MIRI in 2018 was twofold: research—as always!—and growth.

Thanks to the incredible support we received from donors the previous year, in 2018 we were able to aggressively pursue the plans detailed in our 2017 fundraiser post. The most notable goal we set was to “grow big and grow fast,” as our new research directions benefit a lot more from a larger team, and require skills that are a lot easier to hire for. To that end, we set a target of adding 10 new research staff by the end of 2019.

2018 therefore saw us accelerate the work we started in 2017, investing more in recruitment and shoring up the foundations needed for our ongoing growth. Since our 2017 fundraiser post, we’re up 3 new research staff, including noted Haskell developer Edward Kmett. I now think that we’re most likely to hit 6–8 hires by the end of 2019, though hitting 9–10 still seems quite possible to me, as we are still engaging with many promising candidates, and continue to meet more.

Overall, 2018 was a great year for MIRI. Our research continued apace, and our recruitment efforts increasingly paid dividends.

May 2019 Newsletter


New paper: “Delegative reinforcement learning”


MIRI Research Associate Vanessa Kosoy has written a new paper, “Delegative reinforcement learning: Learning to avoid traps with a little help.” Kosoy will be presenting the paper at the ICLR 2019 SafeML workshop in two weeks. The abstract reads:

Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning).

The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions.

The goal of Kosoy’s work on DRL is to put us on a path toward a deep understanding of human-in-the-loop learning systems with formal performance guarantees, including safety guarantees. DRL tries to move us in this direction by providing models in which such performance guarantees can be derived.

While these models still make many unrealistic simplifying assumptions, Kosoy views DRL as already capturing some of the most essential features of the problem—and she has a fairly ambitious vision of how this framework might be further developed.
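The delegation idea from the abstract can be illustrated with a small toy sketch. This is not the algorithm from the paper: the `Hypothesis` structure, the `risk_tolerance` threshold, and the delegation rule are all assumptions made up for illustration. The sketch only shows the general shape of posterior sampling combined with a check that hands an action to an advisor when the agent cannot rule out a trap.

```python
import random
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One candidate model of the environment (hypothetical toy structure)."""
    rewards: dict  # action -> expected reward if this hypothesis is true
    traps: set     # actions this hypothesis treats as irreversible traps


def trap_probability(action, hypotheses, posterior):
    """Posterior mass on hypotheses that consider `action` a trap."""
    return sum(p for h, p in zip(hypotheses, posterior) if action in h.traps)


def act(hypotheses, posterior, advisor, rng, risk_tolerance=0.01):
    """Return (action, delegated) for a single step."""
    # Posterior sampling: draw one hypothesis and act greedily under it.
    sampled = rng.choices(hypotheses, weights=posterior, k=1)[0]
    proposed = max(sampled.rewards, key=sampled.rewards.get)
    # Assumed delegation rule: if the proposed action might be a trap,
    # let the advisor choose the action instead.
    if trap_probability(proposed, hypotheses, posterior) > risk_tolerance:
        return advisor(), True
    return proposed, False


def advisor():
    """Hypothetical advisor that is assumed never to enter a trap."""
    return "right"


safe_world = Hypothesis(rewards={"left": 1.0, "right": 0.2}, traps=set())
trap_world = Hypothesis(rewards={"left": -10.0, "right": 0.2}, traps={"left"})
worlds = [safe_world, trap_world]

rng = random.Random(0)
# While the trap hypothesis is still plausible, "left" is never taken directly.
uncertain = [act(worlds, [0.7, 0.3], advisor, rng) for _ in range(20)]
# Once the trap hypothesis is ruled out, the agent acts on its own.
confident = act(worlds, [1.0, 0.0], advisor, rng)
print(uncertain[:3], confident)
```

In this toy setup the agent only takes the risky action once the posterior puts no weight on the hypothesis that calls it a trap. Until then, each step where it would have proposed that action is delegated.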

Kosoy previously described DRL in the post Delegative Reinforcement Learning with a Merely Sane Advisor. One feature of DRL Kosoy described there but omitted from the paper (for space reasons) is DRL’s application to corruption. Given certain assumptions, DRL ensures that a formal agent will never have its reward or advice channel tampered with (corrupted). As a special case, the agent’s own advisor cannot cause the agent to enter a corrupt state. Similarly, the general protection from traps described in “Delegative reinforcement learning” also protects the agent from harmful self-modifications.

Another set of DRL results that didn’t make it into the paper is Catastrophe Mitigation Using DRL. In this variant, a DRL agent can mitigate catastrophes that the advisor would not be able to mitigate on its own, something that isn’t supported by the stricter assumptions about the advisor in standard DRL.
 


 

April 2019 Newsletter


New grants from the Open Philanthropy Project and BERI


I’m happy to announce that MIRI has received two major new grants, one from the Open Philanthropy Project and one from BERI.

The Open Philanthropy Project’s grant was awarded as part of the first round of grants recommended by their new committee for effective altruism support:

We are experimenting with a new approach to setting grant sizes for a number of our largest grantees in the effective altruism community, including those who work on long-termist causes. Rather than have a single Program Officer make a recommendation, we have created a small committee, comprised of Open Philanthropy staff and trusted outside advisors who are knowledgeable about the relevant organizations. […] We average the committee members’ votes to arrive at final numbers for our grants.

The Open Philanthropy Project’s grant is separate from the three-year $3.75 million grant they awarded us in 2017, the third $1.25 million disbursement of which is still scheduled for later this year. This new grant increases the Open Philanthropy Project’s total support for MIRI from $1.4 million[1] in 2018 to ~$2.31 million in 2019, but doesn’t reflect any decision about how much total funding MIRI might receive from Open Phil in 2020 (beyond the fact that it will be at least ~$1.06 million).

Going forward, the Open Philanthropy Project currently plans to determine the size of any potential future grants to MIRI using the above committee structure.

We’re very grateful for this increase in support from BERI and the Open Philanthropy Project—both organizations that already numbered among our largest funders of the past few years. We expect these grants to play an important role in our decision-making as we continue to grow our research team in the ways described in our 2018 strategy update and fundraiser posts.

  1. The $1.4 million counts the Open Philanthropy Project’s $1.25 million disbursement in 2018, as well as a $150,000 AI Safety Retraining Program grant to MIRI.

March 2019 Newsletter
