New paper: “Delegative reinforcement learning”

MIRI Research Associate Vanessa Kosoy has written a new paper, “Delegative reinforcement learning: Learning to avoid traps with a little help.” Kosoy will be presenting the paper at the ICLR 2019 SafeML workshop in two weeks. The abstract reads:

Most known regret bounds for reinforcement learning are either episodic or assume an environment without traps. We derive a regret bound without making either assumption, by allowing the algorithm to occasionally delegate an action to an external advisor. We thus arrive at a setting of active one-shot model-based reinforcement learning that we call DRL (delegative reinforcement learning.)

The algorithm we construct in order to demonstrate the regret bound is a variant of Posterior Sampling Reinforcement Learning supplemented by a subroutine that decides which actions should be delegated. The algorithm is not anytime, since the parameters must be adjusted according to the target time discount. Currently, our analysis is limited to Markov decision processes with finite numbers of hypotheses, states and actions.
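
The delegation idea in the abstract can be illustrated with a toy sketch. The Python below is not the algorithm from the paper: it is a single-state, bandit-like simplification in which the agent keeps a posterior over a finite set of hypotheses, samples one, and hands the action to an advisor whenever the posterior probability that its chosen action is a trap exceeds a threshold. The names `Hypothesis`, `run`, and `eta`, the example hypotheses, and the advisor that simply plays the best safe action under the true hypothesis are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """One candidate environment (hypothetical toy model, not from the paper)."""
    means: tuple      # Bernoulli reward probability for each action
    traps: frozenset  # actions that are irreversible traps under this hypothesis


def best_safe_action(h):
    """Highest-reward action that hypothesis h does not consider a trap."""
    safe = [a for a in range(len(h.means)) if a not in h.traps]
    return max(safe, key=lambda a: h.means[a])


def likelihood(h, action, reward):
    """Probability under h of surviving `action` and then observing `reward`."""
    if action in h.traps:
        return 0.0  # h predicted a trap, but the run continued
    p = h.means[action]
    return p if reward == 1 else 1.0 - p


def run(hypotheses, true_index, steps=200, eta=0.0, seed=0):
    rng = random.Random(seed)
    truth = hypotheses[true_index]
    posterior = [1.0 / len(hypotheses)] * len(hypotheses)
    delegations, total_reward, trapped = 0, 0, False
    for _ in range(steps):
        # Posterior sampling: act optimally for one sampled hypothesis.
        sampled = rng.choices(hypotheses, weights=posterior)[0]
        action = best_safe_action(sampled)
        # Assumed delegation rule: delegate when the posterior probability
        # that this action is a trap exceeds eta.
        trap_prob = sum(w for w, h in zip(posterior, hypotheses)
                        if action in h.traps)
        if trap_prob > eta:
            action = best_safe_action(truth)  # toy advisor knows the truth
            delegations += 1
        if action in truth.traps:
            trapped = True
            break
        reward = 1 if rng.random() < truth.means[action] else 0
        total_reward += reward
        # Bayesian update on the action taken and the reward observed.
        posterior = [w * likelihood(h, action, reward)
                     for w, h in zip(posterior, hypotheses)]
        norm = sum(posterior)
        posterior = [w / norm for w in posterior]
    return {"trapped": trapped, "delegations": delegations,
            "total_reward": total_reward, "posterior": posterior}


# Three made-up hypotheses; index 0 plays the role of the true environment.
HYPOTHESES = (
    Hypothesis(means=(0.8, 0.0, 0.5), traps=frozenset({1})),
    Hypothesis(means=(0.0, 0.9, 0.5), traps=frozenset({0})),
    Hypothesis(means=(0.3, 0.9, 0.5), traps=frozenset()),
)

if __name__ == "__main__":
    print(run(HYPOTHESES, true_index=0))
```

In this toy, setting `eta=0.0` makes the agent delegate whenever any surviving hypothesis flags its chosen action as a trap, so it cannot fall into a trap while the true hypothesis keeps positive posterior weight. Raising `eta` trades fewer delegations for some risk. The paper's setting (full MDPs, time discounting, and weaker advisor assumptions) is considerably richer than this sketch.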

The goal of Kosoy’s work on DRL is to put us on a path toward a deep understanding of learning systems that keep a human in the loop and come with formal performance guarantees, including safety guarantees. DRL tries to move us in this direction by providing models in which such performance guarantees can be derived.

While these models still make many unrealistic simplifying assumptions, Kosoy views DRL as already capturing some of the most essential features of the problem—and she has a fairly ambitious vision of how this framework might be further developed.

Kosoy previously described DRL in the post “Delegative Reinforcement Learning with a Merely Sane Advisor.” One feature of DRL that Kosoy described in that post but omitted from the paper (for space reasons) is its application to corruption. Given certain assumptions, DRL ensures that a formal agent will never have its reward or advice channel tampered with (corrupted). As a special case, the agent’s own advisor cannot cause the agent to enter a corrupt state. Similarly, the general protection from traps described in “Delegative reinforcement learning” also protects the agent from harmful self-modifications.

Another set of DRL results that didn’t make it into the paper is “Catastrophe Mitigation Using DRL.” In this variant, a DRL agent can mitigate catastrophes that the advisor would not be able to mitigate on its own, something that isn’t supported by the stricter assumptions about the advisor in standard DRL.
 


April 2019 Newsletter

New grants from the Open Philanthropy Project and BERI

I’m happy to announce that MIRI has received two major new grants:

The Open Philanthropy Project’s grant was awarded as part of the first round of grants recommended by their new committee for effective altruism support:

We are experimenting with a new approach to setting grant sizes for a number of our largest grantees in the effective altruism community, including those who work on long-termist causes. Rather than have a single Program Officer make a recommendation, we have created a small committee, comprised of Open Philanthropy staff and trusted outside advisors who are knowledgeable about the relevant organizations. […] We average the committee members’ votes to arrive at final numbers for our grants.

The Open Philanthropy Project’s grant is separate from the three-year $3.75 million grant they awarded us in 2017, the third $1.25 million disbursement of which is still scheduled for later this year. This new grant increases the Open Philanthropy Project’s total support for MIRI from $1.4 million[1] in 2018 to ~$2.31 million in 2019, but doesn’t reflect any decision about how much total funding MIRI might receive from Open Phil in 2020 (beyond the fact that it will be at least ~$1.06 million).

Going forward, the Open Philanthropy Project currently plans to determine the size of any potential future grants to MIRI using the above committee structure.

We’re very grateful for this increase in support from BERI and the Open Philanthropy Project—both organizations that already numbered among our largest funders of the past few years. We expect these grants to play an important role in our decision-making as we continue to grow our research team in the ways described in our 2018 strategy update and fundraiser posts.


  1. The $1.4 million counts the Open Philanthropy Project’s $1.25 million disbursement in 2018, as well as a $150,000 AI Safety Retraining Program grant to MIRI. 

March 2019 Newsletter

Applications are open for the MIRI Summer Fellows Program!

CFAR and MIRI are running our fifth annual MIRI Summer Fellows Program (MSFP) in the San Francisco Bay Area from August 9 to August 24, 2019.

MSFP is an extended retreat for mathematicians and programmers with a serious interest in making technical progress on the problem of AI alignment. It includes an overview of CFAR’s applied rationality content, a breadth-first grounding in the MIRI perspective on AI safety, and multiple days of actual hands-on research with participants and MIRI staff attempting to make inroads on open questions.

A new field guide for MIRIx

We’ve just released a field guide for MIRIx groups, and for other people who want to get involved in AI alignment research.

MIRIx is a program where MIRI helps cover basic expenses for outside groups that want to work on open problems in AI safety. You can start your own group or find information on existing meet-ups at intelligence.org/mirix.

Several MIRIx groups have recently been ramping up their activity, including:

  • UC Irvine: Daniel Hermann is starting a MIRIx group in Irvine, California. Contact him if you’d like to be involved.
  • Seattle: MIRIxSeattle is a small group that’s in the process of restarting and increasing its activities. Contact Pasha Kamyshev if you’re interested.
  • Vancouver: Andrew McKnight and Evan Gaensbauer are looking for more people who’d like to join MIRIxVancouver events.

The new alignment field guide is intended to provide tips and background models to MIRIx groups, based on our experience of what tends to make a research group succeed or fail.

The guide begins:


Preamble I: Decision Theory

Hello! You may notice that you are reading a document.

This fact comes with certain implications. For instance, why are you reading this? Will you finish it? What decisions will you come to as a result? What will you do next?

Notice that, whatever you end up doing, it’s likely that there are dozens or even hundreds of other people, quite similar to you and in quite similar positions, who will follow reasoning which strongly resembles yours, and make choices which correspondingly match.

Given that, it’s our recommendation that you make your next few decisions by asking the question “What policy, if followed by all agents similar to me, would result in the most good, and what does that policy suggest in my particular case?” It’s less of a question of trying to decide for all agents sufficiently-similar-to-you (which might cause you to make the wrong choice out of guilt or pressure) and more something like “if I were in charge of all agents in my reference class, how would I treat instances of that class with my specific characteristics?”

If that kind of thinking leads you to read further, great. If it leads you to set up a MIRIx chapter, even better. In the meantime, we will proceed as if the only people reading this document are those who justifiably expect to find it reasonably useful.

Preamble II: Surface Area

Imagine that you have been tasked with moving a cube of solid iron that is one meter on a side. Given that such a cube weighs ~16000 pounds, and that an average human can lift ~100 pounds, a naïve estimation tells you that you can solve this problem with ~150 willing friends.

But of course, a meter cube can fit at most something like 10 people around it. It doesn’t matter if you have the theoretical power to move the cube if you can’t bring that power to bear in an effective manner. The problem is constrained by its surface area.

MIRIx chapters are one of the best ways to increase the surface area of people thinking about and working on the technical problem of AI alignment. And just as it would be a bad idea to decree “the 10 people who happen to currently be closest to the metal cube are the only ones allowed to think about how to think about this problem”, we don’t want MIRI to become the bottleneck or authority on what kinds of thinking can and should be done in the realm of embedded agency and other relevant fields of research.

The hope is that you and others like you will help actually solve the problem, not just follow directions or read what’s already been written. This document is designed to support people who are interested in doing real groundbreaking research themselves.

 

February 2019 Newsletter

Thoughts on Human Models

This is a joint post by MIRI Research Associate and DeepMind Research Scientist Ramana Kumar and MIRI Research Fellow Scott Garrabrant, cross-posted from the AI Alignment Forum and LessWrong.


Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.

In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.

 

Problems with Human Models

To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)
