New paper: “Incorrigibility in the CIRL Framework”



MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in Cooperative Inverse Reinforcement Learning (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.

The paper, titled “Incorrigibility in the CIRL Framework,” lays out four scenarios in which CIRL violates the four conditions for corrigibility defined in Soares et al. (2015). Abstract:

A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.

We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.

The paper is a response to a paper by Hadfield-Menell, Dragan, Abbeel, and Russell, “The Off-Switch Game.” Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and thinks that its human operator has more information about this reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game”, rely on strong assumptions about having an error-free CIRL system. With less idealized assumptions, a value learning agent may have beliefs that cause it to evade redirection from the human.
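The incentive described above can be illustrated with a toy calculation. This is a minimal sketch, assuming a simplified one-shot off-switch game in which the robot may act immediately (expected utility E[U]), switch itself off (utility 0), or defer to a fully rational human who permits the action only when U > 0 (expected utility E[max(U, 0)]). The function names and the sample-based representation of the robot's belief are illustrative assumptions, not code from either paper.

```python
def expected_values(belief_samples):
    """Expected utility of each option in a toy off-switch game.

    belief_samples: samples of U, the utility of the robot's proposed
    action, drawn from the robot's belief about its reward function.
    """
    n = len(belief_samples)
    act = sum(belief_samples) / n        # act immediately: E[U]
    shut_down = 0.0                      # switch off: utility 0
    # Defer: a rational human (assumed) permits the action only if U > 0.
    defer = sum(max(u, 0.0) for u in belief_samples) / n  # E[max(U, 0)]
    return {"act": act, "shut_down": shut_down, "defer": defer}


def incentive_to_defer(belief_samples):
    """How much the robot gains by deferring instead of its best unilateral option."""
    values = expected_values(belief_samples)
    return values["defer"] - max(values["act"], values["shut_down"])


# A robot that is genuinely uncertain about the sign of U gains by deferring.
uncertain_belief = [-1.0, 1.0, 2.0, -2.0]

# A robot whose (possibly misspecified) belief puts all mass on U > 0
# gains nothing by deferring, so the incentive to accept shutdown vanishes.
overconfident_belief = [1.0, 2.0, 3.0]

print(incentive_to_defer(uncertain_belief))
print(incentive_to_defer(overconfident_belief))
```

In this sketch the incentive is never negative, but it drops to zero as soon as the robot's belief rules out U ≤ 0. If that belief is the product of a programmer error, nothing in the calculation signals that deference is still warranted.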

[T]he purpose of a shutdown button is to shut the AI system down in the event that all other assurances failed, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of [the AI system] R have programmed the system so perfectly that the prior and [reward function] R are completely free of bugs, then the theorems of Hadfield-Menell et al. (2017) do apply. In practice, this means that in order to be corrigible, it would be necessary to have an AI system that was uncertain about all things that could possibly matter. The problem is that performing Bayesian reasoning over all possible worlds and all possible value functions is quite intractable. Realistically, humans will likely have to use a large number of heuristics and approximations in order to implement the system’s belief system and updating rules. […]

Soares et al. (2015) seem to want a shutdown button that works as a mechanism of last resort, to shut an AI system down in cases where it has observed and refused a programmer suggestion (and the programmers believe that the system is malfunctioning). Clearly, some part of the system must be working correctly in order for us to expect the shutdown button to work at all. However, it seems undesirable for the working of the button to depend on there being zero critical errors in the specification of the system’s prior, the specification of the reward function, the way it categorizes different types of actions, and so on. Instead, it is desirable to develop a shutdown module that is small and simple, with code that could ideally be rigorously verified, and which ideally works to shut the system down even in the event of large programmer errors in the specification of the rest of the system.

In order to do this in a value learning framework, we require a value learning system that (i) is capable of having its actions overridden by a small verified module that watches for shutdown commands; (ii) has no incentive to remove, damage, or ignore the shutdown module; and (iii) has some small incentive to keep its shutdown module around; even under a broad range of cases where R, the prior, the set of available actions, etc. are misspecified.

Even if the utility function is learned, there is still a need for additional lines of defense against unintended failures. The hope is that this can be achieved by modularizing the AI system. For that purpose, we would need a model of an agent that will behave corrigibly in a way that is robust to misspecification of other system components.
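As a rough illustration of requirement (i) only, the following sketch wraps an arbitrary policy in a small override that checks for a shutdown command before the policy is consulted. All names here (`with_shutdown_override`, `SHUTDOWN`, `buggy_policy`) are hypothetical and do not come from the paper. The sketch says nothing about requirements (ii) and (iii), which concern the agent's incentives toward the module and are the hard part of the problem.

```python
from typing import Callable

SHUTDOWN = "shutdown"  # hypothetical label for the shutdown action


def with_shutdown_override(
    policy: Callable[[str], str],
) -> Callable[[str, bool], str]:
    """Wrap a policy so that a shutdown request always takes precedence.

    The wrapper is deliberately tiny: it never inspects the policy's
    reward function or prior, so bugs there cannot change its behavior.
    """

    def guarded(observation: str, shutdown_requested: bool) -> str:
        if shutdown_requested:
            return SHUTDOWN
        return policy(observation)

    return guarded


def buggy_policy(observation: str) -> str:
    """Stand-in for a misspecified value learner that never chooses to stop."""
    return "continue"


agent = with_shutdown_override(buggy_policy)
print(agent("any observation", False))
print(agent("any observation", True))
```

The override works here only because the wrapped policy has no way to act on the wrapper itself. An agent able to modify or route around the module would need the incentive properties in (ii) and (iii) as well.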




August 2017 Newsletter


Research updates

General updates

News and links

July 2017 Newsletter


A number of major mid-year MIRI updates: we received our largest donation to date, $1.01 million from an Ethereum investor! Our research priorities have also shifted somewhat, reflecting the addition of four new full-time researchers (Marcello Herreshoff, Sam Eisenstat, Tsvi Benson-Tilsen, and Abram Demski) and the departure of Patrick LaVictoire and Jessica Taylor.

Research updates

General updates

News and links

Updates to the research team, and a major donation


We have several major announcements to make, covering new developments in the two months since our 2017 strategy update:

1. On May 30th, we received a surprise $1.01 million donation from an Ethereum cryptocurrency investor. This is the single largest contribution we have received to date by a large margin, and will have a substantial effect on our plans over the coming year.

2. Two new full-time researchers are joining MIRI: Tsvi Benson-Tilsen and Abram Demski. This comes in the wake of Sam Eisenstat and Marcello Herreshoff’s addition to the team in May. We’ve also begun working with engineers on a trial basis for our new slate of software engineer job openings.

3. Two of our researchers have recently left: Patrick LaVictoire and Jessica Taylor, researchers previously heading work on our “Alignment for Advanced Machine Learning Systems” research agenda.

For more details, see below.


June 2017 Newsletter


Research updates

General updates

News and links

May 2017 Newsletter


Research updates

General updates

  • Our strategy update discusses changes to our AI forecasts and research priorities, new outreach goals, a MIRI/DeepMind collaboration, and other news.
  • MIRI is hiring software engineers! If you’re a programmer who’s passionate about MIRI’s mission and wants to directly support our research efforts, apply here to trial with us.
  • MIRI Assistant Research Fellow Ryan Carey has taken on an additional affiliation with the Centre for the Study of Existential Risk, and is also helping edit an issue of Informatica on superintelligence.

News and links

2017 Updates and Strategy


In our last strategy update (August 2016), Nate wrote that MIRI’s priorities were to make progress on our agent foundations agenda and begin work on our new “Alignment for Advanced Machine Learning Systems” agenda, to collaborate and communicate with other researchers, and to grow our research and ops teams.

Since then, senior staff at MIRI have reassessed their views on how far off artificial general intelligence (AGI) is and concluded that shorter timelines are more likely than they were previously thinking. A few lines of recent evidence point in this direction, such as:[1]

  • AI research is becoming more visibly exciting and well-funded. This suggests that more top talent (in the next generation as well as the current generation) will probably turn their attention to AI.
  • AGI is attracting more scholarly attention as an idea, and is the stated goal of top AI groups like DeepMind, OpenAI, and FAIR. In particular, many researchers seem more open to thinking about general intelligence now than they did a few years ago.
  • Research groups associated with AGI are showing much clearer external signs of profitability.
  • AI successes like AlphaGo indicate that it’s easier to outperform top humans in domains like Go (without any new conceptual breakthroughs) than might have been expected.[2] This lowers our estimate for the number of significant conceptual breakthroughs needed to rival humans in other domains.

There’s no consensus among MIRI researchers on how long timelines are, and our aggregated estimate puts medium-to-high probability on scenarios in which the research community hasn’t developed AGI by, e.g., 2035. On average, however, research staff now assign moderately higher probability to AGI’s being developed before 2035 than we did a year or two ago. This has a few implications for our strategy:

1. Our relationships with current key players in AGI safety and capabilities play a larger role in our strategic thinking. Short-timeline scenarios reduce the expected number of important new players who will enter the space before we hit AGI, and increase how much influence current players are likely to have.

2. Our research priorities are somewhat different, since shorter timelines change what research paths are likely to pay out before we hit AGI, and also concentrate our probability mass more on scenarios where AGI shares various features in common with present-day machine learning systems.

Both updates represent directions we’ve already been trending in for various reasons.[3] However, we’re moving in these two directions more quickly and confidently than we were last year. As an example, Nate is spending less time on staff management and other administrative duties than in the past (having handed these off to MIRI COO Malo Bourgon) and less time on broad communications work (having delegated a fair amount of this to me), allowing him to spend more time on object-level research, research prioritization work, and more targeted communications.[4]

I’ll lay out what these updates mean for our plans in more concrete detail below.


  1. Note that this list is far from exhaustive. 
  2. Relatively general algorithms (plus copious compute) were able to surpass human performance on Go, going from incapable of winning against the worst human professionals in standard play to dominating the very best professionals in the space of a few months. The relevant development here wasn’t “AlphaGo represents a large conceptual advance over previously known techniques,” but rather “contemporary techniques run into surprisingly few obstacles when scaled to tasks as pattern-recognition-reliant and difficult (for humans) as professional Go”. 
  3. The publication of “Concrete Problems in AI Safety” last year, for example, caused us to reduce the time we were spending on broad-based outreach to the AI community at large in favor of spending more time building stronger collaborations with researchers we knew at OpenAI, Google Brain, DeepMind, and elsewhere. 
  4. Nate continues to set MIRI’s organizational strategy, and is responsible for the ideas in this post. 

Software Engineer Internship / Staff Openings


The Machine Intelligence Research Institute is looking for highly capable software engineers to directly support our AI alignment research efforts, with a focus on projects related to machine learning. We’re seeking engineers with strong programming skills who are passionate about MIRI’s mission and looking for challenging and intellectually engaging work.

While our goal is to hire full-time, we are initially looking for paid interns. Successful internships may then transition into staff positions.

About the Internship Program

The start time for interns is flexible, but we’re aiming for May or June. We will likely run several batches of internships, so if you are interested but unable to start in the next few months, do still apply. The length of the internship is flexible, but we’re aiming for 2–3 months.

Examples of the kinds of work you’ll do during the internship:

  • Replicate recent machine learning papers, and implement variations.
  • Learn about and implement machine learning tools (including results in the fields of deep learning, convex optimization, etc.).
  • Run various coding experiments and projects, either independently or in small groups.
  • Rapidly prototype, implement, and test AI alignment ideas related to machine learning (after demonstrating successes in the above points).

For MIRI, the benefit of this program is that it’s a great way to get to know you and assess you for a potential hire. For applicants, the benefits are that this is an excellent opportunity to get your hands dirty and level up your machine learning skills, and to get to the cutting edge of the AI safety field, with a potential to stay in a full-time engineering role after the internship concludes.

Our goal is to trial many more people than we expect to hire, so our threshold for keeping on engineers long-term as full staff will be higher than for accepting applicants to our internship.

The Ideal Candidate

Some qualities of the ideal candidate:

  • Extensive breadth and depth of programming skills. Machine learning experience is not required, though it is a plus.
  • Highly familiar with basic ideas related to AI alignment.
  • Able to work independently with minimal supervision, and in team/group settings.
  • Willing to accept a below-market rate. Since MIRI is a non-profit, we can’t compete with the Big Names in the Bay Area.
  • Enthusiastic about the prospect of working at MIRI and helping advance the field of AI alignment.
  • Not looking for a “generic” software engineering position.

Working at MIRI

We strive to make working at MIRI a rewarding experience.

  • Modern Work Spaces — Many of us have adjustable standing desks with large external monitors. We consider workspace ergonomics important, and try to rig up work stations to be as comfortable as possible. Free snacks, drinks, and meals are also provided at our office.
  • Flexible Hours — We don’t have strict office hours, and we don’t limit employees’ vacation days. Our goal is to make rapid progress on our research agenda, and we would prefer that staff take a day off than that they extend tasks to fill an extra day.
  • Living in the Bay Area — MIRI’s office is located in downtown Berkeley, California. From our office, you’re a 30-second walk to the BART (Bay Area Rapid Transit), which can get you around the Bay Area; a 3-minute walk to UC Berkeley campus; and a 30-minute BART ride to downtown San Francisco.

EEO & Employment Eligibility

MIRI is an equal opportunity employer. We are committed to making employment decisions based on merit and value. This commitment includes complying with all federal, state, and local laws. We desire to maintain a work environment free of harassment or discrimination due to sex, race, religion, color, creed, national origin, sexual orientation, citizenship, physical or mental disability, marital status, familial status, ethnicity, ancestry, status as a victim of domestic violence, age, or any other status protected by federal, state, or local laws.


If interested, click here to apply. For questions or comments, email Matt Graves (

Update (December 2017): We’re now putting less emphasis on finding interns and looking for highly skilled engineers available for full-time work. Updated job post here.