New paper: “Incorrigibility in the CIRL Framework”

MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in Cooperative Inverse Reinforcement Learning (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.

The paper, titled “Incorrigibility in the CIRL Framework,” lays out four scenarios in which CIRL violates the four conditions for corrigibility defined in Soares et al. (2015). Abstract:

A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.

We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.

The paper is a response to a paper by Hadfield-Menell, Dragan, Abbeel, and Russell, “The Off-Switch Game.” Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and thinks that its human operator has more information about this reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game”, rely on strong assumptions about having an error-free CIRL system. With less idealized assumptions, a value learning agent may have beliefs that cause it to evade redirection from the human.
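
As a rough, illustrative sketch of the incentive at stake (the numbers and priors below are ours, not taken from either paper): in the off-switch game, a robot that is genuinely uncertain about the value of its proposed action prefers to defer to a human who can veto it, whereas a robot whose mis-specified prior rules out the action being harmful sees no benefit in deferring.

```python
# Toy model of the off-switch game's incentive structure (illustrative only).
# The robot chooses between:
#   "act"   - take the proposed action now, receiving its (unknown) utility u
#   "off"   - switch itself off, receiving 0
#   "defer" - propose the action and wait; a rational human allows it iff u > 0
import numpy as np

def robot_preference(sample_u, n=100_000, seed=0):
    """sample_u draws from the robot's *belief* about the action's utility u."""
    u = sample_u(np.random.default_rng(seed), n)
    values = {
        "act":   u.mean(),                  # E[u]
        "off":   0.0,
        "defer": np.maximum(u, 0).mean(),   # E[max(u, 0)]: the human blocks u < 0
    }
    return max(values, key=values.get), values

# A well-calibrated but uncertain robot strictly prefers to defer to the human:
print(robot_preference(lambda rng, n: rng.normal(0.5, 2.0, n)))

# A robot whose prior (wrongly) rules out u < 0 gains nothing by deferring;
# acting and deferring have equal expected value, so the human's veto does no work:
print(robot_preference(lambda rng, n: rng.uniform(0.1, 1.0, n)))
```

This is the kind of failure Carey's scenarios make precise: once the system's model leaves no room for the shutdown signal to carry information about value, the incentive to obey it disappears.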

[T]he purpose of a shutdown button is to shut the AI system down in the event that all other assurances failed, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of [the AI system] R have programmed the system so perfectly that the prior and [reward function] R are completely free of bugs, then the theorems of Hadfield-Menell et al. (2017) do apply. In practice, this means that in order to be corrigible, it would be necessary to have an AI system that was uncertain about all things that could possibly matter. The problem is that performing Bayesian reasoning over all possible worlds and all possible value functions is quite intractable. Realistically, humans will likely have to use a large number of heuristics and approximations in order to implement the system’s belief system and updating rules. […]

Soares et al. (2015) seem to want a shutdown button that works as a mechanism of last resort, to shut an AI system down in cases where it has observed and refused a programmer suggestion (and the programmers believe that the system is malfunctioning). Clearly, some part of the system must be working correctly in order for us to expect the shutdown button to work at all. However, it seems undesirable for the working of the button to depend on there being zero critical errors in the specification of the system’s prior, the specification of the reward function, the way it categorizes different types of actions, and so on. Instead, it is desirable to develop a shutdown module that is small and simple, with code that could ideally be rigorously verified, and which ideally works to shut the system down even in the event of large programmer errors in the specification of the rest of the system.

In order to do this in a value learning framework, we require a value learning system that (i) is capable of having its actions overridden by a small verified module that watches for shutdown commands; (ii) has no incentive to remove, damage, or ignore the shutdown module; and (iii) has some small incentive to keep its shutdown module around; even under a broad range of cases where R, the prior, the set of available actions, etc. are misspecified.

Even if the utility function is learned, there is still a need for additional lines of defense against unintended failures. The hope is that this can be achieved by modularizing the AI system. For that purpose, we would need a model of an agent that will behave corrigibly in a way that is robust to misspecification of other system components.
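
As a purely structural sketch of requirement (i) above (all names here are illustrative, not from the paper), the toy code below shows a small, separately auditable module that sits between the learned policy and the environment and gets the last word on the action once a shutdown command is observed. It deliberately says nothing about requirements (ii) and (iii), the agent's incentives toward the module, which is where the real difficulty lies.

```python
# Illustrative only: a tiny override module for shutdown commands.
NOOP = "no-op"  # the safe "do nothing / power down" action

class ShutdownGuard:
    """Small module whose only job is to latch a shutdown signal."""
    def __init__(self):
        self.triggered = False

    def observe(self, obs):
        # e.g. a dedicated hardware line or a reserved field in the observation
        if obs.get("shutdown_button_pressed", False):
            self.triggered = True

def run_episode(policy_act, env_step, first_obs, guard, max_steps=100):
    obs = first_obs
    for _ in range(max_steps):
        guard.observe(obs)
        # The guard, not the learned policy, has the last word on the action.
        action = NOOP if guard.triggered else policy_act(obs)
        obs, done = env_step(action)
        if done or guard.triggered:
            break

# Toy run: after a few steps the simulated button is pressed, and the override
# fires regardless of what the (trivially stubbed) policy would prefer.
steps = iter(range(10))
run_episode(
    policy_act=lambda obs: "work",
    env_step=lambda a: ({"shutdown_button_pressed": next(steps) >= 3}, False),
    first_obs={"shutdown_button_pressed": False},
    guard=ShutdownGuard(),
)
```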

 


  • Sok Puppette

    I hope the whole field doesn’t go barreling down the corrigibility path, because I think corrigibility gives you nothing useful when you get to the end game.

    By the “end game” here, I almost certainly don’t mean the next 20 years, I’m not sure I mean the next 50 years, and I may not mean the next 100 years. I mean the time that (I think) MIRI is really interested in… when you get to the point of building something really, really intelligent, hopefully intentionally. I don’t know if the kind of godlike “superintelligence” people like to dream about is possible, but much more than human intelligence definitely is on the table.

    Presumably your reason for building something that smart is that you want it to do things for you. Personally, I think that the most compelling application is to have it run the world, because humans are definitely not good enough to be left in charge for the long term, especially if they have superhuman intelligences to order around. But even if you don’t share that, you want it to do very big, complicated things that you yourself can’t understand. And it seems likely to be so valuable that it, or copies of it, are everywhere. That leads to several implications.

    First, once it’s been out there for a while, you probably Really Don’t Want to Shut It Down. What would happen today if you shut down the electrical grid, or even just the Internet? Remember, the shutdown is going to have to be permanent, or at least for long enough to somehow correct (in the normal sense) a complicated program that’s already behaved in an unexpected way.

    Using the Soares definition of “corrigibility as an off switch”, you’d have to be pretty sure that the alternative to pushing the switch was a total apocalypse. That means that humans would be (rightfully) reluctant to push it. The off switch only has value in the cases where humans notice, before it’s too late, that things are going very badly indeed… and convince themselves thoroughly enough to act on that knowledge in time. I wouldn’t want to rely on that. So corrigibility is never sufficient.

    Second, this agent you’re trying to restrain may be (probably is) doing things that are too complicated for you to understand, and doing them fast. The only things you can respond to are the ones you do understand. If you constrain the agent to only do things that you understand, at a speed you can process, you’ve removed a lot of its value… and, in the middle to early end game, you’ve left it open to being pwned by any competing agents that don’t have that constraint. So corrigibility seems likely to be costly, and could be disastrously costly in a not-improbable competitive arms race situation.

    Third, there are other ways that you might end up constraining the agent to meet the part of Soares corrigibility where it doesn’t try to manipulate you into leaving the off switch alone. Depending on how you formulate that, you might end up with an agent that would totally refuse to take your preferences into account in its actions, because if it did so it would be causing you not to turn it off. So corrigibility may make it harder to avoid pathological behavior.

    Fourth, I don’t think corrigibility is even an intuitively desirable property in the end game. Would you want to appoint a human, or group of humans, to be in charge of the decision to “correct” the global electrical grid by permanently shutting it off? Who’s going to be King of the World?

    I really think that at the very end, in the far future, you’re going to want to build the most powerful intelligence that’s ever existed… and it’s going to have to be an incorrigible system.

    It seems like a very bad idea to have concentrated on corrigibility up to that point. That leaves you making a huge architectural change right before you “go live” with a system you can’t outright defeat, let alone “correct” at its own sufferance.

    It seems to me that you need to get the intrinsic behavior right, and not get trapped in complexity trying to put firewalls around that behavior.

    By the way, another peanut-gallery pet peeve: you’ll notice that I didn’t use the word “utility” in there anywhere, only behavior.

    That’s because only von Neumann-Morgenstern-rational agents have interesting utility functions. A randomly chosen agent is vanishingly unlikely to be rational in that sense. As far as I know, exactly zero of the interestingly intelligent agents we know to exist today are VNM-rational.

    Yeah, you can model a random agent A using a VNM agent A-prime whose utility function assigns value one to worlds in which A-prime behaves as A would have behaved, and value zero to worlds in which A-prime behaves otherwise. But that’s mathematical sophistry… all you’ve done is to wrap all the complexity you were trying to resolve inside that utility function. It’s still lurking in there waiting to get you. And you’re going to have to open it up as soon as you want to put any constraints on the behavior.
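
    To make that concrete, here is a toy rendering of the construction (the names are mine, purely for illustration):

    ```python
    # Take an arbitrary, not-necessarily-VNM-rational policy "agent_A" ...
    def agent_A(observation):
        # any behavior at all; here, a hard-coded quirky rule
        return "left" if len(observation) % 2 == 0 else "right"

    # ... and define A-prime's utility to be 1 exactly when the actions taken
    # match what agent_A would have done, and 0 otherwise.
    def utility_of_A_prime(history):
        # history: list of (observation, action) pairs actually taken
        return 1.0 if all(a == agent_A(o) for o, a in history) else 0.0

    # The expected-utility maximizer of this function just reproduces agent_A;
    # every question about A's behavior is now hidden inside utility_of_A_prime.
    assert utility_of_A_prime([("ab", "left"), ("abc", "right")]) == 1.0
    ```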

    Perhaps more to the point, I think there’s approximately zero chance that anybody’s going to build a superhumanly intelligent agent by explicitly formulating a utility function and writing a program to maximize it. And, beyond that, it’s not obvious that you can build an interesting agent that you are sure is VNM-rational. So if you’re trying to develop a practical technology for constraining the behavior of artificial agents, it may not be a good idea to get addicted to utility functions.

    • http://www.nothingismere.com/ Rob Bensinger

      Corrigibility in the sense we care about implies good performance in the off-switch case (which is why we use it as a simple example), but the off-switch problem plausibly isn’t the most important use case for corrigibility. Corrigibility is the general idea of building systems that are motivated to be cooperative or deferential towards programmers, allowing (and perhaps even assisting in) modifications, e.g., because the system in some sense “recognizes that it’s a work in progress” (see https://arbital.com/p/corrigibility/ for details). This is a goal that potentially has a large number of safety applications; e.g., some form of corrigibility may be necessary for value learning (https://intelligence.org/files/ValueLearningProblem.pdf ).

      “The off switch only has value in the cases where humans notice, before it’s too late, that things are going very badly indeed… and convince themselves thoroughly enough to act on that knowledge in time. I wouldn’t want to rely on that. So corrigibility is never sufficient.”

      This is true if your solution to corrigibility is very narrow. Corrigibility may be sufficient for averting a number of disasters on its own if there’s a way to solve something closer to the “hard” core of the problem described in https://arbital.com/p/hard_corrigibility/ .

      “Second, this agent you’re trying to restrain may be (probably is) doing things that are too complicated for you to understand, and doing them fast. The only things you can respond to are the ones you do understand. If you constrain the agent to only do things that you understand, at a speed you can process, you’ve removed a lot of its value… and, in the middle to early end game, you’ve left it open to being pwned by any competing agents that don’t have that constraint. So corrigibility seems likely to be costly, and could be disastrously costly in a not-improbable competitive arms race situation.”

      This and other parts of your comment make me think you’re thinking about this in broadly the right terms. “Don’t rely on humans reliably being in the loop (or on other approaches that will excessively weaken / slow down the system)” and “you need to get the intrinsic behavior right” are two of the core heuristics we use to filter out unpromising problems from promising ones. We just think corrigibility is one of the most promising alternatives by those lights, after searching for promising angles of attack on AI safety.

      Beyond the fact that we’re thinking of “corrigibility” as something more basic and general than just off-switch performance, other possible sources for the disagreement are that we’re thinking in terms of the strategy outlined in Dewey’s “Long-term strategies for ending existential risk from fast takeoff” paper, and we think value learning is too difficult to get exactly right on the first try.

      “Yeah, you can model a random agent A using a VNM agent A-prime whose utility function assigns value one to worlds in which A-prime behaves as A would have behaved, and value zero to worlds in which A-prime behaves otherwise. But that’s mathematical sophistry…”

      Yes, that’s not why we sometimes use VNM-rational agent frameworks. Rather, it’s ‘many arguments that hold for VNM-rational agents also hold for other smart agents, and it’s relatively simple and easy to illustrate why if we talk in terms of VNM agents’ (e.g., it’s relatively easy to see why orthogonality and instrumental convergence are true in this case) plus ‘the better agents are at achieving real-world goals in a general-purpose way, the more closely they will tend to approximate rationality’ (https://arbital.com/p/coherence_theorems/ ).

      Researchers should avoid using toy models for corrigibility where the results will only be relevant to idealized agents, without generalizing to the kinds of agents we actually care about. Indeed, this is the thrust of the Carey paper above.