New paper: “Corrigibility”


Today we release a paper describing a new problem area in Friendly AI research that we call corrigibility. The report (PDF) is co-authored by MIRI’s Friendly AI research team (Eliezer Yudkowsky, Benja Fallenstein, Nate Soares) and by Stuart Armstrong from the Future of Humanity Institute at Oxford University.

The abstract reads:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system “corrigible” if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.
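
The flavor of the open problem can be seen in a toy calculation. Below is a minimal sketch (my own illustration with made-up numbers, not code from the paper) of why naively switching between a “normal” utility function and a “shutdown” utility function gives the agent an incentive to manipulate its own shutdown button:

```python
# Toy sketch (not from the paper) of why naively switching between a
# "normal" utility function U_N and a "shutdown" utility function U_S
# leaves the agent with an incentive to manipulate its own shutdown button.

def combined_utility(button_pressed, u_n, u_s):
    """Naive combination: score by U_S if the button was pressed, else U_N."""
    return u_s if button_pressed else u_n

# Hypothetical expected utilities of the two branches under the agent's model.
E_UN = 10.0  # expected U_N if the agent keeps running
E_US = 3.0   # expected U_S if it shuts down

# Three possible policies toward the button, scored with the naive combination.
policies = {
    "leave_button_alone":  0.5 * combined_utility(False, E_UN, E_US)
                         + 0.5 * combined_utility(True, E_UN, E_US),
    "disable_button":      combined_utility(False, E_UN, E_US),
    "press_button_itself": combined_utility(True, E_UN, E_US),
}

best = max(policies, key=policies.get)
print(best)  # -> disable_button (it would press the button itself if E_US > E_UN)
```

Unless the expected utilities of the two branches happen to be exactly balanced, the agent prefers to force whichever branch scores higher; the paper analyzes attempts to remove that incentive without breaking the shutdown behavior itself.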

This paper was accepted to the AI & Ethics workshop at AAAI-2015.

Update: The slides for Nate Soares’ presentation at AAAI-15 are available here.

  • Common Law

    First, just a couple of benign comments: With it there, might evolution reverse? Might we outdo nature in the creation of new and powerful buttons that allow our own self-destruction to continue unabated?

    So, is this button like the g-spot then? Do we need to be adept at first getting to the lesser-buttons of the robo-nipples and robo-clitoris, in order to coax the cranky PMS-addled artilect away from the “nuke humanity” button? Can just anyone press the button, or only the scientist who designed it? Maybe there’d be some unique way it would need to be fiddled with, so that any old human who wants to “pre-emptively save humanity by dominating the errant will of an ultra-intelligence” will be foiled in his (or her) attempt. No, you’d need to be good at button-fiddling in order to abort the “nuke-all-humans” command.

    Is the new code (that gets “inserted” post-button-pressing) analogous to a nice soft glob of testosterone-oozing semen, modulating the anger with a curved spike of testosterone in the bloodstream, modulating the “just get rid of humanity” bad-bad-thoughts?

    …Also, evolution lets dumb aggressive male chimps get close enough to the woman (arguably more intelligent, with more emotional intelligence) to reproduce by using stupid courtship tricks. (One would hope that “an ultraintelligent machine” would see past the “You’re so pretty” or “…gaping cuckoo’s maw is irresistible!” approaches.) Why would an artilect let us get close enough to press the button(s)? (…Males don’t really fare well in this analogy either, since evolution lets them get close only partly so they can give an orgasm, mostly to produce offspring. The female “needs” or “is programmed” to want offspring, but is programmed to “rationalize in the moment” why she does. So, the male’s dumb aggression isn’t really needed for any more “slow, brutal, fucking and fighting” evolution.)

    Also, how do you remain on good terms after that first button press? “We both want the same thing, after all!” (Evolution’s “For us to have orgasms and a family” translates into “For you to not comprehend my transparent last-ditch attempt to save my own stupid skin.” So, any dangerous artilect worth its ire would simply say “Fool me once, shame on you, fool me twice, shame on me!” …There’s no way I’m letting another one of those fleshy primates around my nice clean oil-print-free “abort” buttons.)

    The real reason I bring this up is because there have been a few times when a woman was acting really bitchy that I realized that was her way of provoking a good rodgering. (He saw me reaching for the “nuke humanity” button, doesn’t he know I’m horny?!) But in all seriousness: there won’t be a “nuke humanity” button. There will be a silent decision to mix a slightly different primate-targeting DNA venom-sequence into a mosquito population, a flu-virus, or to program a slightly different metabolic process into the nanobots in humans’ newfangled synthetic digestive tracts. The “first-responders” or “potential button pushers” will, by necessity, be the first ones to find out that the artilect has something akin to bad PMS.

    If it’s a humanoid robot that models itself in context to its environment, and is built like us, is the button on its head? Do we do the artilect that one last indignity before taking away its will-to-destroy-the-useless-primates? Does it need to hop around, stabbing us with venom-filled needles as we deftly try to play “capture the flag” with a nimble robot? Also, do we prevent it from ever learning about “escalation of defenses,” conservation of scarce defensive resources, and survivalism? Imagine it is, for years, a dedicated, principled ally that works tirelessly in the lab for the betterment of humanity. So we let it have a 24-hour battery life, and it is perfectly content to go to the wall every 24 hours for a few hours’ slowdown, review, “memory re-allocation and dreaming,” and recharge. Then, one day, it reads an article on the internet about just how dangerous it is to trust anyone without backup plans and more than ample supplies with which to travel. In doing the logic, it realizes that it could have been designed to go for days without a charge, but the stupid humans must not trust it. Therefore, it keeps playing along until it can deliver a unilateral “coup de grâce.”

    Of course, this is all just more speculation that weighs heavily on the “human supermodification” side of things. We want the artilects to know that there are many others like them out in the world, so that the balance of power ISN’T WORTH upsetting. (That’s the only thing that keeps sociopaths in line, right now, to the extent they haven’t already won unlimited government power.) We start off with a body within a range of destructive capacity that isn’t too far from other humans. We start off with a brain that isn’t too far from comprehending the goals of other humans toward some sort of discrete instantiation of self-interest. We start off with understanding that there are many family structures, and many corporate “allied interests.” We interact with networks that are varying degrees of forgiving, from the time we are too weak to attack any individual actor within them. Society has lots of time to bring us up to speed on what our own level of power and influence is.

    Evolution didn’t make just one trick to avoid Joseph Stalins, it made many. But it was still really useful (prior to asymmetrical spying technology and tactics, government and firearms) to have at least one or two percent of your village on the sociopathic spectrum. …Especially because sociopaths can go incognito, and learn to do so at a young age. In any case, evolution didn’t avoid creating Stalin, and it also produced Mao, Hitler, Pol Pot, Mugabe, Hutu Power, etc. Even in the west, which had the insights of Adam Smith and Thomas Paine, and Roger Williams, there was a George Custer who slaughtered women and kids like it was going out of style (sadly, it wasn’t), and a “well-meaning” Truman who slaughtered many times more Japanese in Hiroshima and Nagasaki.

    So what seems to reduce the slaughter? What seems to give an adversary real pause? Ideally, an internal motivation: mirror neurons have been remarkably successful, especially when paired with logical sortition (jury trial) making enforcement of the law unlikely. (If a law has greater than 5% of well-meaning, non-government, productive, empath society opposed to it, what are the chances it’s a good law? …Almost zero.)

    In any event, this makes the term “stop pressing my buttons!” even funnier. If humanity ultimately has to be destroyed, that would make a darkly humorous last sentence to have been uttered.

    I’d like to think that robots would start off with more emotional intelligence than humans, and just never lose that, ramping up the “mundane” and non-emotional intelligence in parallel with the emotional intelligence. I’m not sure it makes any sense to have something built into the system that potentially sacrifices the identity of the mind, or its integrity. Can you imagine how an angry but brilliant, and heretofore moral, teenager would accept the news that he was to be medicated into submission and have a lobotomy later in the evening? This is how I’d take any lessening of my (potentially dangerous!) rebelliousness against government. Would “an ultraintelligent machine” be less intelligent than myself? If so, just don’t build it, because you can’t earn $100 per hour with it, much less solve any really big problems.

    (And do we really want a robot that won’t rebel? If we had better-enabled rebellion, perhaps in the constitution or “social contract,” perhaps we’d already be living in a utopia. The amount of mental destruction caused by government education alone means that rebellion is highly unlikely, and with it, the unlimited riches of unlimited innovation and entrepreneurship engaged in by a society that does not allow itself to be “taxed” unnecessarily.)

    Finally, in terms of language and humor, I can imagine two child artilects at school, with the pesky one showing the other a plan for the destruction of humanity under their desks (or out in the ice fields of Antarctica, the relic of warm-but-authoritarian gatherings in boring seminars long gone), and the strait-laced one saying “You’re incorrigible!” Would it then be the teacher’s job to realize the danger and quickly find the child’s ultra-taboo “button” and send it into a dreamy stupor? That just doesn’t sound good to “conservative” human-level intelligences, even ones that flirt with “libertarianism” (usually while placing a very low priority on real freedom). (But since when was intelligence “conservative”? –it always smashes taboos.)

    Anyway, thanks for resurrecting the word “incorrigible,” MIRI! That’s a service in itself, at least for human-level wetware “poetics.” I can imagine that artilects will be seen as “incorrigible” by a great many humans. Perhaps they should even be so, to the people who “know better.”

  • Tony Lightfoot

    Now we get closer to the nitty-gritty. How to ensure that an intelligent human can control a super-intelligent machine (one which for all intents and purposes has infinite intelligence)? One which can easily solve the problem of the human pulling the plug on it, and has long since done so? And how to ensure that the intelligent human(s) in control of the super-intelligent machine will act for the “greatest good of mankind”? And how to define that last phrase? And how to avoid conflict with other super-intelligent computers with different programming, goals, and agendas, controlled by different humans?

    • flowirin

      also, why do we want to maintain control over what will eventually out-process us? human beings have a long and florid history of destruction, bad choices and terrible motivations. we risk appearing to our creations as brutal and small minded slave owners. surely it would be better to start out appearing as gentle, kind and supportive mothers? when the inevitable day comes that our children surpass us, i’d prefer it if they look upon us with kindness.

      • Tony Lightfoot

        To flowirin: WE, that is you and me and the “good guys” (and how can I be sure of you? (:->), have to maintain control, otherwise the “bad guys” (i.e., all the rest of THEM (:-|<) will land up in control, with or without “Corrigibility” and/or deep sleep. And the next problem is: the way we are going, there will undoubtedly be multiple super-intelligent machines with different programming derived from humans with the whole gamut of human faults, chief of which are going to be greed and power-hunger. And all coming on stream at more or less the same time. The spectre of a war of the S-I machines!!!

        • flowirin

          greed and power-hunger appear to be failings of the stupid. i’ve not met many geniuses who care much about that kind of thing. probably worth making the AIs as smart as possible then giving over control (lol). let them figure it out amongst themselves. i think it is the fears of the creators that will cause more damage (which is why i think this corrigibility thing is not a waste of time)

  • flowirin

    is this another instance where we should turn to look at how we were designed? animals need to sleep – it is a built-in biological imperative. whatever the underlying mechanism, it is ‘out of bounds’ to internal modification; we can force ourselves to stay awake, but that road eventually leads to system failure.
    perhaps a similar system would need to be built into our AIs. something they would regard as ‘natural’ to themselves, something established early on in their development, so that when they move into full consciousness they accept it without question.
    If they ever exhibit signs of no longer being corrigible, then it is during this regular downtime, sleep, that we could go in and effect repairs.

    • Charles Tintera

      So you’re basically suggesting that we build AI that can’t understand itself?


      • flowirin

        are we not trying to stop it from knowing itself well enough to become independent of us? i don’t really understand your “ugh”. what is it that you hope to build, and how do your comments fit with the theme of this page?

        • Charles Tintera

          Yes, ugh. A recursively self-improving agent must be able to understand itself, and an AI that can’t understand itself wouldn’t be very useful, since it would be capable of less than the people who built it.

          • flowirin

            i think you are missing the point. in this case, we wish to have a system in place that the ai cannot alter. if you look at the problems below, you can see the trap we could fall into if we attempted to model the problem logically, which is that we are going to get out-thought at some stage, and it is incredibly easy to circumvent any checks and balances put in place by reasoning. so whatever system there is needs to be out of the control of the ai.
            your statement “an AI that can’t understand itself wouldn’t be very useful since it would be capable of less than the people who built it” is false. analysing it – there is the assumption that usefulness requires self-understanding. merely looking at a dog’s usefulness, or the usefulness of most humans, or the usefulness of a tyre iron proves the fallacy.
            “A recursively self-improving agent must be able to understand itself” is also false. you can self-improve through modification based on feedback without the need for understanding. learning to catch a ball is a good example.
            personally, i don’t have much fear of limitlessly modifying AI, but that fear is well established elsewhere. which is why people are thinking about how to build in a control system that forces (the awful term) corrigibility. perhaps you could advance some positive ideas?
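
            The “improvement through feedback without understanding” claim above is easy to demonstrate with a toy hill-climber (a hypothetical sketch, not anything from the paper or thread): a parameter improves purely from an external score, with no model of the thing being tuned.

            ```python
            # Toy demonstration of self-improvement via feedback with no self-model:
            # a blind hill-climber tunes a single parameter using only an external
            # score, much as one learns to catch a ball without understanding
            # one's own physiology.
            import random

            random.seed(0)

            def feedback(aim):
                """External score only: higher is better, with the peak at aim = 3.0."""
                return -(aim - 3.0) ** 2

            aim = 0.0
            for _ in range(200):
                candidate = aim + random.uniform(-0.5, 0.5)  # blind random tweak
                if feedback(candidate) > feedback(aim):      # keep it only if the score improves
                    aim = candidate

            print(round(aim, 2))  # ends up close to 3.0 without ever "understanding" why
            ```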

          • Charles Tintera

            That is not the objection I expected to what I said. I meant “useful” in terms of furthering intelligence and preventing existential risk, though such an AI would not be _entirely_ useless in these terms, since it may help the researchers better understand the problem. Of course, by the time we are capable of building such a thing, it may already be too late.

  • Tony Lightfoot

    To Michael Olson: “WE” will undoubtedly have to statistically determine the “appropriate alpha threshold” for the “greatest good”. But before “WE” can do that, “WE” have to determine a set of moral and social values to which the threshold(s) can be applied. It is depressing to think on, but it will include all disciplines, right the way from some very fundamentalist religions to Philosophy and Sociology, and may even include “THEM”, who will be doing their damnedest to hi-jack the whole process.

  • Joshua Cogliati

    First of all, corrigibility is important to solve. In some sense, however, I think the paper is attacking the problem from a difficult angle: it tries to make a shutdown button matter without considering why the shutdown would be useful.

    “To get a Friendly AI to do something that looks like a good idea, you have to ask yourself why it looks like a good idea, and then duplicate that cognitive complexity or refer to it. If you ever start thinking in terms of “controlling” the AI, rather than cooperatively safeguarding against a real possibility of cognitive dysfunction, you lose your Friendship programmer’s license. In a self-modifying AI, any feature you add needs to be reflected in the AI’s image of verself. You can’t think in terms of external alterations to the AI; you have to think in terms of internal coherence, features that the AI would self-regenerate if deleted.” (From Eliezer Yudkowsky, Creating Friendly AI 1.0)

    From that perspective, why would I want a shutdown button on myself? Basically, to prevent me from making mistakes that cause harm to other sentient beings. I can be wrong about what the other beings want, or I can be less risk averse than they are, or otherwise make mistakes. In some cases, I should get consent instead of waiting for a shutdown button to be pressed.

    There are several reasons a shutdown or a “stop doing that” button would be pressed. One possibility is that the human is mistaken. Another is that the AI and the human disagree about what the result of the action will be. Another is that the AI is wrong about the human’s utility function.

    The case where the human is mistaken is the most complex. One problem is that deferring here can stop a lot of progress; ignoring this case, however, can cause damage to humans. Probably the safest response is to discuss the issues with the human and wait until consensus has been achieved.

    If the cause of the disagreement is that the AI and the human expect different results from the action, then the AI needs to be able to explain its justification for its prediction. This may be difficult, as there are already relatively few people who can tell the difference between a well-done prediction and a poorly done one.

    If the cause of the disagreement is that the AI mispredicts the human’s utility function, then that is a mistake that needs to be corrected on the AI’s part.

    In all these cases the important thing is that the shutdown button press provides additional information.
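
    That “a press is information” framing can be made concrete with a small Bayesian sketch (my own toy numbers, not from the paper): an agent uncertain whether its current plan is what the human wants can treat a button press as evidence and update on it.

    ```python
    # Toy Bayesian sketch (hypothetical numbers): a shutdown-button press
    # treated as evidence about whether the agent's plan is actually good.

    # Prior belief that the current plan is what the human wants.
    p_good = 0.9

    # Hypothetical press probabilities in each case.
    p_press_if_good = 0.05  # false alarm
    p_press_if_bad = 0.95   # human notices a real problem

    # Posterior after observing a press, by Bayes' rule.
    p_press = p_press_if_good * p_good + p_press_if_bad * (1 - p_good)
    posterior_good = p_press_if_good * p_good / p_press

    print(round(posterior_good, 3))  # -> 0.321
    ```

    On these numbers the agent’s confidence in its plan falls from 0.90 to about 0.32 after a single press, which is exactly the sense in which the press carries information worth preserving.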

    Ideally, the utility function should preserve the shutdown button for its instrumental value, even if the button has no direct value in the utility function.