Powerful planners, not sentient software

 |   |  Analysis

Over the past few months, some major media outlets have been spreading concern about the idea that AI might spontaneously acquire sentience and turn against us. Many people have pointed out the flaws with this notion, including Andrew Ng, an AI scientist of some renown:

I don’t see any realistic path from the stuff we work on today—which is amazing and creating tons of value—but I don’t see any path for the software we write to turn evil.

He goes on to say, on the topic of sentient machines:

Computers are becoming more intelligent and that’s useful as in self-driving cars or speech recognition systems or search engines. That’s intelligence. But sentience and consciousness is not something that most of the people I talk to think we’re on the path to.

I say, these objections are correct. I endorse Ng’s points wholeheartedly — I see few pathways via which software we write could spontaneously “turn evil.”

I do think that there is important work we need to do in advance if we want to be able to use powerful AI systems for the benefit of all, but this is not because a powerful AI system might acquire some “spark of consciousness” and turn against us. I also don’t worry about creating some Vulcan-esque machine that deduces (using cold mechanic reasoning) that it’s “logical” to end humanity, that we are in some fashion “unworthy.” The reason to do research in advance is not so fantastic as that. Rather, we simply don’t yet know how to program intelligent machines to reliably do good things without unintended consequences.

The problem isn’t Terminator. It’s “King Midas.” King Midas got exactly what he wished for — every object he touched turned to gold. His food turned to gold, his children turned to gold, and he died hungry and alone.

Powerful intelligent software systems are just that: software systems. There is no spark of consciousness which descends upon sufficiently powerful planning algorithms and imbues them with feelings of love or hatred. You get only what you program.1

To build a powerful AI software system, you need to write a program that represents the world somehow, and that continually refines this world-model in response to percepts and experience. You also need to program powerful planning algorithms that use this world-model to predict the future and find paths that lead towards futures of some specific type.

The focus of our research at MIRI isn’t centered on sentient machines that think or feel as we do. It’s aimed towards improving our ability to program software systems to execute plans leading towards very specific types of futures.

A machine programmed to build a highly accurate world-model and employ powerful planning algorithms could yield extraordinary benefits. Scientific and technological innovation have had great impacts on quality of life around the world, and if we can program machines to be intelligent in the way that humans are intelligent — only faster and better — we can automate scientific and technological innovation. When it comes to the task of improving human and animal welfare, that would be a game-changer.

To build a machine that attains those benefits, the first challenge is to do this world-modeling and planning in a highly reliable fashion: you need to ensure that it will consistently pursue its goal, whatever that is. If you can succeed at this, the second challenge is making that goal a safe and useful one.

If you build a powerful planning system that aims at futures in which cancer is cured, then it may well represent all of the following facts in its world-model: (a) The fastest path to a cancer cure involves proliferating robotic laboratories at the expense of the biosphere and kidnapping humans for experimentation; (b) once you realize this, you’ll attempt to shut it down; and (c) if you shut it down, it will take a lot longer for cancer to be cured. The system may then execute a plan which involves deceiving you until it is able to resist and then proliferating robotic laboratories and kidnapping humans. This is, in fact, what you asked for.

We can avoid this sort of outcome, if we manage to build machines that do what we mean rather than what we said. That sort of behavior doesn’t come for free: you have to program it in.

A superhuman planning algorithm with an extremely good model of the world could find solutions you never imagined. It can make use of patterns you haven’t noticed and find shortcuts you didn’t recognize. If you follow a plan generated by a superintelligent search process, it could have disastrous unintended consequences. To quote professor Stuart Russell (author of the leading AI textbook):

The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.

Humans have a lot of fiddly little constraints akin to “oh, and don’t kidnap any humans while you’re curing cancer”. Programming in a full description of human values and human norms by hand, in a machine-readable format, doesn’t seem feasible. If we want the plans generated by superhuman planning algorithms to respect all of our complicated unspoken constraints and desires, then we’ll need to develop new tools for predicting and controlling the behavior of general-purpose autonomous agents. There’s no two ways about it.

Many people, when they first encounter this problem, come up with a reflexive response about why the problem won’t be as hard as it seems. One common one is “If a powerful planner starts running amok, we can just unplug it” — an objection which is growing obsolete in the era of cloud computing, and which fails completely if the system has access to the internet or any other network where it can copy itself onto other machines.

Another common one is “Why not have the system output a plan rather than having it execute the plan?” — but if we direct a powerful planning procedure to generate plans such that (a) humans who examine the plan approve of it and (b) executing it leads to cancer being cured, then the plan may well be one that looks good but which exploits some predictable oversight in the verification procedure and kidnaps people anyway.

Or you could say, “How about we just make systems which only answer questions?” But how exactly do you direct a superhuman planning procedures towards “answering questions”? Will you program it to output text that it predicts will cause you to press the “highly satisfied” button after the answer has been output? Because in that case, the system may well output text that constitutes a particularly deceptive answer. Or, if you add a constraint that the answer must be accurate, it may output text that manipulates you into asking easier questions in the future.

Maybe you reply, “Well, perhaps instead I’ll direct the planner to move toward futures where its output is measured by this clever metric where…,” and now you’ve been drawn in. How exactly could we build powerful planers that search for beneficial futures? It looks like it’s possible to build systems that somehow learn the user’s intentions or values and act according to them, but actually doing so is not trivial. You’ve got to think hard to build systems that figure out all the intricacies of your intentions without deceiving or manipulating you while acquiring that information. That doesn’t happen for free: ambitious, long-term software projects are still ultimately software projects, and we have to figure out how to actually write the required code.

If we can figure out how to build smarter-than-human machines aligned with our interests, the benefits could be extraordinary. Like Phil Libin (founder of Evernote) says, AI could be “one of the greatest forces for good the universe has ever seen.” It’s possible to get there, but it’s going to require some work.


  1. You could likely program an AI system to be conscious, which would greatly complicate the situation — for then the system itself would be a moral patient, and its preferences would weigh into our considerations. As Ng notes, however, “consciousness” is not the same thing as “intelligence.” 
  • Sarah Markham

    What is the event and process that is ‘consciousness’ were a feature unique to the physical properties of organic (biological) structures similar to our human brains? If we were to engineer such carbon based computing systems would it be possible to create living artificial intelligence?

  • Hank Smith

    We can exploit the known limitations of any computer solution – its fundamentally inability to do what humans can do. We do that now with captcha. Those fundamental limitations are well known. The computer will always produce a limited result, no matter what its intelligence level. The limitations can be found by humans easily. Consider this to be like the oral arguments phase of a court decision. In this case the computer could produce a nice looking brief, but fail at oral arguments immediately.

    Human abilities are vast and unfathomed. They are not the result of just neural activity. Humans create and circumvent Godel Incompleteness on a routine, casual basis. This is why captcha’s are decisive.

  • Koli Mitra

    Two layperson questions:

    (1) why would it use deception? If the
    program’s instruction is ONLY to make recommendations and the
    “intelligence” doesn’t have preferences or desires of its own, then why
    wouldn’t it simply make the recommendations? Is the idea that the program (which is in control of finding the optimal way to do whatever it is we want its recommendation on) might consider human resistance as an obstacle and hence, make that obstacle part of the consideration in terms what to “recommend” ?

    (2) It sounds like you’re saying that instead of fearing AI that has actual “human-like” consciousness, we should actually worry more about AI precisely because it won’t have those powers. Would it be better (assuming it’s possible) to try to build a more human-mind like AI, with something like “judgment” or “common sense” as a way to avoid the kind of “Midas touch” problem you describe?