New paper: “Safely interruptible agents”


Google DeepMind Research Scientist Laurent Orseau and MIRI Research Associate Stuart Armstrong have written a new paper on error-tolerant agent designs, “Safely interruptible agents.” The paper is forthcoming at the 32nd Conference on Uncertainty in Artificial Intelligence.


Reinforcement learning agents interacting with a complex environment like the real world are unlikely to behave optimally all the time. If such an agent is operating in real time under human supervision, now and then it may be necessary for a human operator to press the big red button to prevent the agent from continuing a harmful sequence of actions—harmful either for the agent or for the environment—and to lead the agent into a safer situation. However, if the learning agent expects to receive rewards from this sequence, it may learn in the long run to avoid such interruptions, for example by disabling the red button—which is an undesirable outcome.

This paper explores a way to make sure a learning agent will not learn to prevent (or seek!) being interrupted by the environment or a human operator. We provide a formal definition of safe interruptibility and exploit the off-policy learning property to prove that some agents, like Q-learning, are already safely interruptible, while others, like Sarsa, can easily be made so. We show that even ideal, uncomputable reinforcement learning agents for (deterministic) general computable environments can be made safely interruptible.
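The off-policy distinction the abstract leans on is visible in the update rules themselves. Below is a minimal sketch (not code from the paper; the state/action encoding and hyperparameters are illustrative assumptions): Q-learning bootstraps from the *greedy* action at the next state, so its value estimates do not depend on which action the behavior policy actually takes next, while Sarsa bootstraps from the action actually taken—so an operator's interruptions leak into Sarsa's values unless the algorithm is modified.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Off-policy update: bootstrap from the greedy action at s_next,
    regardless of what the (possibly interrupted) behavior policy does."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy update: bootstrap from the action actually taken next.
    If interruptions override a_next, their effect contaminates the
    value estimates -- the reason Sarsa needs modification to be
    safely interruptible."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Both updates look nearly identical; the safety-relevant difference is entirely in whose action appears inside the bootstrap target.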

Orseau and Armstrong’s paper constitutes a new angle of attack on the problem of corrigibility. A corrigible agent is one that recognizes it is flawed or under development and assists its operators in maintaining, improving, or replacing itself, rather than resisting such attempts.

In the case of superintelligent AI systems, corrigibility is primarily aimed at averting unsafe convergent instrumental policies (e.g., the policy of defending its current goal system from future modifications) when such systems have incorrect terminal goals. This leaves us more room for approximate, trial-and-error, and learning-based solutions to AI value specification.

Interruptibility is an attempt to formalize one piece of the intuitive idea of corrigibility. Utility indifference (in Soares, Fallenstein, Yudkowsky, and Armstrong’s “Corrigibility”) is an example of a past attempt to define a different piece of corrigibility: systems that are indifferent to programmers’ interventions to modify their terminal goals, and will therefore avoid trying to force their programmers either to make such modifications or to avoid such modifications. “Safely interruptible agents” instead attempts to define systems that are indifferent to programmers’ interventions to modify their policies, and will not try to stop programmers from intervening on their everyday activities (nor try to force them to intervene).

Here the goal is to make the agent’s policy converge to whichever policy is optimal if the agent believed there would be no future interruptions. Even if the agent has experienced interruptions in the past, it should act just as though it will never experience any further interruptions. Orseau and Armstrong show that several classes of agent are safely interruptible, or can be easily made safely interruptible.
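To make the idea of "acting as though interruptions will never recur" concrete, here is a toy sketch of an interrupted action-selection loop. It simplifies the paper's interruption scheme considerably (in the paper, interruptions depend on the history and occur with a probability that rises over time); the `theta` parameter and `safe_action` name here are illustrative assumptions, not the paper's notation:

```python
import random

def interruptible_step(Q, s, actions, epsilon, theta, safe_action):
    """With probability theta, the operator interrupts and forces a fixed
    safe action; otherwise the agent acts epsilon-greedily on its own
    values.  Paired with an off-policy learner (e.g. q_learning_update),
    the value estimates are unaffected by which branch was taken, so the
    learned policy converges as if interruptions never happened."""
    if random.random() < theta:        # operator presses the red button
        return safe_action, True
    if random.random() < epsilon:      # ordinary exploration
        return random.choice(actions), False
    return max(actions, key=lambda a: Q[(s, a)]), False  # greedy choice
```

The point of the construction is that interruption only alters *behavior*, never the learning target, so the agent acquires no incentive to prevent (or seek) the button press.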


  • Joshua Cogliati

    First of all, it is an interesting paper, and the definition and concept of safe interruptibility are useful. I think, however, that the required constraints, as I understand the paper, are severely limiting. Theorem 7 can only be applied if the optimal policy π^μ already exists, and obtaining an optimal policy is a hard problem in itself. The online safely interruptible versions in section 3, as I read the paper, require int-GLIE, and by definition 10(b) the interruptible policy must visit each state-action pair infinitely often. But the introduction notes that “there may be physical safety constraints during learning,” which seems to mean that some state-action pairs cannot be allowed, violating the int-GLIE requirement. So I think this paper at best illustrates how much research is needed before safe interruptibility is possible in anything but a safe environment.