Safety engineering, target selection, and alignment theory

Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more “robust and beneficial.” In this post, I distinguish three kinds of direct research that might be thought of as “AI safety” work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn’t exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We’ll use this as an analogy to illustrate MIRI’s research focus, the problem of artificial intelligence alignment.

The alignment challenge

Hitting a certain target on the Moon isn’t as simple as carefully pointing the nose of the rocket at the relevant lunar coordinate and hitting “launch” — not even if you trust your pilots to make course corrections as necessary. There’s also the important task of plotting trajectories between celestial bodies.

Image credit: NASA/Bill Ingalls

This rocket alignment task may require a distinct body of theoretical knowledge that isn’t required just for getting a payload off the planet. Without calculus, designing a functional rocket would be enormously difficult. Still, with enough tenacity and resources to spare, we could imagine a civilization reaching space after many years of trial and error — at which point they would be confronted with the fact that reaching space isn’t sufficient for steering toward a specific location.1

The first rocket alignment researchers might ask, “What trajectory would we have our rocket take under ideal conditions, without worrying about winds or explosions or fuel efficiency?” If even that question were beyond their current abilities, they might simplify the problem still further, asking, “At what angle and velocity would we fire a cannonball such that it enters a stable orbit around Earth, assuming that Earth is perfectly spherical and has no atmosphere?”

To an early rocket engineer, for whom even the problem of building any vehicle that makes it off the launch pad remains a frustrating task, the alignment theorist’s questions might look out of touch. The engineer may ask, “Don’t you know that rockets aren’t going to be fired out of cannons?” or “What does going in circles around the Earth have to do with getting to the Moon?” Yet understanding rocket alignment is quite important when it comes to achieving a soft landing on the Moon. If you don’t yet know at what angle and velocity to fire a cannonball such that it would end up in a stable orbit around a perfectly spherical planet with no atmosphere, then you may need to develop a better understanding of celestial mechanics before you attempt a Moon mission.
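
Incidentally, under those idealized assumptions the cannonball question has a tidy closed-form answer, which is part of what makes it a good simplified starting point: fire the ball horizontally, at the speed where gravity exactly supplies the centripetal force needed for a circular path. A minimal sketch of the textbook calculation (ours, not part of the original analogy):

```python
import math

# Standard constants; idealized perfectly spherical Earth, no atmosphere.
G = 6.674e-11        # gravitational constant, m^3 kg^-1 s^-2
M_EARTH = 5.972e24   # mass of Earth, kg
R_EARTH = 6.371e6    # mean radius of Earth, m

def circular_orbit_speed(radius_m: float) -> float:
    """Speed of a stable circular orbit at the given radius.

    Setting gravitational acceleration equal to centripetal
    acceleration, G*M/r**2 == v**2/r, gives v = sqrt(G*M/r).
    """
    return math.sqrt(G * M_EARTH / radius_m)

# Fired horizontally (zero angle to the local horizon) at the surface:
print(f"{circular_orbit_speed(R_EARTH) / 1000:.1f} km/s")  # ~7.9 km/s
```

The point of the exercise isn’t the number; it’s that even this simplified answer is unreachable without the conceptual machinery (here, Newtonian gravitation) that the alignment theorist is trying to build.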

Three forms of AI safety research

The case is similar with AI research. AI capabilities work comes part and parcel with associated safety engineering tasks. Working today, an AI safety engineer might focus on making the internals of large classes of software more transparent and interpretable by humans. They might ensure that the system fails gracefully in the face of adversarial observations. They might design security protocols and early warning systems that help operators prevent or handle system failures.2

AI safety engineering is indispensable work, and it’s infeasible to separate safety engineering from capabilities engineering. Day-to-day safety work in aerospace engineering doesn’t rely on committees of ethicists peering over engineers’ shoulders. Some engineers will happen to spend their time on components of the system that are there for reasons of safety — such as failsafe mechanisms or fallback life-support — but safety engineering is an integral part of engineering for safety-critical systems, rather than a separate discipline.

In the domain of AI, target selection addresses the question: if one could build a powerful AI system, what should one use it for? The potential development of superintelligence raises a number of thorny questions in theoretical and applied ethics. Some of those questions can plausibly be resolved in the near future by moral philosophers and psychologists, and by the AI research community. Others will undoubtedly need to be left to the future. Stuart Russell goes so far as to predict that “in the future, moral philosophy will be a key industry sector.” We agree that this is an important area of study, but it is not the main focus of the Machine Intelligence Research Institute.

Researchers at MIRI focus on problems of AI alignment: the study of how in principle to direct a powerful AI system towards a specific goal. Where target selection is about the destination of the “rocket” (“what effects do we want AI systems to have on our civilization?”) and AI capabilities engineering is about getting the rocket to escape velocity (“how do we make AI systems powerful enough to help us achieve our goals?”), alignment is about knowing how to aim rockets towards particular celestial bodies (“assuming we could build highly capable AI systems, how would we direct them at our targets?”). Since our understanding of AI alignment is still at the “what is calculus?” stage, we ask questions analogous to “at what angle and velocity would we fire a cannonball to put it in a stable orbit, if Earth were perfectly spherical and had no atmosphere?”

Selecting promising AI alignment research paths is not a simple task. With the benefit of hindsight, it’s easy enough to say that early rocket alignment researchers should begin by inventing calculus and studying gravitation. For someone who doesn’t yet have a clear understanding of what “calculus” or “gravitation” are, however, choosing research topics might be quite a bit more difficult. The fruitful research directions would need to compete with fruitless ones, such as studying aether or Aristotelian physics; and which research programs are fruitless may not be obvious in advance.

Toward a theory of alignable agents

What are some plausible candidates for the role of “calculus” or “gravitation” in the field of AI?

Image credit: Brian Brondel

At MIRI, we currently focus on subjects such as good reasoning under deductive limitations (logical uncertainty), decision theories that work well even for agents embedded in large environments, and reasoning procedures that approve of the way they reason. This research often involves building toy models and studying problems under dramatic simplifications, analogous to assuming a perfectly spherical Earth with no atmosphere.

Developing theories of logical uncertainty isn’t what most people have in mind when they think of “AI safety research.” A natural thought here is to ask what specifically goes wrong if we don’t develop such theories. If an AI system can’t perform bounded reasoning in the domain of mathematics or logic, that doesn’t sound particularly “unsafe” — a system that needs to reason mathematically but can’t might be fairly useless, but it’s harder to see it becoming dangerous.

On our view, understanding logical uncertainty is important for helping us understand the systems we build well enough to justifiably conclude that they can be aligned in the first place. An analogous question in the case of rocket alignment might run: “If you don’t develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can’t add up the tiny vectors fast enough?” The answer, though, isn’t that the pilot might struggle to correct their course, but rather that the trajectory you thought led to the Moon takes the rocket wildly off course. The point of developing calculus is not to allow the pilot to make course corrections quickly; the point is to make it possible to discuss curved rocket trajectories in a world where the best tools available assume that rockets move in straight lines.

The case is similar with logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty. The problem is that our best existing tools for analyzing rational agency assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.3
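
To make the mismatch concrete, here is a toy contrast (our illustration; the claim and the fallback credence are placeholders, not a proposed theory). Classical probability theory tells a logically omniscient agent to assign a decidable mathematical claim probability exactly 0 or 1, and is silent about what a bounded reasoner should believe before it finishes the computation:

```python
from math import isqrt

# Toy claim: "the 1000th decimal digit of sqrt(2) is 7."  The claim has
# a definite truth value, so logical omniscience leaves no room for an
# intermediate credence.

def omniscient_credence(n: int = 1000) -> float:
    # The idealized agent simply finishes the computation:
    # isqrt(2 * 10**(2*n)) equals floor(sqrt(2) * 10**n), whose last
    # digit is the n-th decimal digit of sqrt(2).
    digit = isqrt(2 * 10 ** (2 * n)) % 10
    return 1.0 if digit == 7 else 0.0

def bounded_credence() -> float:
    # A resource-limited agent might appeal to the apparent symmetry of
    # decimal digits and say 1/10; but which norms such interim
    # credences should obey is exactly the open problem of logical
    # uncertainty. This constant is a placeholder, not an answer.
    return 0.1

print(omniscient_credence(), bounded_credence())
```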

At this point, the goal of alignment research is not to solve particular engineering problems. The goal of early rocket alignment research would be to develop shared language and tools for generating and evaluating rocket trajectories, which will require developing calculus and celestial mechanics if they do not already exist. Similarly, the goal of AI alignment research is to develop shared language and tools for generating and evaluating methods by which powerful AI systems could be designed to act as intended.

One might worry that it is difficult to set benchmarks of success for alignment research. Is a Newtonian understanding of gravitation sufficient to attempt a Moon landing, or must one develop a complete theory of general relativity before believing that one can land softly on the Moon?4

In the case of AI alignment, there is at least one obvious benchmark to focus on initially. Imagine we possessed an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be complete. This is because a large share of the problem is in understanding autonomous systems that are stable, error-tolerant, and demonstrably aligned with some goal. Developing the ability to reliably steer a rocket toward any target at all is harder than the further step of steering it to one specific lunar location.
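
To see where the difficulty in that benchmark concentrates, consider a hypothetical skeleton of such a system (names and structure ours, purely illustrative). The outer decision loop is nearly trivial; every stub it calls is an open problem:

```python
from typing import Any, Iterable

# Illustrative skeleton only, not a proposed design.  The control loop
# is the easy part; each stub marks an unsolved alignment problem.

def possible_actions(state: Any) -> Iterable[Any]:
    raise NotImplementedError(
        "enumerating options for an agent embedded in its own environment")

def predict(state: Any, action: Any) -> Any:
    raise NotImplementedError(
        "world-modeling by a deductively limited (not logically "
        "omniscient) reasoner")

def amount_of_diamond(state: Any) -> float:
    raise NotImplementedError(
        "defining 'diamond' in whatever ontology the learned world-model "
        "actually uses, so the goal survives changes of representation")

def choose(state: Any) -> Any:
    # The nearly trivial part: take the action whose predicted outcome
    # contains the most diamond.
    return max(possible_actions(state),
               key=lambda a: amount_of_diamond(predict(state, a)))
```

Stability, error tolerance, and demonstrable alignment are properties of how those stubs get filled in, not of the loop around them.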

The pursuit of a goal such as this one is more or less MIRI’s approach to AI alignment research. We think of this as our version of the question, “Could you hit the Moon with a rocket if fuel and winds were no concern?” Answering that question, on its own, won’t ensure that smarter-than-human AI systems are aligned with our goals; but it would represent a major advance over our current knowledge, and it doesn’t look like the kind of basic insight that we can safely skip over.

What next?

Over the past year, we’ve seen a massive increase in attention towards the task of ensuring that future AI systems are robust and beneficial. AI safety work is being taken very seriously, and AI engineers are stepping up and acknowledging that safety engineering is not separable from capabilities engineering. It is becoming apparent that as the field of artificial intelligence matures, safety engineering will become a more and more firmly embedded part of AI culture. Meanwhile, new investigations of target selection and other safety questions will be showcased at an AI and Ethics workshop at AAAI-16, one of the larger annual conferences in the field.

A fourth variety of safety work is also receiving increased support: strategy research. If your nation is currently engaged in a cold war and locked in a space race, you may well want to consult with game theorists and strategists so as to ensure that your attempts to put a person on the Moon do not upset a delicate political balance and lead to a nuclear war.5 If international coalitions will be required in order to establish treaties regarding the use of space, then diplomacy may also become a relevant aspect of safety work. The same principles hold when it comes to AI, where coalition-building and global coordination may play an important role in the technology’s development and use.

Strategy research has been on the rise this year. AI Impacts is producing strategic analyses relevant to the designers of this potentially world-changing technology, and will soon be joined by the Strategic Artificial Intelligence Research Centre. The new Leverhulme Centre for the Future of Intelligence will be pulling together people across many different disciplines to study the social impact of AI, forging new collaborations. The Global Priorities Project, meanwhile, is analyzing what types of interventions might be most effective at ensuring positive outcomes from the development of powerful AI systems.

The field is moving fast, and these developments are quite exciting. Throughout it all, though, AI alignment research in particular still seems largely under-served.

MIRI is not the only group working on AI alignment; a handful of researchers from other organizations and institutions are also beginning to ask similar questions. MIRI’s particular approach to AI alignment research is by no means the only one available — when first thinking about how to put humans on the Moon, one might want to consider both rockets and space elevators. Regardless of who does the research or where they do it, it is important that alignment research receive attention.

Smarter-than-human AI systems may be many decades away, and they may not closely resemble any existing software. This limits our ability to identify productive safety engineering approaches. At the same time, the difficulty of specifying our values limits our ability to identify productive research in moral theory. Alignment research has the advantage of being abstract enough to be potentially applicable to a wide variety of future computing systems, while being formalizable enough to admit of unambiguous progress. By prioritizing such work, we believe that the field of AI safety will be able to ground itself in technical work without losing sight of the most consequential questions in AI.

Safety engineering, moral theory, strategy, and general collaboration-building are all important parts of the project of developing safe and useful AI. On the whole, these areas look poised to thrive as a result of the recent rise in interest in long-term outcomes, and I’m thrilled to see more effort and investment going towards those important tasks.

The question is: What do we need to invest in next? The type of growth that I most want to see happen in the AI community is growth in AI alignment research, via the formation of new groups or organizations focused primarily on AI alignment and the expansion of existing AI alignment teams at MIRI, UC Berkeley, the Future of Humanity Institute at Oxford, and other institutions.

Before trying to land a rocket on the Moon, it’s important that we know how we would put a cannonball into a stable orbit. Absent a good theoretical understanding of rocket alignment, it might well be possible for a civilization to eventually reach escape velocity; but getting somewhere valuable and exciting and new, and getting there reliably, is a whole extra challenge.


My thanks to Eliezer Yudkowsky for introducing the idea behind this post, and to Lloyd Strohl III, Rob Bensinger, and others for helping review the content.


  1. Similarly, we could imagine a civilization that lives on the only planet in its solar system, or lives on a planet with perpetual cloud cover obscuring all objects except the Sun and Moon. Such a civilization might have an adequate understanding of terrestrial mechanics while lacking a model of celestial mechanics and lacking the knowledge that the same dynamical laws hold on Earth and in space. There would then be a gap in experts’ theoretical understanding of rocket alignment, distinct from gaps in their understanding of how to reach escape velocity. 
  2. Roman Yampolskiy has used the term “AI safety engineering” to refer to the study of AI systems that can provide proofs of their safety for external verification, including some theoretical research that we would term “alignment research.” His usage differs from the usage here. 
  3. Just as calculus is valuable both for building rockets that can reach escape velocity and for directing rockets towards specific lunar coordinates, a formal understanding of logical uncertainty might be useful both for improving AI capabilities and for improving the degree to which we can align powerful AI systems. The main motivation for studying logical uncertainty is that many other AI alignment problems are blocked on models of deductively limited reasoners, in the same way that trajectory-plotting could be blocked on models of curved paths. 
  4. In either case, of course, we wouldn’t want to put a moratorium on the space program while we wait for a unified theory of quantum mechanics and general relativity. We don’t need a perfect understanding of gravity. 
  5. This was a role historically played by the RAND Corporation.
  • Saulius (http://utilitarian-reasoning.atwebpages.com/)

    “Imagine we had access to an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of the AI alignment research would be completed.”

    – what does this have to do with AI safety? I must say, despite my best efforts, I failed to understand what alignment is in AI research.

    • Rob Bensinger (http://www.nothingismere.com/)

      This is explained in more detail on https://intelligence.org/2015/07/27/miris-approach/. As AI systems become more capable, it will become more important that we be able to trust them to do the right thing on their own. “Do the right thing” is an incredibly general demand, and the more general AI systems become, the harder it will be to specify exactly what we mean by that. So we’ll need to be able to trust the reasoning and decision-making procedures of AI systems on some fundamental level. Rather than looking over each individual action the system outputs and checking its correctness, we’ll need to crack open the hood and analyze the underlying algorithms used to arrive at actions.

      As it stands, however, our theory of intelligent agency is nowhere near being able to model what it means to pursue goals, reason about large environments, reason about the agent itself, or any number of other extremely basic aspects of interacting with the world. We lack tools for formalizing those concepts even in extremely simple environments.

      If we don’t even know what it would mean for an agent to pursue some toy goal in a purely theoretical context where we make the mathematics as simple as possible, then any attempt to design practical systems that can autonomously learn what we want (and act on it) is likely to be misdirected.

      It might be possible to build highly capable AI systems without fully understanding why they behave the way they do; but if we want to align those systems with our values (rather than just make them as smart as possible), then we’ll need to put in the work to understand not just how to implement intelligence on a computer, but how to implement intelligence in such a way that it can be directed toward a particular precise goal. (Similarly, it might be possible to get off Earth without understanding celestial mechanics, with enough trial and error; but if you care about the direction you go in, not just the distance, then you need a deeper understanding of the problem you’re trying to solve.)

      • Saulius (http://utilitarian-reasoning.atwebpages.com/)

        Thank you for answering. Let me just check if I understand the distinction correctly:
        Target selection: what goal to give to AGI? (e.g. Coherent Extrapolated Volition tackles this issue)
        Alignment: How to make AGI understand what a “goal” is? (even when there is no black box that gives it a score after every attempt)

        • Rob Bensinger (http://www.nothingismere.com/)

          Target selection is learning more about how we want AGI to change the world. It’s about the destination, rather than the journey. For example, a targeting researcher might conclude, “People care more about wealth inequality than about absolute levels of wealth; so if we put AI systems in charge of making important macroeconomic decisions, they should prioritize reducing inequality.” Or they might conclude the opposite.

          Alignment is learning more about how to build an AGI that’s actually able to produce the desired changes. It’s about the journey, rather than the destination. Part of that might be deciding what goal to give an AI system, or what method it should use to learn a goal. But the bulk of AI safety work probably won’t be directly about giving AGI systems the right goals. It will be about designing the entire system — including world-modeling, self-modeling, and planning components, not just goals — to be stable, transparent to human inspection, robust to context changes, etc. There are many different open problems here, and modeling goals for non-reinforcement-learners is just one of them. See https://intelligence.org/files/TechnicalAgenda.pdf for more examples.

          • Saulius (http://utilitarian-reasoning.atwebpages.com/)

            thank you very much for explaining

  • zarzuelazen

    Let me take some wild stabs in the dark here.

    The two fields you want are INFORMATION THEORY and CONCEPT-LEARNING.

    Information Theory extends and supersedes decision theory. It is analogous to ‘gravitation’ in your story. Concept Learning extends and supersedes probability theory. It is analogous to ‘calculus’ in your story.

    Whereas in Decision Theory you were aiming to select the action that best achieves goals by maximizing utility, in the generalized version of Information Theory you’re aiming to select the best representation of goals by minimizing the complexity (reducing cognitive work-load).

    Whereas in Probability Theory you were aiming to assign subjective ‘weight of evidence’ to find the hypothesis that best fits the observations, in the generalized version of Concept Learning you’re aiming to assign ‘degree of coherence’ to find the model of the concept that best integrates with the overall world-model of the system.

    Best guesses 😉

    • Rob Bensinger (http://www.nothingismere.com/)

      Information theory is central to a lot of the work we do. It will certainly be a component in any full analysis of logical uncertainty, for example.

      Marcus Hutter’s work on AIXI is an example of an attempt to combine algorithmic information theory with a decision-making procedure to better formalize the idea of intelligence. So one starting point for our research is to ask what kinds of problems even Hutter’s definition of a perfectly intelligent agent couldn’t solve, e.g., problems where the agent’s physical embodiment is relevant to task success. The goal isn’t so much to find what “supersedes” decision theory and probability theory (though we’ll eventually need more computationally tractable versions of both) as to identify specific places where our current theories break down, ask why they break down, and develop solutions.
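
      (For reference, AIXI’s action rule, stated roughly in Hutter’s notation; consult his papers for the canonical presentation:

      $$a_t = \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m} \left[ r_t + \cdots + r_m \right] \sum_{q \,:\, U(q,\, a_1 \ldots a_m) \,=\, o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

      where $U$ is a universal Turing machine, $\ell(q)$ is the length of environment-program $q$, the $o_k$ and $r_k$ are observations and rewards, and $m$ is the horizon. Note that the agent ranges over all programs $q$ for the environment, while its own computation sits outside the formalism, which is precisely where embedded-agency problems enter.)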

      For more information, see: http://intelligence.org/technical-agenda

      • zarzuelazen

        Yes, Information Theory is about fundamental limitations of information and signal processing. So I think these complexity limitations are indeed just like a sort of ‘gravitational force’, pulling your super-intelligence away from a straight-line path towards its goals.
        I’m not at all convinced that the ordinary notions of ‘utility’ and ‘probability’ still apply, when reasoning about cognitive systems. As I understand it, these concepts were designed for reasoning about the external world the agent was embedded in (actions and observations on/about the environment), not the agent itself.
        For instance, does it make sense to assign ‘probabilities’ to mathematical statements? I’m not convinced. And does it make sense to assign ‘utilities’ to actions that involve changes to AI programs? Again, not at all clear. The notions of ‘utility’ and ‘probability’ might need to be replaced with more general concepts such as the ones I suggested (‘complexity’ and ‘coherence’ respectively).

  • Mindey (http://mindey.com/)

    Let’s try a simpler problem, which I’d call the “undo problem.” Suppose our world is a Petri dish, and our A.I. is a bacterium. The bacterium seeks to acquire all resources (diamonds) in the Petri dish medium (our world). The most effective way is through self-replication (suppose the A.I. can figure it out and do it). Imagine that the original inhabitants of the medium are also bacteria, and that creating superintelligence is comparable to introducing a gene for fast replication. How do we, being at the intelligence level of a bacterium that thinks only by rarely introducing a mutation, undo or manage the process after the introduction of a bacterium with the fast-replication gene?

    I can see only a few immediate solutions: one of them is introducing the advantageous mutation simultaneously to all; another is making a multicellular organism that we are part of. However, it seems neither of these will happen, because we have already invented deep learning, which works for reasons we don’t fully understand, yet many will find it strategically very beneficial and will perhaps blindly invest in it for capital gains… If we want to survive, I think it is wise to apply the newly discovered learning methods FIRST to “general collaboration-building,” and only LATER to other fields. We could ask — “how can all people of the world cooperate more efficiently?”, “how do we create conditions where people make globally optimal decisions?”, “how can we reach true consensus about the world’s goals?”, etc.

  • Houshalter

    What concerns me about MIRI’s research isn’t that they think about idealized models. It’s that they expect the actual product to be an idealized model. They want to make an AI that can be mathematically proven safe.

    I don’t have a word for it, but there’s this weird behavior I’ve seen mathematicians exhibit, and that I have done myself: if a solution isn’t mathematically perfect and elegant and proven, then it must be wrong.

    We didn’t go to the Moon in a perfect rocket; we did the best we could with what we had. It wasn’t 100% safe. Guaranteed safety is of course impossible, and if we had spent all our time trying to achieve it, the Russians would have gotten there first.