Safety engineering, target selection, and alignment theory


Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more “robust and beneficial.” In this post, I distinguish three kinds of direct research that might be thought of as “AI safety” work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn’t exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We’ll use this as an analogy to illustrate MIRI’s research focus, the problem of artificial intelligence alignment.

The alignment challenge

Hitting a certain target on the Moon isn’t as simple as carefully pointing the nose of the rocket at the relevant lunar coordinate and hitting “launch” — not even if you trust your pilots to make course corrections as necessary. There’s also the important task of plotting trajectories between celestial bodies.

Image credit: NASA/Bill Ingalls

This rocket alignment task may require a distinct body of theoretical knowledge that isn’t required just for getting a payload off of the planet. Without calculus, designing a functional rocket would be enormously difficult. Still, with enough tenacity and enough resources to spare, we could imagine a civilization reaching space after many years of trial and error — at which point they would be confronted with the problem that reaching space isn’t sufficient for steering toward a specific location.1

The first rocket alignment researchers might ask, “What trajectory would we have our rocket take under ideal conditions, without worrying about winds or explosions or fuel efficiency?” If even that question were beyond their current abilities, they might simplify the problem still further, asking, “At what angle and velocity would we fire a cannonball such that it enters a stable orbit around Earth, assuming that Earth is perfectly spherical and has no atmosphere?”

To an early rocket engineer, for whom even the problem of building any vehicle that makes it off the launch pad remains a frustrating task, the alignment theorist’s questions might look out-of-touch. The engineer may ask “Don’t you know that rockets aren’t going to be fired out of cannons?” or “What does going in circles around the Earth have to do with getting to the Moon?” Yet understanding rocket alignment is quite important when it comes to achieving a soft landing on the Moon. If you don’t yet know at what angle and velocity to fire a cannonball such that it would end up in a stable orbit on a perfectly spherical planet with no atmosphere, then you may need to develop a better understanding of celestial mechanics before you attempt a Moon mission.
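For readers who want to see what an answer to the cannonball question looks like once the theory exists, here is a minimal sketch using textbook Newtonian two-body mechanics. The constants for Earth and the function names (`circular_speed`, `escape_speed`, `classify_shot`) are my own illustrative assumptions, not anything from a real mission tool. The sketch treats Earth as a perfectly spherical point mass with no atmosphere, exactly as the simplified question does.

```python
import math

# Assumed standard values for Earth (not taken from the post).
MU_EARTH = 3.986004418e14  # gravitational parameter GM, in m^3/s^2
R_EARTH = 6.371e6          # mean radius, in m


def circular_speed(radius=R_EARTH, mu=MU_EARTH):
    """Speed needed for a circular orbit at `radius` from the planet's center."""
    return math.sqrt(mu / radius)


def escape_speed(radius=R_EARTH, mu=MU_EARTH):
    """Speed at which a body at `radius` is no longer gravitationally bound."""
    return math.sqrt(2.0 * mu / radius)


def classify_shot(speed, angle_deg, radius=R_EARTH, mu=MU_EARTH, tol=1.0):
    """Classify a shot fired from the surface.

    `angle_deg` is the elevation above the local horizontal (0 to 90).
    Returns "escape", "orbit", or "impact". `tol` (meters) absorbs
    floating-point rounding when the lowest point of the orbit is the
    launch point itself.
    """
    energy = 0.5 * speed**2 - mu / radius  # specific orbital energy
    if energy >= 0.0:
        return "escape"
    # Specific angular momentum depends only on the horizontal component.
    h = radius * speed * math.cos(math.radians(angle_deg))
    ecc = math.sqrt(max(0.0, 1.0 + 2.0 * energy * h**2 / mu**2))
    periapsis = h**2 / (mu * (1.0 + ecc))  # closest approach to the center
    return "orbit" if periapsis >= radius - tol else "impact"


if __name__ == "__main__":
    v_circ = circular_speed()
    print(f"circular speed at the surface: {v_circ:.0f} m/s")
    print(f"escape speed at the surface:   {escape_speed():.0f} m/s")
    for speed, angle in [(v_circ, 0.0), (0.9 * v_circ, 0.0), (9000.0, 45.0)]:
        print(speed, angle, classify_shot(speed, angle))
```

Under these assumptions the circular speed at the surface comes out near 7.9 km/s and the escape speed near 11.2 km/s. The classification also shows why the angle matters: a bound shot fired at any upward angle follows an ellipse that passes back through the launch point and dips below the surface, so only a horizontal shot at or above circular speed stays in orbit.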

Three forms of AI safety research

The case is similar with AI research. AI capabilities work comes part and parcel with associated safety engineering tasks. Working today, an AI safety engineer might focus on making the internals of large classes of software more transparent and interpretable by humans. They might ensure that the system fails gracefully in the face of adversarial observations. They might design security protocols and early warning systems that help operators prevent or handle system failures.2

AI safety engineering is indispensable work, and it’s infeasible to separate safety engineering from capabilities engineering. Day-to-day safety work in aerospace engineering doesn’t rely on committees of ethicists peering over engineers’ shoulders. Some engineers will happen to spend their time on components of the system that are there for reasons of safety — such as failsafe mechanisms or fallback life-support — but safety engineering is an integral part of engineering for safety-critical systems, rather than a separate discipline.

In the domain of AI, target selection addresses the question: if one could build a powerful AI system, what should one use it for? The potential development of superintelligence raises a number of thorny questions in theoretical and applied ethics. Some of those questions can plausibly be resolved in the near future by moral philosophers and psychologists, and by the AI research community. Others will undoubtedly need to be left to the future. Stuart Russell goes so far as to predict that “in the future, moral philosophy will be a key industry sector.” We agree that this is an important area of study, but it is not the main focus of the Machine Intelligence Research Institute.

Researchers at MIRI focus on problems of AI alignment: the study of how in principle to direct a powerful AI system towards a specific goal. Where target selection is about the destination of the “rocket” (“what effects do we want AI systems to have on our civilization?”) and AI capabilities engineering is about getting the rocket to escape velocity (“how do we make AI systems powerful enough to help us achieve our goals?”), alignment is about knowing how to aim rockets towards particular celestial bodies (“assuming we could build highly capable AI systems, how would we direct them at our targets?”). Since our understanding of AI alignment is still at the “what is calculus?” stage, we ask questions analogous to “at what angle and velocity would we fire a cannonball to put it in a stable orbit, if Earth were perfectly spherical and had no atmosphere?”

Selecting promising AI alignment research paths is not a simple task. With the benefit of hindsight, it’s easy enough to say that early rocket alignment researchers should begin by inventing calculus and studying gravitation. For someone who doesn’t yet have a clear understanding of what “calculus” or “gravitation” are, however, choosing research topics might be quite a bit more difficult. The fruitful research directions would need to compete with fruitless ones, such as studying aether or Aristotelian physics; and which research programs are fruitless may not be obvious in advance.

Toward a theory of alignable agents

What are some plausible candidates for the role of “calculus” or “gravitation” in the field of AI?

Image credit: Brian Brondel

At MIRI, we currently focus on subjects such as good reasoning under deductive limitations (logical uncertainty), decision theories that work well even for agents embedded in large environments, and reasoning procedures that approve of the way they reason. This research often involves building toy models and studying problems under dramatic simplifications, analogous to assuming a perfectly spherical Earth with no atmosphere.

Developing theories of logical uncertainty isn’t what most people have in mind when they think of “AI safety research.” A natural thought here is to ask what specifically goes wrong if we don’t develop such theories. If an AI system can’t perform bounded reasoning in the domain of mathematics or logic, that doesn’t sound particularly “unsafe” — a system that needs to reason mathematically but can’t might be fairly useless, but it’s harder to see it becoming dangerous.

In our view, understanding logical uncertainty is important for helping us understand the systems we build well enough to justifiably conclude that they can be aligned in the first place. An analogous question in the case of rocket alignment might run: “If you don’t develop calculus, what bad thing happens to your rocket? Do you think the pilot will be struggling to make a course correction, and find that they simply can’t add up the tiny vectors fast enough?” The answer, though, isn’t that the pilot might struggle to correct their course, but rather that the trajectory that you thought led to the Moon takes the rocket wildly off course. The point of developing calculus is not to allow the pilot to make course corrections quickly; the point is to make it possible to discuss curved rocket trajectories in a world where the best tools available assume that rockets move in straight lines.
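To make the rocket half of the analogy concrete, the sketch below compares a straight-line prediction with a trajectory integrated under inverse-square gravity. Everything specific here is an assumption of mine for illustration: the starting altitude (about 200 km above an Earth-sized sphere), the speed, the ten-minute coast, and the function names. The integrator is a plain velocity-Verlet scheme, not a mission-grade propagator.

```python
import math

MU_EARTH = 3.986004418e14  # assumed: Earth's gravitational parameter, m^3/s^2


def straight_line(pos, vel, t):
    """Prediction from a model in which unpowered rockets move in straight lines."""
    return (pos[0] + vel[0] * t, pos[1] + vel[1] * t)


def coast_under_gravity(pos, vel, t, dt=1.0, mu=MU_EARTH):
    """Integrate 2D motion under inverse-square gravity (velocity Verlet)."""
    x, y = pos
    vx, vy = vel

    def accel(px, py):
        r3 = math.hypot(px, py) ** 3
        return (-mu * px / r3, -mu * py / r3)

    ax, ay = accel(x, y)
    for _ in range(int(round(t / dt))):
        # Half-step velocity update, full-step position update.
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        # Recompute acceleration at the new position, then finish the velocity step.
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
    return (x, y)


# Hypothetical scenario: coasting at roughly low-Earth-orbit altitude and speed.
start = (6.571e6, 0.0)   # meters from Earth's center
velocity = (0.0, 7800.0)  # meters per second, tangential
coast_time = 600.0        # seconds

naive = straight_line(start, velocity, coast_time)
curved = coast_under_gravity(start, velocity, coast_time)
miss = math.hypot(naive[0] - curved[0], naive[1] - curved[1])

if __name__ == "__main__":
    print(f"straight-line prediction: {naive}")
    print(f"gravity-curved position:  {curved}")
    print(f"disagreement after {coast_time:.0f} s: {miss / 1000:.0f} km")
```

In this toy scenario the two models disagree by more than 1,000 km after ten minutes of coasting. No amount of quick course correction by the pilot fixes that, because the error lies in the model used to plan the trajectory in the first place.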

The case is similar with logical uncertainty. The problem is not that we visualize a specific AI system encountering a catastrophic failure because it mishandles logical uncertainty. The problem is that our best existing tools for analyzing rational agency assume that those agents are logically omniscient, making our best theories incommensurate with our best practical AI designs.3

At this point, the goal of alignment research is not to solve particular engineering problems. The goal of early rocket alignment research would be to develop shared language and tools for generating and evaluating rocket trajectories, which will require developing calculus and celestial mechanics if they do not already exist. Similarly, the goal of AI alignment research is to develop shared language and tools for generating and evaluating methods by which powerful AI systems could be designed to act as intended.

One might worry that it is difficult to set benchmarks of success for alignment research. Is a Newtonian understanding of gravitation sufficient to attempt a Moon landing, or must one develop a complete theory of general relativity before believing that one can land softly on the Moon?4

In the case of AI alignment, there is at least one obvious benchmark to focus on initially. Imagine we possessed an incredibly powerful computer with access to the internet, an automated factory, and large sums of money. If we could program that computer to reliably achieve some simple goal (such as producing as much diamond as possible), then a large share of AI alignment research would be complete. This is because a large share of the problem lies in understanding autonomous systems that are stable, error-tolerant, and demonstrably aligned with some goal. Developing the ability to steer rockets in some direction with confidence is harder than developing the additional ability to steer rockets to a specific lunar location.

The pursuit of a goal such as this one is more or less MIRI’s approach to AI alignment research. We think of this as our version of the question, “Could you hit the Moon with a rocket if fuel and winds were no concern?” Answering that question, on its own, won’t ensure that smarter-than-human AI systems are aligned with our goals; but it would represent a major advance over our current knowledge, and it doesn’t look like the kind of basic insight that we can safely skip over.

What next?

Over the past year, we’ve seen a massive increase in attention towards the task of ensuring that future AI systems are robust and beneficial. AI safety work is being taken very seriously, and AI engineers are stepping up and acknowledging that safety engineering is not separable from capabilities engineering. It is becoming apparent that as the field of artificial intelligence matures, safety engineering will become a more and more firmly embedded part of AI culture. Meanwhile, new investigations of target selection and other safety questions will be showcased at an AI and Ethics workshop at AAAI-16, one of the larger annual conferences in the field.

A fourth variety of safety work is also receiving increased support: strategy research. If your nation is currently engaged in a cold war and locked in a space race, you may well want to consult with game theorists and strategists so as to ensure that your attempts to put a person on the Moon do not upset a delicate political balance and lead to a nuclear war.5 If international coalitions will be required in order to establish treaties regarding the use of space, then diplomacy may also become a relevant aspect of safety work. The same principles hold when it comes to AI, where coalition-building and global coordination may play an important role in the technology’s development and use.

Strategy research has been on the rise this year. AI Impacts is producing strategic analyses relevant to the designers of this potentially world-changing technology, and will soon be joined by the Strategic Artificial Intelligence Research Centre. The new Leverhulme Centre for the Future of Intelligence will be pulling together people across many different disciplines to study the social impact of AI, forging new collaborations. The Global Priorities Project, meanwhile, is analyzing what types of interventions might be most effective at ensuring positive outcomes from the development of powerful AI systems.

The field is moving fast, and these developments are quite exciting. Throughout it all, though, AI alignment research in particular still seems largely under-served.

MIRI is not the only group working on AI alignment; a handful of researchers from other organizations and institutions are also beginning to ask similar questions. MIRI’s particular approach to AI alignment research is by no means the only one available: when first thinking about how to put humans on the Moon, one might want to consider both rockets and space elevators. Regardless of who does the research or where they do it, it is important that alignment research receive attention.

Smarter-than-human AI systems may be many decades away, and they may not closely resemble any existing software. This limits our ability to identify productive safety engineering approaches. At the same time, the difficulty of specifying our values makes it difficult to identify productive research in moral theory. Alignment research has the advantage of being abstract enough to be potentially applicable to a wide variety of future computing systems, while being formalizable enough to admit of unambiguous progress. By prioritizing such work, therefore, we believe that the field of AI safety will be able to ground itself in technical work without losing sight of the most consequential questions in AI.

Safety engineering, moral theory, strategy, and general collaboration-building are all important parts of the project of developing safe and useful AI. On the whole, these areas look poised to thrive as a result of the recent rise in interest in long-term outcomes, and I’m thrilled to see more effort and investment going towards those important tasks.

The question is: What do we need to invest in next? The type of growth that I most want to see happen in the AI community next would be growth in AI alignment research, via the formation of new groups or organizations focused primarily on AI alignment and the expansion of existing AI alignment teams at MIRI, UC Berkeley, the Future of Humanity Institute at Oxford, and other institutions.

Before trying to land a rocket on the Moon, it’s important that we know how we would put a cannonball into a stable orbit. Absent a good theoretical understanding of rocket alignment, it might well be possible for a civilization to eventually reach escape velocity; but getting somewhere valuable and exciting and new, and getting there reliably, is a whole extra challenge.


My thanks to Eliezer Yudkowsky for introducing the idea behind this post, and to Lloyd Strohl III, Rob Bensinger, and others for helping review the content.


  1. Similarly, we could imagine a civilization that lives on the only planet in its solar system, or lives on a planet with perpetual cloud cover obscuring all objects except the Sun and Moon. Such a civilization might have an adequate understanding of terrestrial mechanics while lacking a model of celestial mechanics and lacking the knowledge that the same dynamical laws hold on Earth and in space. There would then be a gap in experts’ theoretical understanding of rocket alignment, distinct from gaps in their understanding of how to reach escape velocity. 
  2. Roman Yampolskiy has used the term “AI safety engineering” to refer to the study of AI systems that can provide proofs of their safety for external verification, including some theoretical research that we would term “alignment research.” His usage differs from the usage here. 
  3. Just as calculus is valuable both for building rockets that can reach escape velocity and for directing rockets towards specific lunar coordinates, a formal understanding of logical uncertainty might be useful both for improving AI capabilities and for improving the degree to which we can align powerful AI systems. The main motivation for studying logical uncertainty is that many other AI alignment problems are blocked on models of deductively limited reasoners, in the same way that trajectory-plotting could be blocked on models of curved paths. 
  4. In either case, of course, we wouldn’t want to put a moratorium on the space program while we wait for a unified theory of quantum mechanics and general relativity. We don’t need a perfect understanding of gravity. 
  5. This was a role historically played by the RAND Corporation.