# Safety engineering, target selection, and alignment theory

|   |  Analysis

Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more “robust and beneficial.” In this post, I distinguish three kinds of direct research that might be thought of as “AI safety” work: safety engineering, target selection, and alignment theory.

Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon?

In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints.

Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments.

A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn’t exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering.

Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We’ll use this as an analogy to illustrate MIRI’s research focus, the problem of artificial intelligence alignment.

# The need to scale MIRI’s methods

|   |  Analysis

Andrew Critch, one of the new additions to MIRI’s research team, has taken the opportunity of MIRI’s winter fundraiser to write on his personal blog about why he considers MIRI’s work important. Some excerpts:

Since a team of CFAR alumni banded together to form the Future of Life Institute (FLI), organized an AI safety conference in Puerto Rico in January of this year, co-authored the FLI research priorities proposal, and attracted 10MM of grant funding from Elon Musk, a lot of money has moved under the label “AI Safety” in the past year. Nick Bostrom’s Superintelligence was also a major factor in this amazing success story. A lot of wonderful work is being done under these grants, including a lot of proposals for solutions to known issues with AI safety, which I find extremely heartening. However, I’m worried that if MIRI doesn’t scale at least somewhat to keep pace with all this funding, it just won’t be spent nearly as well as it would have if MIRI were there to help. We have to remember that AI safety did not become mainstream by a spontaneous collective awakening. It was through years of effort on the part of MIRI and collaborators at FHI struggling to identify unknown unknowns about how AI might surprise us, and struggling further to learn to explain these ideas in enough technical detail that they might be adopted by mainstream research, which is finally beginning to happen. But what about the parts we’re wrong about? What about the sub-problems we haven’t identified yet, that might end up neglected in the mainstream the same way the whole problem was neglected 5 years ago? I’m glad the AI/ML community is more aware of these issues now, but I want to make sure MIRI can grow fast enough to keep this growing field on track. Now, you might think that now that other people are “on the issue”, it’ll work itself out. That might be so. But just because some of MIRI’s conclusions are now being widely adopted widely doesn’t mean its methodology is. The mental movement “Someone has pointed out this safety problem to me, let me try to solve it!” is very different from “Someone has pointed out this safety solution to me, let me try to see how it’s broken!” And that second mental movement is the kind that allowed MIRI to notice AI safety problems in the first place. Cybersecurity professionals seem to carry out this movement easily: security expert Bruce Schneier calls it the security mindset. The SANS institute calls it red teaming. Whatever you call it, AI/ML people are still more in maker-mode than breaker-mode, and are not yet, to my eye, identifying any new safety problems. I do think that different organizations should probably try different approaches to the AI safety problem, rather than perfectly copying MIRI’s approach and research agenda. But I think breaker-mode/security mindset does need to be a part of every approach to AI safety. And if MIRI doesn’t scale up to keep pace with all this new funding, I’m worried that the world is just about to copy-paste MIRI’s best-2014-impression of what’s important in AI safety, and leave behind the self-critical methodology that generated these ideas in the first place… which is a serious pitfall given all the unknown unknowns left in the field. See our funding drive post to help contribute or to learn more about our plans. For more about AI risk and security mindset, see also Luke Muehlhauser’s post on the topic. # Jed McCaleb on Why MIRI Matters | | Guest Posts This is a guest post by Jed McCaleb, one of MIRI’s top donors, for our winter fundraiser. A few months ago, several leaders in the scientific community signed an open letter pushing for oversight into the research and development of artificial intelligence, in order to mitigate the risks and ensure the societal benefit of the advanced technology. Researchers largely agree that AI is likely to begin outperforming humans on most cognitive tasks in this century. Similarly, I believe we’ll see the promise of human-level AI come to fruition much sooner than we’ve fathomed. Its effects will likely be transformational — for the better if it is used to help improve the human condition, or for the worse if it is used incorrectly. As AI agents become more capable, it becomes more important to analyze and verify their decisions and goals. MIRI’s focus is on how we can create highly reliable agents that can learn human values and the overarching need for better decision-making processes that power these new technologies. The past few years has seen a vibrant and growing AI research community. As the space continues to flourish, the need for collaboration will continue to grow as well. Organizations like MIRI that are dedicated to security and safety engineering help fill this need. And, as a nonprofit, its research is free from profit obligations. This independence in research is important because it will lead to safer and more neutral results. By supporting organizations like MIRI, we’re putting the safeguards in place to make sure that this immensely powerful technology is used for the greater good. For humanity’s benefit, we need to guarantee that AI systems can reliably pursue goals that are aligned with society’s human values. If organizations like MIRI are able to help engineer this level of technological advancement and awareness in AI systems, imagine the endless possibilities of how it can help improve our world. It’s critical that we put the infrastructure in place in order to ensure that AI will be used to make the lives of people better. This is why I’ve donated to MIRI, and why I believe it’s a worthy cause that you should consider as well. Jed McCaleb created eDonkey, one of the largest file-sharing networks of its time, as well as Mt. Gox, the first Bitcoin exchange. Recognizing that the world’s financial infrastructure is broken and that too many people are left without resources, he cofounded Stellar in 2014. Jed is also an advisor to MIRI. # OpenAI and other news | | News We’re only 11 days into December, and this month is shaping up to be a momentous one. On December 3, the University of Cambridge partnered with the University of Oxford, Imperial College London, and UC Berkeley to launch the Leverhulme Centre for the Future of Intelligence. The Cambridge Centre for the Study of Existential Risk (CSER) helped secure initial funding for the new independent center, in the form of a15M grant to be disbursed over ten years. CSER and Leverhulme CFI plan to collaborate closely, with the latter focusing on AI’s mid- and long-term social impact.

Meanwhile, the Strategic Artificial Intelligence Research Centre (SAIRC) is hiring its first research fellows in machine learning, policy analysis, and strategy research: details. SAIRC will function as an extension of two existing institutions: CSER, and the Oxford-based Future of Humanity Institute. As Luke Muehlhauser has noted, if you’re an AI safety “lurker,” now is an ideal time to de-lurk and get in touch.

MIRI’s research program is also growing quickly, with mathematician Scott Garrabrant joining our core team tomorrow. Our winter fundraiser is in full swing, and multiple matching opportunities have sprung up to bring us within a stone’s throw of our first funding target.

# MIRI’s 2015 Winter Fundraiser!

|   |  News

The Machine Intelligence Research Institute’s 2015 winter fundraising drive begins today, December 1! Our current progress:

The drive will run for the month of December, and will help support MIRI’s research efforts aimed at ensuring that smarter-than-human AI systems have a positive impact.

#### MIRI’s Research Focus

The field of AI has a goal of automating perception, reasoning, and decision-making — the many abilities we group under the label “intelligence.” Most leading researchers in AI expect our best AI algorithms to begin strongly outperforming humans this century in most cognitive tasks. In spite of this, relatively little time and effort has gone into trying to identify the technical prerequisites for making smarter-than-human AI systems safe and useful.

We believe that several basic theoretical questions will need to be answered in order to make advanced AI systems stable, transparent, and error-tolerant, and in order to specify correct goals for such systems. Our technical agenda describes what we think are the most important and tractable of these questions.

Smarter-than-human AI may be 50 years or more away. There are a number of reasons we nonetheless consider it important to begin work on these problems today:

• High capability ceilings — Humans appear to be nowhere near physical limits for cognitive ability, and even modest advantages in intelligence may yield decisive strategic advantages for AI systems.
• “Sorcerer’s Apprentice” scenarios — Smarter AI systems can come up with increasingly creative ways to meet programmed goals. The harder it is to anticipate how a goal will be achieved, the harder it is to specify the correct goal.
• Convergent instrumental goals — By default, highly capable decision-makers are likely to have incentives to treat human operators adversarially.
• AI speedup effects — Progress in AI is likely to accelerate as AI systems approach human-level proficiency in skills like software engineering.

We think MIRI is well-positioned to make progress on these problems for four reasons: our initial technical results have been promising (see our publications), our methodology has a good track record of working in the past (see MIRI’s Approach), we have already had a significant influence on the debate about long-run AI outcomes (see Assessing Our Past and Potential Impact), and we have an exclusive focus on these issues (see What Sets MIRI Apart?). MIRI is currently the only organization specializing in long-term technical AI safety research, and our independence from industry and academia allows us to effectively address gaps in other institutions’ research efforts.

#### General Progress This Year

In June, Luke Muehlhauser left MIRI for a research position at the Open Philanthropy Project. I replaced Luke as MIRI’s Executive Director, and I’m happy to say that the transition has gone well. We’ve split our time between technical research and academic outreach, running a workshop series aimed at introducing a wider scientific audience to our work and sponsoring a three-week summer fellows program aimed at training skills required to do groundbreaking theoretical research.

Our fundraiser this summer was our biggest to date. We raised a total of $631,957 from 263 distinct donors, smashing our previous funding drive record by over$200,000. Medium-sized donors stepped up their game to help us hit our first two funding targets: many more donors gave between $5,000 and$50,000 than in past fundraisers. Our successful fundraisers, workshops, and fellows program have allowed us to ramp up our growth substantially, and have already led directly to several new researcher hires.

2015 has been an astounding year for AI safety engineering. In January, the Future of Life Institute brought together the leading organizations studying long-term AI risk and top AI researchers in academia and industry for a “Future of AI” conference in San Juan, Puerto Rico. Out of this conference came a widely endorsed open letter, accompanied by a research priorities document drawing heavily on MIRI’s work. Two prominent AI scientists who helped organize the event, Stuart Russell and Bart Selman, have since become MIRI research advisors (in June and July, respectively). The conference also resulted in an AI safety grants program, with MIRI receiving some of the largest grants.

In addition to the FLI conference, we’ve spoken this year at AAAI-15, AGI-15, LORI 2015, EA Global, the American Physical Society, and the leading U.S. science and technology think tank, ITIF. We also co-organized a decision theory conference at Cambridge University and ran a ten-week seminar series at UC Berkeley.

Three new full-time research fellows have joined our team this year: Patrick LaVictoire in March, Jessica Taylor in August, and Andrew Critch in September. Scott Garrabrant will become our newest research fellow this month, after having made major contributions as a workshop attendee and research associate.

Meanwhile, our two new research interns, Kaya Stechly and Rafael Cosman, have been going through old results and consolidating and polishing material into new papers; and three of our new research associates, Vadim Kosoy, Abram Demski, and Tsvi Benson-Tilsen, have been producing a string of promising results on our research forum. Another intern, Jack Gallagher, contributed to our type theory project over the summer.

To accommodate our growing team, we’ve recently hired a new office manager, Andrew Lapinski-Barker, and will be moving into a larger office space this month. On the whole, I’m very pleased with our new academic collaborations, outreach efforts, and growth.

#### Research Progress This Year

As our research projects and collaborations have multiplied, we’ve made more use of online mechanisms for quick communication and feedback between researchers. In March, we launched the Intelligent Agent Foundations Forum, a discussion forum for AI alignment research. Many of our subsequent publications have been developed from material on the forum, beginning with Patrick LaVictoire’s “An introduction to Löb’s theorem in MIRI’s research.”

We have also produced a number of new papers in 2015 and, most importantly, arrived at new research insights.

In July, we revised our primary technical agenda paper for 2016 publication. Our other new publications and results can be categorized by their place in the research agenda:

We’ve been exploring new approaches to the problems of naturalized induction and logical uncertainty, with early results published in various venues, including Fallenstein et al.’s “Reflective oracles” (presented in abridged form at LORI 2015) and “Reflective variants of Solomonoff induction and AIXI” (presented at AGI-15), and Garrabrant et al.’s “Asymptotic logical uncertainty and the Benford test” (available on arXiv). We also published the overview papers “Formalizing two problems of realistic world-models” and “Questions of reasoning under logical uncertainty.”

In decision theory, Patrick LaVictoire and others have developed new results pertaining to bargaining and division of trade gains, using the proof-based decision theory framework (example). Meanwhile, the team has been developing a better understanding of the strengths and limitations of different approaches to decision theory, an effort spearheaded by Eliezer Yudkowsky, Benya Fallenstein, and me, culminating in some insights that will appear in a paper next year. Andrew Critch has proved some promising results about bounded versions of proof-based decision-makers, which will also appear in an upcoming paper. Additionally, we presented a shortened version of our overview paper at AGI-15.

In Vingean reflection, Benya Fallenstein and Research Associate Ramana Kumar collaborated on “Proof-producing reflection for HOL” (presented at ITP 2015) and have been working on an FLI-funded implementation of reflective reasoning in the HOL theorem prover. Separately, the reflective oracle framework has helped us gain a better understanding of what kinds of reflection are and are not possible, yielding some nice technical results and a few insights that seem promising. We also published the overview paper “Vingean reflection.”

Jessica Taylor, Benya Fallenstein, and Eliezer Yudkowsky have focused on error tolerance on and off throughout the year. We released Taylor’s “Quantilizers” (accepted to a workshop at AAAI-16) and presented the paper “Corrigibility” at a AAAI-15 workshop.

In value specification, we published the AAAI-15 workshop paper “Concept learning for safe autonomous AI” and the overview paper “The value learning problem.” With support from an FLI grant, Jessica Taylor is working on better formalizing subproblems in this area, and has recently begun writing up her thoughts on this subject on the research forum.

Lastly, in forecasting and strategy, we published “Formalizing convergent instrumental goals” (accepted to a AAAI-16 workshop) and two historical case studies: “The Asilomar Conference” and “Leó Szilárd and the danger of nuclear weapons.” Many other strategic analyses have been posted to the recently revamped AI Impacts site, where Katja Grace has been publishing research about patterns in technological development.

#### Fundraiser Targets and Future Plans

Like our last fundraiser, this will be a non-matching fundraiser with multiple funding targets our donors can choose between to help shape MIRI’s trajectory. Our successful summer fundraiser has helped determine how ambitious we’re making our plans; although we may still slow down or accelerate our growth based on our fundraising performance, our current plans assume a budget of roughly $1,825,000 per year. Of this, about$100,000 is being paid for in 2016 through FLI grants, funded by Elon Musk and the Open Philanthropy Project. The rest depends on our fundraising and grant-writing success. We have a twelve-month runway as of January 1, which we would ideally like to extend.

Taking all of this into account, our winter funding targets are:

Target 1 — $150k: Holding steady. At this level, we would have enough funds to maintain our runway in early 2016 while continuing all current operations, including running workshops, writing papers, and attending conferences. Target 2 —$450k: Maintaining MIRI’s growth rate. At this funding level, we would be much more confident that our new growth plans are sustainable, and we would be able to devote more attention to academic outreach. We would be able to spend less staff time on fundraising in the coming year, and might skip our summer fundraiser.

Target 3 — $1M: Bigger plans, faster growth. At this level, we would be able to substantially increase our recruiting efforts and take on new research projects. It would be evident that our donors’ support is stronger than we thought, and we would move to scale up our plans and growth rate accordingly. Target 4 —$6M: A new MIRI. At this point, MIRI would become a qualitatively different organization. With this level of funding, we would be able to diversify our research initiatives and begin branching out from our current agenda into alternative angles of attack on the AI alignment problem.

Our projected spending over the next twelve months, excluding earmarked funds for the independent AI Impacts project, breaks down as follows:

Our largest cost ($700,000) is in wages and benefits for existing research staff and contracted researchers, including research associates. Our current priority is to further expand the team. We expect to spend an additional$150,000 on salaries and benefits for new research staff in 2016, but that number could go up or down significantly depending on when new research fellows begin work:

• Mihály Bárász, who was originally slated to begin in November 2015, has delayed his start date due to unexpected personal circumstances. He plans to join the team in 2016.
• We are recruiting a specialist for our type theory in type theory project, which is aimed at developing simple programmatic models of reflective reasoners. Interest in this topic has been increasing recently, which is exciting; but the basic tools needed for our work are still missing. If you have programmer or mathematician friends who are interested in dependently typed programming languages and MIRI’s work, you can send them our application form.
• We are considering several other possible additions to the research team.

Much of the rest of our budget goes into fixed costs that will not need to grow much as we expand the research team. This includes $475,000 for administrator wages and benefits and$250,000 for costs of doing business. Our main cost of doing business is renting office space (slightly over $100,000). Note that the boundaries between these categories are sometimes fuzzy. For example, my salary is included in the admin staff category, despite the fact that I spend some of my time on technical research (and hope to increase that amount in 2016). Our remaining budget goes into organizing or sponsoring research events, such as fellows programs, MIRIx events, or workshops ($250,000). Some activities (e.g., traveling to conferences) are aimed at sharing our work with the larger academic community. Others, such as researcher retreats, are focused on solving open problems in our research agenda. After experimenting with different types of research staff retreat in 2015, we’re beginning to settle on a model that works well, and we’ll be running a number of retreats throughout 2016.

In past years, we’ve generally raised \$1M per year, and spent a similar amount. Thanks to substantial recent increases in donor support, however, we’re in a position to scale up significantly.

Our donors blew us away with their support in our last fundraiser. If we can continue our fundraising and grant successes, we’ll be able to sustain our new budget and act on the unique opportunities outlined in Why Now Matters, helping set the agenda and build the formal tools for the young field of AI safety engineering. And if our donors keep stepping up their game, we believe we have the capacity to scale up our program even faster. We’re thrilled at this prospect, and we’re enormously grateful for your support.

# New paper: “Quantilizers”

|   |  Papers

MIRI Research Fellow Jessica Taylor has written a new paper on an error-tolerant framework for software agents, “Quantilizers: A safer alternative to maximizers for limited optimization.” Taylor’s paper will be presented at the AAAI-16 AI, Ethics and Society workshop. The abstract reads:

In the field of AI, expected utility maximizers are commonly used as a model for idealized agents. However, expected utility maximization can lead to unintended solutions when the utility function does not quantify everything the operators care about: imagine, for example, an expected utility maximizer tasked with winning money on the stock market, which has no regard for whether it accidentally causes a market crash. Once AI systems become sufficiently intelligent and powerful, these unintended solutions could become quite dangerous. In this paper, we describe an alternative to expected utility maximization for powerful AI systems, which we call expected utility quantilization. This could allow the construction of AI systems that do not necessarily fall into strange and unanticipated shortcuts and edge cases in pursuit of their goals.

Expected utility quantilization is the approach of selecting a random action in the top n% of actions from some distribution γ, sorted by expected utility. The distribution γ might, for example, be a set of actions weighted by how likely a human is to perform them. A quantilizer based on such a distribution would behave like a compromise between a human and an expected utility maximizer. The agent’s utility function directs it toward intuitively desirable outcomes in novel ways, making it potentially more useful than a digitized human; while γ directs it toward safer and more predictable strategies.

Quantilization is a formalization of the idea of “satisficing,” or selecting actions that achieve some minimal threshold of expected utility. Agents that try to pick good strategies, but not maximally good ones, seem less likely to come up with extraordinary and unconventional strategies, thereby reducing both the benefits and the risks of smarter-than-human AI systems. Designing AI systems to satisfice looks especially useful for averting harmful convergent instrumental goals and perverse instantiations of terminal goals:

• If we design an AI system to cure cancer, and γ labels it bizarre to reduce cancer rates by increasing the rate of some other terminal illness, them a quantilizer will be less likely to adopt this perverse strategy even if our imperfect specification of the system’s goals gave this strategy high expected utility.
• If superintelligent AI systems have a default incentive to seize control of resources, but γ labels these policies bizarre, then a quantilizer will be less likely to converge on these strategies.

Taylor notes that the quantilizing approach to satisficing may even allow us to disproportionately reap the benefits of maximization without incurring proportional costs, by specifying some restricted domain in which the quantilizer has low impact without requiring that it have low impact overall — “targeted-impact” quantilization.

One obvious objection to the idea of satisficing is that a satisficing agent might build an expected utility maximizer. Maximizing, after all, can be an extremely effective way to satisfice. Quantilization can potentially avoid this objection: maximizing and quantilizing may both be good ways to satisfice, but maximizing is not necessarily an effective way to quantilize. A quantilizer that deems the act of delegating to a maximizer “bizarre” will avoid delegating its decisions to an agent even if that agent would maximize the quantilizer’s expected utility.

Taylor shows that the cost of relying on a 0.1-quantilizer (which selects a random action from the top 10% of actions), on expectation, is no more than 10 times that of relying on the recommendation of its distribution γ; the expected cost of relying on a 0.01-quantilizer (which selects from the top 1% of actions) is no more than 100 times that of relying on γ; and so on. Quantilization is optimal among the set of strategies that are low-cost in this respect.

However, expected utility quantilization is not a magic bullet. It depends strongly on how we specify the action distribution γ, and Taylor shows that ordinary quantilizers behave poorly in repeated games and in scenarios where “ordinary” actions in γ tend to have very high or very low expected utility. Further investigation is needed to determine if quantilizers (or some variant on quantilizers) can remedy these problems.