# Ensuring smarter-than-human intelligence has a positive outcome

Analysis, Video

I recently gave a talk at Google on the problem of aligning smarter-than-human AI with operators’ goals:

The talk was inspired by “AI Alignment: Why It’s Hard, and Where to Start,” and serves as an introduction to the subfield of alignment research in AI. A modified transcript follows.

Talk outline (slides):

1. Overview

# Decisions are for making bad outcomes inconsistent

Conversations

Nate Soares’ recent decision theory paper with Ben Levinstein, “Cheating Death in Damascus,” prompted some valuable questions and comments from an acquaintance (anonymized here). I’ve put together edited excerpts from the commenter’s email below, with Nate’s responses.

The discussion concerns functional decision theory (FDT), a newly proposed alternative to causal decision theory (CDT) and evidential decision theory (EDT). Where EDT says “choose the most auspicious action” and CDT says “choose the action that has the best effects,” FDT says “choose the output of one’s decision algorithm that has the best effects across all instances of that algorithm.”
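The three slogans above can be written side by side in one common notation. This is a sketch, not a quotation from the paper: the symbols are assumptions introduced here, with $U$ the utility, $A$ the agent's action, $\mathrm{do}(\cdot)$ a Pearl-style intervention, and $\mathrm{FDT}(P, G)$ the output of the agent's decision function given its beliefs $P$ and a model $G$ of the decision problem.

$$
\begin{aligned}
\text{EDT:}\quad & \arg\max_{a}\ \mathbb{E}\left[\,U \mid A = a\,\right] \\
\text{CDT:}\quad & \arg\max_{a}\ \mathbb{E}\left[\,U \mid \mathrm{do}(A = a)\,\right] \\
\text{FDT:}\quad & \arg\max_{a}\ \mathbb{E}\left[\,U \mid \mathrm{do}(\mathrm{FDT}(P, G) = a)\,\right]
\end{aligned}
$$

Read this way, the only difference between CDT and FDT is the node being intervened on: the physical action in one case, the output of the shared algorithm in the other.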

FDT usually behaves similarly to CDT. In a one-shot prisoner’s dilemma between two agents who know they are following FDT, however, FDT parts ways with CDT and prescribes cooperation, on the grounds that each agent runs the same decision-making procedure, and that therefore each agent is effectively choosing for both agents at once.[1]
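A minimal sketch of that contrast, assuming a conventional payoff table (the numbers, and the function names `cdt_choice` and `fdt_choice`, are hypothetical and chosen only for illustration):

```python
# Hypothetical payoffs for the agent making the choice, indexed by
# (own action, other agent's action). Only the ordering 5 > 3 > 1 > 0 matters.
PAYOFF = {
    ("C", "C"): 3,  # mutual cooperation
    ("C", "D"): 0,  # cooperate while the other defects
    ("D", "C"): 5,  # defect while the other cooperates
    ("D", "D"): 1,  # mutual defection
}
ACTIONS = ("C", "D")


def cdt_choice(p_other_cooperates):
    """Treat the other agent's action as causally fixed and best-respond to it."""
    def expected_utility(action):
        return (p_other_cooperates * PAYOFF[(action, "C")]
                + (1 - p_other_cooperates) * PAYOFF[(action, "D")])
    return max(ACTIONS, key=expected_utility)


def fdt_choice():
    """Both agents run this same procedure, so its output fixes both actions."""
    return max(ACTIONS, key=lambda action: PAYOFF[(action, action)])


if __name__ == "__main__":
    print("CDT, whatever it believes about the other agent:",
          {p / 10: cdt_choice(p / 10) for p in range(0, 11, 5)})
    print("FDT:", fdt_choice())
```

Under these assumed payoffs, defection is the better causal response to any fixed belief about the other agent, while comparing only the two outcomes in which both agents act alike favors cooperation.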

Below, Nate provides some of his own perspective on why FDT generally achieves higher utility than CDT and EDT. Some of the stances he sketches out here are stronger than the assumptions needed to justify FDT, but should shed some light on why researchers at MIRI think FDT can help resolve a number of longstanding puzzles in the foundations of rational action.

Anonymous: This is great stuff! I’m behind on reading loads of papers and books for my research, but this came across my path and hooked me, which speaks highly of how interesting the content is and of the sense that this paper is making progress.

My general take is that you are right that these kinds of problems need to be specified in more detail. However, my guess is that once you do so, game theorists would get the right answer. Perhaps that’s what FDT is: it’s an approach to clarifying ambiguous games that leads to a formalism where people like Pearl and myself can use our standard approaches to get the right answer.

I know there’s a lot of inertia in the “decision theory” language, so probably it doesn’t make sense to change. But if there were no such sunk costs, I would recommend a different framing. It’s not that people’s decision theories are wrong; it’s that they are unable to correctly formalize problems in which there are high-performance predictors. You show how to do that, using the idea of intervening on (i.e., choosing between putative outputs of) the algorithm, rather than intervening on actions. Everything else follows from a sufficiently precise and non-contradictory statement of the decision problem.

Probably the easiest move this line of work could make to ease this knee-jerk response of mine in defense of mainstream Bayesian game theory is to just be clear that CDT is not meant to capture mainstream Bayesian game theory. Rather, it is a model of one response to a class of problems not normally considered and for which existing approaches are ambiguous.

Nate Soares: I don’t take this view myself. My view is more like: When you add accurate predictors to the Rube Goldberg machine that is the universe — which can in fact be done — the future of that universe can be determined by the behavior of the algorithm being predicted. The algorithm that we put in the “thing-being-predicted” slot can do significantly better if its reasoning on the subject of which actions to output respects the universe’s downstream causal structure (which is something CDT and FDT do, but which EDT neglects), and it can do better again if its reasoning also respects the world’s global logical structure (which is done by FDT alone).

We don’t know exactly how to respect this wider class of dependencies in general yet, but we do know how to do it in many simple cases. While FDT agrees with modern decision theory and game theory in many simple situations, its prescriptions do seem to differ in non-trivial applications.

The main case where we can easily see that FDT is not just a better tool for formalizing game theorists’ traditional intuitions is in prisoner’s dilemmas. Game theory is pretty adamant about the fact that it’s rational to defect in a one-shot PD, whereas two FDT agents facing off in a one-shot PD will cooperate.

In particular, classical game theory employs a “common knowledge of shared rationality” assumption which, when you look closely at it, cashes out more or less as “common knowledge that all parties are using CDT and this axiom.” Game theory where common knowledge of shared rationality is defined to mean “common knowledge that all parties are using FDT and this axiom” gives substantially different results, such as cooperation in one-shot PDs.
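As a rough illustration of how much the choice of axiom matters, the sketch below uses pure-strategy Nash equilibrium as a stand-in for "common knowledge that all parties are using CDT," and a restriction to outcomes where both players act alike as a stand-in for "common knowledge that all parties are using FDT." Both stand-ins, the payoff numbers, and the function names are assumptions made for this example, not definitions taken from the discussion above.

```python
from itertools import product

# Hypothetical one-shot PD payoffs:
# (row action, column action) -> (row utility, column utility)
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}
ACTIONS = ("C", "D")


def pure_nash_equilibria():
    """Profiles where neither player gains by unilaterally switching actions."""
    equilibria = []
    for row, col in product(ACTIONS, repeat=2):
        row_is_best = all(PAYOFF[(row, col)][0] >= PAYOFF[(alt, col)][0]
                          for alt in ACTIONS)
        col_is_best = all(PAYOFF[(row, col)][1] >= PAYOFF[(row, alt)][1]
                          for alt in ACTIONS)
        if row_is_best and col_is_best:
            equilibria.append((row, col))
    return equilibria


def shared_algorithm_outcome():
    """If both players are known to run one procedure, only matching profiles
    are reachable; pick the best of those."""
    best = max(ACTIONS, key=lambda action: PAYOFF[(action, action)][0])
    return (best, best)


if __name__ == "__main__":
    print("Pure Nash equilibria:", pure_nash_equilibria())
    print("Shared-algorithm outcome:", shared_algorithm_outcome())
```

The game and the payoffs are identical in both functions; only the assumption about what the players know about each other's reasoning changes.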

1. CDT prescribes defection in this dilemma, on the grounds that one’s action cannot cause the other agent to cooperate. FDT outperforms CDT in Newcomblike dilemmas like these, while also outperforming EDT in other dilemmas, such as the smoking lesion problem and XOR blackmail.

# April 2017 Newsletter

Our newest publication, “Cheating Death in Damascus,” makes the case for functional decision theory, our general framework for thinking about rational choice and counterfactual reasoning. In other news, our research team is expanding! Sam Eisenstat and Marcello Herreshoff, both previously at Google, join MIRI this month.

**Research updates**

- New at IAFF: “Formal Open Problem in Decision Theory”
- New at AI Impacts: “Trends in Algorithmic Progress”; “Progress in General-Purpose Factoring”
- We ran a weekend workshop on agent foundations and AI safety.

**General updates**

- Our annual review covers our research progress, fundraiser outcomes, and other take-aways from 2016.
- We attended the Colloquium on Catastrophic and Existential Risk.
- Nate Soares weighs in on the Future of Life Institute’s Risk Principle.
- “Elon Musk’s Billion-Dollar Crusade to Stop the AI Apocalypse” features quotes from Eliezer Yudkowsky, Demis Hassabis, Mark Zuckerberg, Peter Thiel, Stuart Russell, and others.

**News and links**

- The Open Philanthropy Project and OpenAI begin a partnership: Holden Karnofsky joins Elon Musk and Sam Altman on OpenAI’s Board of Directors, and Open Philanthropy contributes $30M to OpenAI’s research program.
- Open Philanthropy has also awarded $2M to the Future of Humanity Institute.
- Modeling Agents with Probabilistic Programs: a new book by Owain Evans, Andreas Stuhlmüller, John Salvatier, and Daniel Filan.
- New from OpenAI: “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”; “Learning to Communicate”; “One-Shot Imitation Learning”; and from Paul Christiano, “Benign Model-Free RL.”
- Chris Olah and Shan Carter discuss research debt as an obstacle to clear thinking and the transmission of ideas, and propose Distill as a solution.
- Andrew Trask proposes encrypting deep learning algorithms during training.
- Roman Yampolskiy seeks submissions for a book on AI safety and security.
- 80,000 Hours has updated their problem profile on positively shaping the development of AI, a solid introduction to AI risk, which 80K now ranks as the most urgent problem in the world. See also 80K’s write-up on in-demand skill sets at effective altruism organizations.

# Two new researchers join MIRI

News

MIRI’s research team is growing! I’m happy to announce that we’ve hired two new research fellows to contribute to our work on AI alignment: Sam Eisenstat and Marcello Herreshoff, both from Google.

Sam Eisenstat studied pure mathematics at the University of Waterloo, where he carried out research in mathematical logic. His previous work was on the automatic construction of deep learning models at Google.

Sam’s research focus is on questions relating to the foundations of reasoning and agency, and he is especially interested in exploring analogies between current theories of logical uncertainty and Bayesian reasoning. He has also done work on decision theory and counterfactuals. His past work with MIRI includes “Asymptotic Decision Theory,” “A Limit-Computable, Self-Reflective Distribution,” and “A Counterexample to an Informal Conjecture on Proof Length and Logical Counterfactuals.”

Marcello Herreshoff studied at Stanford, receiving a B.S. in Mathematics with Honors and getting two honorable mentions in the Putnam Competition, the world’s most highly regarded university-level math competition. Marcello then spent five years as a software engineer at Google, gaining a background in machine learning.

Marcello is one of MIRI’s earliest research collaborators, and attended our very first research workshop alongside Eliezer Yudkowsky, Paul Christiano, and Mihály Bárász. Marcello has worked with us in the past to help produce results such as “Program Equilibrium in the Prisoner’s Dilemma via Löb’s Theorem,” “Definability of Truth in Probabilistic Logic,” and “Tiling Agents for Self-Modifying AI.” His research interests include logical uncertainty and the design of reflective agents.

Sam and Marcello will be starting with us in the first two weeks of April. This marks the beginning of our first wave of new research fellowships since 2015, though we more recently added Ryan Carey to the team on an assistant research fellowship (in mid-2016).

We have additional plans to expand our research team in the coming months, and will soon be hiring for a more diverse set of technical roles at MIRI — details forthcoming!

# 2016 in review

MIRI Strategy

It’s time again for my annual review of MIRI’s activities.[1] In this post I’ll provide a summary of what we did in 2016, see how our activities compare to our previously stated goals and predictions, and reflect on how our strategy this past year fits into our mission as an organization. We’ll be following this post up in April with a strategic update for 2017.

After doubling the size of the research team in 2015,[2] we slowed our growth in 2016 and focused on integrating the new additions into our team, making research progress, and writing up a backlog of existing results.

2016 was a big year for us on the research front, with our new researchers making some of the most notable contributions. Our biggest news was Scott Garrabrant’s logical inductors framework, which represents by a significant margin our largest progress to date on the problem of logical uncertainty. We additionally released “Alignment for Advanced Machine Learning Systems” (AAMLS), a new technical agenda spearheaded by Jessica Taylor.

We also spent this last year engaging more heavily with the wider AI community, e.g., through the month-long Colloquium Series on Robust and Beneficial Artificial Intelligence we co-ran with the Future of Humanity Institute, and through talks and participation in panels at many events through the year.

1. See our previous reviews: 2015, 2014, 2013
2. From 2015 in review: “Patrick LaVictoire joined in March, Jessica Taylor in August, Andrew Critch in September, and Scott Garrabrant in December. With Nate transitioning to a non-research role, overall we grew from a three-person research team (Eliezer, Benya, and Nate) to a six-person team.”

# New paper: “Cheating Death in Damascus”

Papers

MIRI Executive Director Nate Soares and Rutgers/UIUC decision theorist Ben Levinstein have a new paper out introducing functional decision theory (FDT), MIRI’s proposal for a general-purpose decision theory.

The paper, titled “Cheating Death in Damascus,” considers a wide range of decision problems. In every case, Soares and Levinstein show that FDT outperforms all earlier theories in utility gained. The abstract reads:

> Evidential and Causal Decision Theory are the leading contenders as theories of rational action, but both face fatal counterexamples. We present some new counterexamples, including one in which the optimal action is causally dominated. We also present a novel decision theory, Functional Decision Theory (FDT), which simultaneously solves both sets of counterexamples.
>
> Instead of considering which physical action of theirs would give rise to the best outcomes, FDT agents consider which output of their decision function would give rise to the best outcome. This theory relies on a notion of subjunctive dependence, where multiple implementations of the same mathematical function are considered (even counterfactually) to have identical results for logical rather than causal reasons. Taking these subjunctive dependencies into account allows FDT agents to outperform CDT and EDT agents in, e.g., the presence of accurate predictors. While not necessary for considering classic decision theory problems, we note that a full specification of FDT will require a non-trivial theory of logical counterfactuals and algorithmic similarity.

“Death in Damascus” is a standard decision-theoretic dilemma. In it, a trustworthy predictor (Death) promises to find you and bring your demise tomorrow, whether you stay in Damascus or flee to Aleppo. Fleeing to Aleppo is costly and provides no benefit, since Death, having predicted your future location, will then simply come for you in Aleppo instead of Damascus.

In spite of this, causal decision theory often recommends fleeing to Aleppo — for much the same reason it recommends defecting in the one-shot twin prisoner’s dilemma and two-boxing in Newcomb’s problem. CDT agents reason that Death has already made its prediction, and that switching cities therefore can’t cause Death to learn your new location. Even though the CDT agent recognizes that Death is inescapable, the CDT agent’s decision rule forbids taking this fact into account in reaching decisions. As a consequence, the CDT agent will happily give up arbitrary amounts of utility in a pointless flight from Death.
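The reasoning in the two paragraphs above can be sketched numerically. Everything specific here is an assumption made for illustration: the utility of surviving, the cost of fleeing, the agent's credence that Death is waiting in Damascus, and the function names.

```python
LIFE_VALUE = 1000.0  # hypothetical utility of surviving tomorrow
FLEE_COST = 1.0      # hypothetical cost of travelling to Aleppo


def cdt_expected_utilities(p_death_in_damascus):
    """CDT holds Death's already-made prediction fixed across both actions."""
    p = p_death_in_damascus
    return {
        "stay": (1 - p) * LIFE_VALUE,       # survive only if Death is in Aleppo
        "flee": p * LIFE_VALUE - FLEE_COST,  # survive only if Death is in Damascus
    }


def fdt_expected_utilities():
    """Death's prediction tracks the output of the agent's decision procedure,
    so Death is waiting wherever that procedure sends the agent."""
    return {"stay": 0.0, "flee": -FLEE_COST}


def best(utilities):
    """Return the action with the highest expected utility."""
    return max(utilities, key=utilities.get)


def realized_utility(action):
    """What actually happens when Death's prediction is accurate."""
    return -FLEE_COST if action == "flee" else 0.0


if __name__ == "__main__":
    # Assumed credence: the agent is in Damascus and expects Death to look there.
    cdt_action = best(cdt_expected_utilities(0.9))
    fdt_action = best(fdt_expected_utilities())
    print("CDT:", cdt_action, "-> realized utility", realized_utility(cdt_action))
    print("FDT:", fdt_action, "-> realized utility", realized_utility(fdt_action))
```

With a credence below one half the same CDT calculation would recommend staying, which is consistent with the text's "often" rather than "always"; in either case the calculation ignores the fact that the prediction will match whatever the agent ends up doing.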

Causal decision theory fails in Death in Damascus, Newcomb’s problem, and the twin prisoner’s dilemma — and also in the “random coin,” “Death on Olympus,” “asteroids,” and “murder lesion” dilemmas described in the paper — because its counterfactuals only track its actions’ causal impact on the world, and not the rest of the world’s causal (and logical, etc.) structure.

While evidential decision theory succeeds in these dilemmas, it fails in a new decision problem, “XOR blackmail.”[1] FDT consistently outperforms both of these theories, providing an elegant account of normative action for the full gamut of known decision problems.

1. Just as the variants on Death in Damascus in Soares and Levinstein’s paper help clarify CDT’s particular point of failure, XOR blackmail drills down more exactly on EDT’s failure point than past decision problems have. In particular, EDT cannot be modified to avoid XOR blackmail in the ways it can be modified to smoke in the smoking lesion problem.

# March 2017 Newsletter

**Research updates**

- New at IAFF: Some Problems with Making Induction Benign; Entangled Equilibria and the Twin Prisoners’ Dilemma; Generalizing Foundations of Decision Theory
- New at AI Impacts: Changes in Funding in the AI Safety Field; Funding of AI Research
- MIRI Research Fellow Andrew Critch has started a two-year stint at UC Berkeley’s Center for Human-Compatible AI, helping launch the research program there.
- “Using Machine Learning to Address AI Risk”: Jessica Taylor explains our AAMLS agenda (in video and blog versions) by walking through six potential problems with highly performing ML systems.

**General updates**

- Why AI Safety?: A quick summary (originally posted during our fundraiser) of the case for working on AI risk, including notes on distinctive features of our approach and our goals for the field.
- Nate Soares attended “Envisioning and Addressing Adverse AI Outcomes,” an event pitting red-team attackers against defenders in a variety of AI risk scenarios.
- We also attended an AI safety strategy retreat run by the Center for Applied Rationality.

**News and links**

- Ray Arnold provides a useful list of ways the average person can help with AI safety.
- New from OpenAI: attacking machine learning with adversarial examples.
- OpenAI researcher Paul Christiano explains his view of human intelligence:

  > I think of my brain as a machine driven by a powerful reinforcement learning agent. The RL agent chooses what thoughts to think, which memories to store and retrieve, where to direct my attention, and how to move my muscles. The “I” who speaks and deliberates is implemented by the RL agent, but is distinct and has different beliefs and desires. My thoughts are outputs and inputs to the RL agent, they are not what the RL agent “feels like from the inside.”

- Christiano describes three directions and desiderata for AI control: reliability and robustness, reward learning, and deliberation and amplification.
- Sarah Constantin argues that existing techniques won’t scale up to artificial general intelligence absent major conceptual breakthroughs.
- The Future of Humanity Institute and the Centre for the Study of Existential Risk ran a “Bad Actors and AI” workshop.
- FHI is seeking interns in reinforcement learning and AI safety.
- Michael Milford argues against brain-computer interfaces as an AI risk strategy.
- Open Philanthropy Project head Holden Karnofsky explains why he sees fewer benefits to public discourse than he used to.

# Using machine learning to address AI risk

Analysis, Video

At the EA Global 2016 conference, I gave a talk on “Using Machine Learning to Address AI Risk”:

> It is plausible that future artificial general intelligence systems will share many qualities in common with present-day machine learning systems. If so, how could we ensure that these systems robustly act as intended? We discuss the technical agenda for a new project at MIRI focused on this question.

A recording of my talk is now up online:

The talk serves as a quick survey (for a general audience) of the kinds of technical problems we’re working on under the “Alignment for Advanced ML Systems” research agenda. Included below is a version of the talk in blog post form.[1]

Talk outline:

1. I also gave a version of this talk at the MIRI/FHI Colloquium on Robust and Beneficial AI.