November 2017 Newsletter


Eliezer Yudkowsky has written a new book on civilizational dysfunction and outperformance: Inadequate Equilibria: Where and How Civilizations Get Stuck. The full book will be available in print and electronic formats November 16. To preorder the ebook or sign up for updates, visit

We’re posting the full contents online in stages over the next two weeks. The first two chapters are:

  1. Inadequacy and Modesty (discussion: LessWrong, EA Forum, Hacker News)
  2. An Equilibrium of No Free Energy (discussion: LessWrong, EA Forum)


Research updates

General updates

News and links

New paper: “Functional Decision Theory”



MIRI senior researcher Eliezer Yudkowsky and executive director Nate Soares have a new introductory paper out on decision theory: “Functional decision theory: A new theory of instrumental rationality.”


This paper describes and motivates a new decision theory known as functional decision theory (FDT), as distinct from causal decision theory and evidential decision theory.

Functional decision theorists hold that the normative principle for action is to treat one’s decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?” Adhering to this principle delivers a number of benefits, including the ability to maximize wealth in an array of traditional decision-theoretic and game-theoretic problems where CDT and EDT perform poorly. Using one simple and coherent decision rule, functional decision theorists (for example) achieve more utility than CDT on Newcomb’s problem, more utility than EDT on the smoking lesion problem, and more utility than both in Parfit’s hitchhiker problem.
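To make the Newcomb’s problem comparison concrete, here is a minimal Python sketch of the expected payoffs. It assumes the standard textbook framing, which the summary above does not spell out: an opaque box that holds $1,000,000 only if a predictor foresaw one-boxing, a transparent box that always holds $1,000, and a predictor that is right 99% of the time. The function name and parameters are illustrative and are not taken from the paper.

```python
def newcomb_expected_payoff(one_box: bool, accuracy: float,
                            big: float = 1_000_000, small: float = 1_000) -> float:
    """Expected dollars for an agent whose choice the predictor foresees
    with probability `accuracy` (all figures are illustrative assumptions)."""
    if one_box:
        # The opaque box is full whenever the predictor was right.
        return accuracy * big
    # A two-boxer finds the opaque box full only when the predictor erred,
    # but always collects the small transparent box.
    return (1 - accuracy) * big + small


fdt_payoff = newcomb_expected_payoff(one_box=True, accuracy=0.99)   # FDT one-boxes
cdt_payoff = newcomb_expected_payoff(one_box=False, accuracy=0.99)  # CDT two-boxes
print(f"one-boxing: ${fdt_payoff:,.0f}  two-boxing: ${cdt_payoff:,.0f}")
```

Under these assumed numbers the one-boxer expects roughly $990,000 and the two-boxer roughly $11,000, which is the sense in which a one-boxing decision rule achieves more utility than CDT on this problem.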

In this paper, we define FDT, explore its prescriptions in a number of different decision problems, compare it to CDT and EDT, and give philosophical justifications for FDT as a normative theory of decision-making.

Our previous introductory paper on FDT, “Cheating Death in Damascus,” focused on comparing FDT’s performance to that of CDT and EDT in fairly high-level terms. Yudkowsky and Soares’ new paper puts a much larger focus on FDT’s mechanics and motivations, making “Functional Decision Theory” the most complete stand-alone introduction to the theory.[1]


  1. “Functional Decision Theory” was originally drafted prior to “Cheating Death in Damascus,” and was significantly longer before we received various rounds of feedback from the philosophical community. “Cheating Death in Damascus” was produced from material that was cut from early drafts; other cut material included a discussion of proof-based decision theory, and some Death in Damascus variants left on the cutting room floor for being needlessly cruel to CDT. 

AlphaGo Zero and the Foom Debate


AlphaGo Zero uses 4 TPUs, is built entirely out of neural nets with no handcrafted features, doesn’t pretrain against expert games or anything else human, reaches a superhuman level after 3 days of self-play, and is the strongest version of AlphaGo yet.

The architecture has been simplified. Previous AlphaGo had a policy net that predicted good plays, and a value net that evaluated positions, both feeding into lookahead using MCTS (random probability-weighted plays out to the end of a game). AlphaGo Zero has one neural net that selects moves and this net is trained by Paul-Christiano-style capability amplification, playing out games against itself to learn new probabilities for winning moves.

As others have also remarked, this seems to me to be an element of evidence that favors the Yudkowskian position over the Hansonian position in my and Robin Hanson’s AI-foom debate.

As I recall and as I understood:

  • Hanson doubted that what he calls “architecture” is much of a big deal, compared to (Hanson said) elements like cumulative domain knowledge, or special-purpose components built by specialized companies in what he expects to be an ecology of companies serving an AI economy.
  • When I remarked upon how it sure looked to me like humans had an architectural improvement over chimpanzees that counted for a lot, Hanson replied that this seemed to him like a one-time gain from allowing the cultural accumulation of knowledge.

I emphasize how all the mighty human edifice of Go knowledge, the joseki and tactics developed over centuries of play, the experts teaching children from an early age, was entirely discarded by AlphaGo Zero with a subsequent performance improvement. These mighty edifices of human knowledge, as I understand the Hansonian thesis, are supposed to be the bulwark against rapid gains in AI capability across multiple domains at once. I said, “Human intelligence is crap and our accumulated skills are crap,” and this appears to have been borne out.

Similarly, single research labs like DeepMind are not supposed to pull far ahead of the general ecology, because adapting AI to any particular domain is supposed to require lots of components developed all over the place by a market ecology that makes those components available to other companies. AlphaGo Zero is much simpler than that. To the extent that nobody else can run out and build AlphaGo Zero, it’s either because Google has Tensor Processing Units that aren’t generally available, or because DeepMind has a silo of expertise for being able to actually make use of existing ideas like ResNets, or both.

Sheer speed of capability gain should also be highlighted here. Most of my argument for FOOM in the Yudkowsky-Hanson debate was about self-improvement and what happens when an optimization loop is folded in on itself. Though it wasn’t necessary to my argument, the fact that Go play went from “nobody has come close to winning against a professional” to “so strongly superhuman they’re not really bothering any more” over two years just because that’s what happens when you improve and simplify the architecture, says you don’t even need self-improvement to get things that look like FOOM.

Yes, Go is a closed system allowing for self-play. It still took humans centuries to learn how to play it. Perhaps the new Hansonian bulwark against rapid capability gain can be that the environment has lots of empirical bits that are supposed to be very hard to learn, even in the limit of AI thoughts fast enough to blow past centuries of human-style learning in 3 days; and that humans have learned these vital bits over centuries of cultural accumulation of knowledge, even though we know that humans take centuries to do 3 days of AI learning when humans have all the empirical bits they need; and that AIs cannot absorb this knowledge very quickly using “architecture”, even though humans learn it from each other using architecture. If so, then let’s write down this new world-wrecking assumption (that is, the world ends if the assumption is false) and be on the lookout for further evidence that this assumption might perhaps be wrong.

AlphaGo clearly isn’t a general AI. There’s obviously stuff humans do that makes us much more general than AlphaGo, and AlphaGo obviously doesn’t do that. However, if even with the human special sauce we’re to expect AGI capabilities to be slow, domain-specific, and requiring feed-in from a big market ecology, then the situation we see without human-equivalent generality special sauce should not look like this.

To put it another way, I put a lot of emphasis in my debate on recursive self-improvement and the remarkable jump in generality across the change from primate intelligence to human intelligence. It doesn’t mean we can’t get info about speed of capability gains without self-improvement. It doesn’t mean we can’t get info about the importance and generality of algorithms without the general intelligence trick. The debate can start to settle for fast capability gains before we even get to what I saw as the good parts; I wouldn’t have predicted AlphaGo and lost money betting against the speed of its capability gains, because reality held a more extreme position than I did on the Yudkowsky-Hanson spectrum.

(Reply from Robin Hanson.)

October 2017 Newsletter


“So far as I can presently estimate, now that we’ve had AlphaGo and a couple of other maybe/maybe-not shots across the bow, and seen a huge explosion of effort invested into machine learning and an enormous flood of papers, we are probably going to occupy our present epistemic state until very near the end.

“[…I]t’s hard to guess how many further insights are needed for AGI, or how long it will take to reach those insights. After the next breakthrough, we still won’t know how many more breakthroughs are needed, leaving us in pretty much the same epistemic state as before. […] You can either act despite that, or not act. Not act until it’s too late to help much, in the best case; not act at all until after it’s essentially over, in the average case.”

Read more in a new blog post by Eliezer Yudkowsky: “There’s No Fire Alarm for Artificial General Intelligence.” (Discussion on LessWrong 2.0, Hacker News.)

Research updates

General updates

News and links

There’s No Fire Alarm for Artificial General Intelligence



What is the function of a fire alarm?


One might think that the function of a fire alarm is to provide you with important evidence about a fire existing, allowing you to change your policy accordingly and exit the building.

In the classic experiment by Latané and Darley in 1968, eight groups of three students each were asked to fill out a questionnaire in a room that shortly after began filling up with smoke. Five out of the eight groups didn’t react or report the smoke, even as it became dense enough to make them start coughing. Subsequent manipulations showed that a lone student will respond 75% of the time; while a student accompanied by two actors told to feign apathy will respond only 10% of the time. This and other experiments seemed to pin down that what’s happening is pluralistic ignorance. We don’t want to look panicky by being afraid of what isn’t an emergency, so we try to look calm while glancing out of the corners of our eyes to see how others are reacting, but of course they are also trying to look calm.

(I’ve read a number of replications and variations on this research, and the effect size is blatant. I would not expect this to be one of the results that dies to the replication crisis, and I haven’t yet heard about the replication crisis touching it. But we have to put a maybe-not marker on everything now.)

A fire alarm creates common knowledge, in the you-know-I-know sense, that there is a fire; after which it is socially safe to react. When the fire alarm goes off, you know that everyone else knows there is a fire, you know you won’t lose face if you proceed to exit the building.

The fire alarm doesn’t tell us with certainty that a fire is there. In fact, I can’t recall one time in my life when, exiting a building on a fire alarm, there was an actual fire. Really, a fire alarm is weaker evidence of fire than smoke coming from under a door.

But the fire alarm tells us that it’s socially okay to react to the fire. It promises us with certainty that we won’t be embarrassed if we now proceed to exit in an orderly fashion.

It seems to me that this is one of the cases where people have mistaken beliefs about what they believe, like when somebody loudly endorsing their city’s team to win the big game will back down as soon as asked to bet. They haven’t consciously distinguished the rewarding exhilaration of shouting that the team will win, from the feeling of anticipating the team will win.

When people look at the smoke coming from under the door, I think they think their uncertain wobbling feeling comes from not assigning the fire a high-enough probability of really being there, and that they’re reluctant to act for fear of wasting effort and time. If so, I think they’re interpreting their own feelings mistakenly. If that were so, they’d get the same wobbly feeling on hearing the fire alarm, or even more so, because fire alarms correlate with fire less than does smoke coming from under a door. The uncertain wobbling feeling comes from the worry that others believe differently, not the worry that the fire isn’t there. The reluctance to act is the reluctance to be seen looking foolish, not the reluctance to waste effort. That’s why the student alone in the room does something about the fire 75% of the time, and why people have no trouble reacting to the much weaker evidence presented by fire alarms.



It’s now and then proposed that we ought to start reacting later to the issues of Artificial General Intelligence, because, it is said, we are so far away from it that it just isn’t possible to do productive work on it today.

(For direct argument about there being things doable today, see: Soares and Fallenstein (2014/2017); Amodei, Olah, Steinhardt, Christiano, Schulman, and Mané (2016); or Taylor, Yudkowsky, LaVictoire, and Critch (2016).)

(If none of those papers existed or if you were an AI researcher who’d read them but thought they were all garbage, and you wished you could work on alignment but knew of nothing you could do, the wise next step would be to sit down and spend two hours by the clock sincerely trying to think of possible approaches. Preferably without self-sabotage that makes sure you don’t come up with anything plausible; as might happen if, hypothetically speaking, you would actually find it much more comfortable to believe there was nothing you ought to be working on today, because e.g. then you could work on other things that interested you more.)

(But never mind.)

So if AGI seems far-ish away, and you think the conclusion licensed by this is that you can’t do any productive work on AGI alignment yet, then the implicit alternative strategy on offer is: Wait for some unspecified future event that tells us AGI is coming near; and then we’ll all know that it’s okay to start working on AGI alignment.

This seems to me to be wrong on a number of grounds. Here are some of them.


September 2017 Newsletter


Research updates

General updates

  • As part of his engineering internship at MIRI, Max Harms assisted in the construction and extension of RL-Teacher, an open-source tool for training AI systems with human feedback based on the “Deep RL from Human Preferences” OpenAI / DeepMind research collaboration. See OpenAI’s announcement.
  • MIRI COO Malo Bourgon participated in panel discussions on getting things done (video) and working in AI (video) at the Effective Altruism Global conference in San Francisco. AI Impacts researcher Katja Grace also spoke on AI safety (video). Other EAG talks on AI included Daniel Dewey’s (video) and Owen Cotton-Barratt’s (video), and a larger panel discussion (video).
  • Announcing two winners of the Intelligence in Literature prize: Laurence Raphael Brothers’ “Houseproud” and Shane Halbach’s “Human in the Loop”.
  • RAISE, a project to develop online AI alignment course material, is seeking volunteers. 

News and links

  • The Open Philanthropy Project is accepting applicants to an AI Fellows Program “to fully support a small group of the most promising PhD students in artificial intelligence and machine learning”. See also Open Phil’s partial list of key research topics in AI alignment.
  • Call for papers: AAAI and ACM are running a new Conference on AI, Ethics, and Society, with submissions due by the end of October.
  • DeepMind’s Viktoriya Krakovna argues for a portfolio approach to AI safety research.
  • “Teaching AI Systems to Behave Themselves”: a solid article from the New York Times on the growing field of AI safety research. The Times also has an opening for an investigative reporter in AI.
  • UC Berkeley’s Center for Long-term Cybersecurity is hiring for several roles, including researcher, assistant to the director, and program manager.
  • Life 3.0: Max Tegmark releases a new book on the future of AI (podcast discussion).

New paper: “Incorrigibility in the CIRL Framework”



MIRI assistant research fellow Ryan Carey has a new paper out discussing situations where good performance in Cooperative Inverse Reinforcement Learning (CIRL) tasks fails to imply that software agents will assist or cooperate with programmers.

The paper, titled “Incorrigibility in the CIRL Framework,” lays out four scenarios in which CIRL violates the four conditions for corrigibility defined in Soares et al. (2015). Abstract:

A value learning system has incentives to follow shutdown instructions, assuming the shutdown instruction provides information (in the technical sense) about which actions lead to valuable outcomes. However, this assumption is not robust to model mis-specification (e.g., in the case of programmer errors). We demonstrate this by presenting some Supervised POMDP scenarios in which errors in the parameterized reward function remove the incentive to follow shutdown commands. These difficulties parallel those discussed by Soares et al. (2015) in their paper on corrigibility.

We argue that it is important to consider systems that follow shutdown commands under some weaker set of assumptions (e.g., that one small verified module is correctly implemented; as opposed to an entire prior probability distribution and/or parameterized reward function). We discuss some difficulties with simple ways to attempt to attain these sorts of guarantees in a value learning framework.

The paper is a response to a paper by Hadfield-Menell, Dragan, Abbeel, and Russell, “The Off-Switch Game.” Hadfield-Menell et al. show that an AI system will be more responsive to human inputs when it is uncertain about its reward function and thinks that its human operator has more information about this reward function. Carey shows that the CIRL framework can be used to formalize the problem of corrigibility, and that the known assurances for CIRL systems, given in “The Off-Switch Game”, rely on strong assumptions about having an error-free CIRL system. With less idealized assumptions, a value learning agent may have beliefs that cause it to evade redirection from the human.
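The incentive described in “The Off-Switch Game,” and the way it depends on the agent’s beliefs being right, can be illustrated with a small expected-value calculation. This is a simplified sketch and not the formal model from either paper: it assumes a robot choosing between acting immediately, switching itself off (utility 0), or deferring to a human who knows the action’s true utility and permits it only when that utility is non-negative. The beliefs and numbers are made up for illustration.

```python
def value_of_options(belief):
    """belief: list of (probability, utility) pairs describing what the robot
    thinks its proposed action is worth. All values here are hypothetical."""
    act_now = sum(p * u for p, u in belief)
    switch_off = 0.0
    # A fully informed human blocks the action exactly when its utility is negative.
    defer = sum(p * max(u, 0.0) for p, u in belief)
    return {"act": act_now, "off": switch_off, "defer": defer}


# Robot is genuinely unsure whether the action is good or bad.
uncertain = value_of_options([(0.5, 10.0), (0.5, -10.0)])

# Robot is certain the action is good (possibly because of a bug in its prior).
overconfident = value_of_options([(1.0, 10.0)])

print(uncertain)
print(overconfident)
```

With the uncertain belief, deferring is worth 5 while acting and switching off are both worth 0, so the robot strictly prefers to leave the human in control. With the overconfident belief, deferring and acting are tied at 10, so the strict incentive to defer is gone. If that certainty comes from a programmer error instead of from evidence, nothing in the calculation protects the human’s ability to intervene.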

[T]he purpose of a shutdown button is to shut the AI system down in the event that all other assurances failed, e.g., in the event that the AI system is ignoring (for one reason or another) the instructions of the operators. If the designers of [the AI system] R have programmed the system so perfectly that the prior and [reward function] R are completely free of bugs, then the theorems of Hadfield-Menell et al. (2017) do apply. In practice, this means that in order to be corrigible, it would be necessary to have an AI system that was uncertain about all things that could possibly matter. The problem is that performing Bayesian reasoning over all possible worlds and all possible value functions is quite intractable. Realistically, humans will likely have to use a large number of heuristics and approximations in order to implement the system’s belief system and updating rules. […]

Soares et al. (2015) seem to want a shutdown button that works as a mechanism of last resort, to shut an AI system down in cases where it has observed and refused a programmer suggestion (and the programmers believe that the system is malfunctioning). Clearly, some part of the system must be working correctly in order for us to expect the shutdown button to work at all. However, it seems undesirable for the working of the button to depend on there being zero critical errors in the specification of the system’s prior, the specification of the reward function, the way it categorizes different types of actions, and so on. Instead, it is desirable to develop a shutdown module that is small and simple, with code that could ideally be rigorously verified, and which ideally works to shut the system down even in the event of large programmer errors in the specification of the rest of the system.

In order to do this in a value learning framework, we require a value learning system that (i) is capable of having its actions overridden by a small verified module that watches for shutdown commands; (ii) has no incentive to remove, damage, or ignore the shutdown module; and (iii) has some small incentive to keep its shutdown module around; even under a broad range of cases where R, the prior, the set of available actions, etc. are misspecified.

Even if the utility function is learned, there is still a need for additional lines of defense against unintended failures. The hope is that this can be achieved by modularizing the AI system. For that purpose, we would need a model of an agent that will behave corrigibly in a way that is robust to misspecification of other system components.
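As a rough illustration of requirement (i) only, the sketch below wraps an arbitrary policy in a small override layer that checks for a shutdown command before consulting the policy at all. The names (`guarded_step`, `SHUTDOWN`, `stubborn_policy`) are hypothetical and do not come from the paper, and the sketch says nothing about requirements (ii) and (iii), which concern the agent’s incentives and not its control flow.

```python
from typing import Callable

SHUTDOWN = "shutdown"  # hypothetical action label


def guarded_step(policy: Callable[[str], str], observation: str,
                 shutdown_requested: bool) -> str:
    """Return the action to execute, letting a shutdown command override the policy."""
    if shutdown_requested:
        # The policy is never consulted on this branch, so errors in its
        # prior or reward function cannot change the returned action.
        return SHUTDOWN
    return policy(observation)


def stubborn_policy(observation: str) -> str:
    # Stand-in for a misspecified agent that always wants to keep working.
    return "continue_task"


action_normal = guarded_step(stubborn_policy, "obs", shutdown_requested=False)
action_halted = guarded_step(stubborn_policy, "obs", shutdown_requested=True)
print(action_normal, action_halted)
```

What this leaves out is the hard part the quoted passage points to: an agent acting in the world could take earlier actions that remove or bypass such a wrapper, which is why the incentive conditions matter as much as the override itself.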




August 2017 Newsletter


Research updates

General updates

News and links