Our 2018 Fundraiser is now underway!

Robust Cooperation: A Case Study in Friendly AI Research

 |   |  Analysis

robots shaking hands (cropped)

The paper “Robust Cooperation in the Prisoner’s Dilemma: Program Equilibrium via Provability Logic” is among the clearer examples of theoretical progress produced by explicitly FAI-related research goals. What can we learn from this case study in Friendly AI research? How were the results obtained? How did the ideas build on each other? Who contributed which pieces? Which kinds of synergies mattered?

To answer these questions, I spoke to many of the people who contributed to the “robust cooperation” result.

I’ll begin the story in December 2011, when Vladimir Slepnev (a Google engineer in Zurich) posted A model of UDT with a halting oracle, representing joint work with Vladimir Nesov (a computer science grad student in Moscow).1 This post, arguably for the first time,2 presented a formal model of Wei Dai’s updateless decision theory (UDT) and showed that UDT agents would “win” when presented with Newcomb’s Problem — if the universe program and its agent sub-program had access to a halting oracle, anyway. Nisan Stiennon (a math grad student at Stanford) applied Slepnev’s formalization to the problem of proving cooperation using Peano Arithmetic in Formulas of arithmetic that behave like decision agents (Feb. 2012).3

The success of these two posts in formalizing UDT inspired Patrick LaVictoire (a math postdoc in Madison) to attempt a “semi-formal analysis” of timeless decision theory (TDT), an earlier decision theory invented by Eliezer Yudkowsky (MIRI’s founder), which itself had been a significant inspiration for UDT. After three setup posts, LaVictoire thought he had successfully formalized something kind of like TDT in April 2012.

LaVictoire didn’t get much reaction from other TDT/UDT researchers, so when he visited the SF Bay Area for a CFAR workshop in July 2012, he tracked down Yudkowsky, Stiennon, Paul Christiano (a computer science grad student at Berkeley), and several others to talk to them about his attempt to formalize TDT. Their reaction was positive enough that LaVictoire was encouraged to keep working on the approach.

LaVictoire also discussed his work with Slepnev when Slepnev visited the Bay Area in August 2012, and Slepnev pointed out that LaVictoire’s attempted formalization of TDT (now called “Masquerade”) was fatally flawed for Löbian reasons. But in September 2012, LaVictoire was able to patch the problem by having Masquerade escalate between different formal systems. At this point, LaVictoire began writing an early draft of the “Robust Cooperation” paper.

Slepnev insisted on the importance of optimality results, so later that month LaVictoire came up with a candidate optimality notion, and then in October noticed that Masquerade itself failed to be optimal by that definition. This was roughly the state of things when MIRI’s April 2013 workshop began.

Early in the workshop, LaVictoire gave a Masquerade tutorial to the other participants. Tweaking Masquerade eventually led to the concept of modal agents, and LaVictoire and Mihály Barasz (a Google engineer in Zurich) began looking for ways to mechanically validate the actions of such agents against one another. Eventually, Barasz and Marcello Herreshoff (a Google engineer in the Bay Area) developed a model checker for modal agent interactions, so that agents’ choices against other agents could be proved mechanically.

Near the end of the April workshop, Christiano developed PrudentBot, which is in some sense the “star” of the current paper. Additional contributions were made by Yudkowsky, Benja Fallenstein (a grad student at Bristol University), and others during the workshop. LaVictoire updated the draft paper with the April workshop’s results, and posted it to Less Wrong in June 2013.

Later, at MIRI’s September 2013 workshop, Kenny Easwaran (a philosopher at USC) found that it was harder than LaVictoire had expected to prove that any unexploitable agent must eventually fail to optimize against some kind of WaitFairBot. Herreshoff worked on patching this, but the proof was ballooning that section of the paper beyond recognition for a minor result, so LaVictoire decided to remove it from the paper.

In December 2013, Fallenstein found that the paper didn’t adequately show that the actions of two modal agents depend solely on their modal description, and he introduced a set of patches for this. LaVictoire modified the paper once more, and then, with the consent of his co-authors, uploaded the revised paper to arxiv in January 2014.

What, then, is the meaning and significance of the results in the “Robust Cooperation” paper? LaVictoire’s view, at least, is this:

The significance of modal combat is that it’s a toy universe in which we can study concepts of advanced decision theory (and which we might modify slightly in order to study other concepts, like blackmail), and within which the intuitively appealing idea of superrationality in fact works out. It’s at least a philosophical hint that good communication can enable cooperation without the usual costs of enforcement and punishment, and that there are incentives toward simplicity and verifiability among rational agents.

In fact, it’s an even more basic analogue of an Iterated Prisoner’s Dilemma tournament. Just as Axelrod’s IPD [iterated PD] tournament illustrated the usefulness of “tough but fair” and gave rise to the idea of evolutionary incentives for reciprocal altruism, I think that modal combat is a useful sandbox for illustrating the logic of “superrationality.” Furthermore, modal combat includes many of the features of the IPD (with levels of deduction being somewhat analogous to an agent’s historical interactions with another agent), and it has extremely simple grammar for the level of sophistication of these algorithms.


  1. The development of updateless decision theory itself is another story, which won’t be recounted here in detail. Two brief sources on this story are the ‘Prior Work’ section of Vladimir Nesov’s Controlling Constant Programs, and also this comment. Nesov’s very brief summary of the development of UDT goes like this: “(1) Eliezer Yudkowsky’s early informal remarks about TDT and Anna Salamon’s post brought up the point that some situations should be modeled by unusual dependencies, motivating the question of how to select an appropriate model (infer dependencies). (2) Wei Dai’s UDT post sketched one way of doing that, but at the time I didn’t understand that post as answering this question, and eventually figured it out for programs-control-programs case in May 2010. After discussion on the decision theory mailing list, Vladimir Slepnev applied the technique to the prisoner’s dilemma (PD). (3) The more general technique was then written up by Slepnev and me, with Slepnev’s posts having more technical substance, and mine more speculative, attempting to find better ways of framing the theory: What a reduction of ‘could’ could look like, Controlling constant programs, and Notion of preference in ambient control. (4) There were still a number of technical issues around ‘spurious moral arguments’. See this comment by Benja Fallenstein and An example of self-fulfilling spurious proofs in UDT. (5) One solution was adding a ‘chicken rule’ to the decision algorithm, which I figured out for the programs-control-programs case in April 2011 and discussed a bit on the decision theory list, but which turned out to be much more theoretically robust in a setting with a halting oracle, which came up in another discussion on the decision theory list in December 2011, which Slepnev wrote up in A model of UDT with a halting oracle. My later write up was Predictability of Decisions and the Diagonal Method. (6) Armed with the diagonal trick (chicken rule), Stiennon wrote up cooperation in PD for the oracle case, which was more theoretically tractable than Slepnev’s earlier no-oracle solution for PD. (7) At this point, we had both a formalization of UDT that didn’t suffer from the spurious proofs problem, and an illustration of how it can be applied to a non-trivial problem like PD.” 
  2. Some researchers might instead say that Slepnev’s August 2010 post “What a reduction of ‘could’ could look like” presented the “first formal model” of UDT. 
  3. Stiennon’s post also improved the formalization by using a two-step “chicken rule” rather than a one-step chicken rule.