2018 research plans and predictions

MIRI Strategy

Update Nov. 23: This post was edited to reflect Scott’s terminology change from “naturalized world-models” to “embedded world-models.” For a full introduction to these four research problems, see Scott Garrabrant and Abram Demski’s “Embedded Agency.”


Scott Garrabrant is taking over Nate Soares’ job of making predictions about how much progress we’ll make in different research areas this year. Scott divides MIRI’s alignment research into five categories:


embedded world-models — Problems related to modeling large, complex physical environments that lack a sharp agent/environment boundary. Central examples of problems in this category include logical uncertainty, naturalized induction, multi-level world models, and ontological crises.

Introductory resources: “Formalizing Two Problems of Realistic World-Models,” “Questions of Reasoning Under Logical Uncertainty,” “Logical Induction,” “Reflective Oracles”

Examples of recent work: “Hyperreal Brouwer,” “An Untrollable Mathematician,” “Further Progress on a Bayesian Version of Logical Uncertainty”


decision theory — Problems related to modeling the consequences of different (actual and counterfactual) decision outputs, so that the decision-maker can choose the output with the best consequences. Central problems include counterfactuals, updatelessness, coordination, extortion, and reflective stability.

Introductory resources: “Cheating Death in Damascus,” “Decisions Are For Making Bad Outcomes Inconsistent,” “Functional Decision Theory”

Examples of recent work: “Cooperative Oracles,” “Smoking Lesion Steelman” (1, 2), “The Happy Dance Problem,” “Reflective Oracles as a Solution to the Converse Lawvere Problem”


robust delegation — Problems related to building highly capable agents that can be trusted to carry out some task on one’s behalf. Central problems include corrigibility, value learning, informed oversight, and Vingean reflection.

Introductory resources: “The Value Learning Problem,” “Corrigibility,” “Problem of Fully Updated Deference,” “Vingean Reflection,” “Using Machine Learning to Address AI Risk”

Examples of recent work: “Categorizing Variants of Goodhart’s Law,” “Stable Pointers to Value”


subsystem alignment — Problems related to ensuring that an AI system’s subsystems are not working at cross purposes, and in particular that the system avoids creating internal subprocesses that optimize for unintended goals. Central problems include benign induction.

Introductory resources: “What Does the Universal Prior Actually Look Like?”, “Optimization Daemons,” “Modeling Distant Superintelligences”

Examples of recent work: “Some Problems with Making Induction Benign”


other — Alignment research that doesn’t fall into the above categories. If we make progress on the open problems described in “Alignment for Advanced ML Systems,” and the progress is less connected to our agent foundations work and more ML-oriented, then we’ll likely classify it here.


Read more »

New paper: “Categorizing variants of Goodhart’s Law”

Papers

Goodhart’s Law states that “any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.” However, this is not a single phenomenon. In “Goodhart Taxonomy,” I proposed that there are (at least) four different mechanisms through which proxy measures break when you optimize for them: Regressional, Extremal, Causal, and Adversarial.

David Manheim has now helped write up my taxonomy as a paper going into more detail on these mechanisms: “Categorizing variants of Goodhart’s Law.” From the conclusion:

This paper represents an attempt to categorize a class of simple statistical misalignments that occur both in any algorithmic system used for optimization, and in many human systems that rely on metrics for optimization. The dynamics highlighted are hopefully useful to explain many situations of interest in policy design, in machine learning, and in specific questions about AI alignment.

In policy, these dynamics are commonly encountered but too-rarely discussed clearly. In machine learning, these errors include extremal Goodhart effects due to using limited data and choosing overly parsimonious models, errors that occur due to myopic consideration of goals, and mistakes that occur when ignoring causality in a system. Finally, in AI alignment, these issues are fundamental to both aligning systems towards a goal, and assuring that the system’s metrics do not have perverse effects once the system begins optimizing for them.

Let V refer to the true goal, while U refers to a proxy for that goal which was observed to correlate with V and which is being optimized in some way. Then the four subtypes of Goodhart’s Law are as follows:


Regressional Goodhart — When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.

  • Model: When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U. (See the simulation sketch after this list.)
  • Example: Height is correlated with basketball ability, and does actually directly help, but the best player is only 6’3″, and a random 7′ person in their 20s would probably not be as good.

Extremal Goodhart — Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.

  • Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation. (See the sketch after this list.)
  • Example: The tallest person on record, Robert Wadlow, was 8’11” (2.72m). He grew to that height because of a pituitary disorder; he would have struggled to play basketball because he “required leg braces to walk and had little feeling in his legs and feet.”

Causal Goodhart — When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.

  • Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V. (See the sketch after this list.)
  • Example: Someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball.

Adversarial Goodhart — When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.

  • Model: Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values. (See the sketch after this list.)
  • Example: Aspiring NBA players might just lie about their height.

For more on this topic, see Eliezer Yudkowsky’s write-up, Goodhart’s Curse.

March 2018 Newsletter

Newsletters

Sam Harris and Eliezer Yudkowsky on “AI: Racing Toward the Brink”

Conversations

Waking Up with Sam Harris

MIRI senior researcher Eliezer Yudkowsky was recently invited to be a guest on Sam Harris’ “Waking Up” podcast. Sam is a neuroscientist and popular author who writes on topics related to philosophy, religion, and public discourse.

The following is a complete transcript of Sam and Eliezer’s conversation, AI: Racing Toward the Brink.


Read more »

February 2018 Newsletter

Newsletters

January 2018 Newsletter

Newsletters

Fundraising success!

News

Our 2017 fundraiser is complete! We’ve had an incredible month, with by far our largest fundraising success to date. More than 300 distinct donors gave just over $2.5M¹, doubling our third fundraising target of $1.25M. Thank you!


$2,504,625 raised in total!

358 donors contributed


Our largest donation came toward the very end of the fundraiser in the form of an Ethereum donation worth $763,970 from Vitalik Buterin, the inventor and co-founder of Ethereum. Vitalik’s donation represents the third-largest single contribution we’ve received to date, after a $1.25M grant disbursement from the Open Philanthropy Project in October, and a $1.01M Ethereum donation in May.

In our mid-fundraiser update, we noted that MIRI was included in a large Matching Challenge: in partnership with Raising for Effective Giving, professional poker players Dan Smith, Tom Crowley, and Martin Crowley announced they would match all donations to MIRI and nine other organizations through the end of December. Donors helped get us to our matching cap of $300k within 2 weeks, resulting in a $300k match from Dan, Tom, and Martin (thanks, guys!). Other big winners from the Matching Challenge, which raised $4.5M (match included) in less than 3 weeks, include GiveDirectly ($588k donated) and the Good Food Institute ($416k donated).

Other big donations we received in December included:

We also received substantial support from medium-sized donors: a total of $631,595 from the 42 donors who gave $5,000–$50,000, and a total of $113,556 from the 75 who gave $500–$5,000 (graph). We are also grateful to donors who leveraged their employers’ matching generosity, donating a combined amount of over $100,000 during December.

66% of funds donated during this fundraiser were in the form of cryptocurrency (mainly Bitcoin and Ethereum), including Vitalik, Marius, and Christian’s donations, along with Dan, Tom, and Martin’s matching contributions.

Overall, we’ve had an amazingly successful month and a remarkable year! I’m extremely grateful for all the support we’ve received, and excited about the opportunity this creates for us to grow our research team more quickly. For details on our growth plans, see our fundraiser post.


  1. The exact total might increase slightly over the coming weeks as we process donations initiated in December 2017 that arrive in January 2018. 

End-of-the-year matching challenge!

News

Update 2017-12-27: We’ve blown past our 3rd and final target, and reached the matching cap of $300,000 for the Matching Challenge! Thanks so much to everyone who supported us!

All donations made before 23:59 PST on Dec 31st will continue to be counted towards our fundraiser total. The fundraiser total includes projected matching funds from the Challenge.



Professional poker players Martin Crowley, Tom Crowley, and Dan Smith, in partnership with Raising for Effective Giving, have just announced a $1 million Matching Challenge and included MIRI among the 10 organizations they are supporting!

Give to any of the organizations involved before noon (PST) on December 31 for your donation to be eligible for a dollar-for-dollar match, up to the $1 million limit!

The eligible organizations for matching are:

  • Animal welfare — Effective Altruism Funds’ animal welfare fund, The Good Food Institute
  • Global health and development — Against Malaria Foundation, Schistosomiasis Control Initiative, Helen Keller International’s vitamin A supplementation program, GiveDirectly
  • Global catastrophic risk — MIRI
  • Criminal justice reform — Brooklyn Community Bail Fund, Massachusetts Bail Fund, Just City Memphis

The Matching Challenge’s website lists two options for MIRI donors to get matched: (1) donating on 2017charitydrive.com, or (2) donating directly on MIRI’s website and sending the receipt to receiptsforcharity@gmail.com. We recommend option 2, particularly for US tax residents (because MIRI is a 501(c)(3) organization) and those looking for a wider array of payment methods.


In other news, we’ve hit our first fundraising target ($625,000)!

We’re also happy to announce that we’ve received a $368k bitcoin donation from Christian Calderon, a cryptocurrency enthusiast, and also a donation worth $59k from early bitcoin investor Marius van Voorden.

In total, so far, we’ve received donations valued at $697,638 from 137 distinct donors, 76% of it in the form of cryptocurrency (48% if we exclude Christian’s donation). Thanks as well to Jacob Falkovich for his fundraiser/matching post, whose opinion distribution curves plausibly raised over $27k for MIRI this week, including his match.

Our funding drive, along with the Matching Challenge, will continue through the end of December.


Correction December 17: I previously listed GiveWell as one of the eligible organizations for matching, which is not correct.