MSFP is an extended retreat for mathematicians and programmers with a serious interest in making technical progress on the problem of AI alignment. It includes an overview of CFAR’s applied rationality content, a breadth-first grounding in the MIRI perspective on AI safety, and multiple days of actual hands-on research with participants and MIRI staff attempting to make inroads on open questions.
MIRIx is a program where MIRI helps cover basic expenses for outside groups that want to work on open problems in AI safety. You can start your own group or find information on existing meet-ups at intelligence.org/mirix.
Several MIRIx groups have recently been ramping up their activity, including:
- UC Irvine: Daniel Hermann is starting a MIRIx group in Irvine, California. Contact him if you’d like to be involved.
- Seattle: MIRIxSeattle is a small group that’s in the process of restarting and increasing its activities. Contact Pasha Kamyshev if you’re interested.
- Vancouver: Andrew McKnight and Evan Gaensbauer are looking for more people who’d like to join MIRIxVancouver events.
The new alignment field guide is intended to provide tips and background models to MIRIx groups, based on our experience of what tends to make a research group succeed or fail.
The guide begins:
Preamble I: Decision Theory
Hello! You may notice that you are reading a document.
This fact comes with certain implications. For instance, why are you reading this? Will you finish it? What decisions will you come to as a result? What will you do next?
Notice that, whatever you end up doing, it’s likely that there are dozens or even hundreds of other people, quite similar to you and in quite similar positions, who will follow reasoning which strongly resembles yours, and make choices which correspondingly match.
Given that, it’s our recommendation that you make your next few decisions by asking the question “What policy, if followed by all agents similar to me, would result in the most good, and what does that policy suggest in my particular case?” It’s less of a question of trying to decide for all agents sufficiently-similar-to-you (which might cause you to make the wrong choice out of guilt or pressure) and more something like “if I were in charge of all agents in my reference class, how would I treat instances of that class with my specific characteristics?”
If that kind of thinking leads you to read further, great. If it leads you to set up a MIRIx chapter, even better. In the meantime, we will proceed as if the only people reading this document are those who justifiably expect to find it reasonably useful.
Preamble II: Surface Area
Imagine that you have been tasked with moving a cube of solid iron that is one meter on a side. Given that such a cube weighs ~16000 pounds, and that an average human can lift ~100 pounds, a naïve estimation tells you that you can solve this problem with ~150 willing friends.
But of course, a meter cube can fit at most something like 10 people around it. It doesn’t matter if you have the theoretical power to move the cube if you can’t bring that power to bear in an effective manner. The problem is constrained by its surface area.
MIRIx chapters are one of the best ways to increase the surface area of people thinking about and working on the technical problem of AI alignment. And just as it would be a bad idea to decree “the 10 people who happen to currently be closest to the metal cube are the only ones allowed to think about how to think about this problem”, we don’t want MIRI to become the bottleneck or authority on what kinds of thinking can and should be done in the realm of embedded agency and other relevant fields of research.
The hope is that you and others like you will help actually solve the problem, not just follow directions or read what’s already been written. This document is designed to support people who are interested in doing real groundbreaking research themselves.
Human values and preferences are hard to specify, especially in complex domains. Accordingly, much AGI safety research has focused on approaches to AGI design that refer to human values and preferences indirectly, by learning a model that is grounded in expressions of human values (via stated preferences, observed behaviour, approval, etc.) and/or real-world processes that generate expressions of those values. There are additionally approaches aimed at modelling or imitating other aspects of human cognition or behaviour without an explicit aim of capturing human preferences (but usually in service of ultimately satisfying them). Let us refer to all these models as human models.
In this post, we discuss several reasons to be cautious about AGI designs that use human models. We suggest that the AGI safety research community put more effort into developing approaches that work well in the absence of human models, alongside the approaches that rely on human models. This would be a significant addition to the current safety research landscape, especially if we focus on working out and trying concrete approaches as opposed to developing theory. We also acknowledge various reasons why avoiding human models seems difficult.
Problems with Human Models
To be clear about human models, we draw a rough distinction between our actual preferences (which may not be fully accessible to us) and procedures for evaluating our preferences. The first thing, actual preferences, is what humans actually want upon reflection. Satisfying our actual preferences is a win. The second thing, procedures for evaluating preferences, refers to various proxies for our actual preferences such as our approval, or what looks good to us (with necessarily limited information or time for thinking). Human models are in the second category; consider, as an example, a highly accurate ML model of human yes/no approval on the set of descriptions of outcomes. Our first concern, described below, is about overfitting to human approval and thereby breaking its connection to our actual preferences. (This is a case of Goodhart’s law.)
Our 2018 Fundraiser ended on December 31 with the five week campaign raising $951,8171 from 348 donors to help advance MIRI’s mission. We surpassed our Mainline Target ($500k) and made it more than halfway again to our Accelerated Growth Target ($1.2M). We’re grateful to all of you who supported us. Thank you!
348 donors contributed
With cryptocurrency prices significantly lower than during our 2017 fundraiser, we received less of our funding (~6%) from holders of cryptocurrency this time around. Despite this, our fundraiser was a success, in significant part thanks to the leverage gained by MIRI supporters’ participation in multiple matching campaigns during the fundraiser, including WeTrust Spring’s Ethereum-matching campaign, Facebook’s Giving Tuesday event and professional poker player Dan Smith’s Double Up Drive, expertly administered by Raising for Effective Giving.
- Map and Territory is:
- How to Actually Change Your Mind is:
- $8 on Amazon, for the print version.
- Pay-what-you-on Gumroad, for PDF, EPUB, and MOBI versions (available in the next day).