September 2021 Newsletter

September 29, 2021 | Rob Bensinger | Newsletters

Scott Garrabrant has concluded the main section of his Finite Factored Sets sequence (“Details and Proofs”) with posts on inferring time and applications, future work, and speculation.

Scott’s new frameworks are also now available as a pair of arXiv papers: “Cartesian Frames” (adapted from the Cartesian Frames sequence for a philosopher audience by Daniel Hermann and Josiah Lopez-Wild) and “Temporal Inference with Finite Factored Sets” (essentially identical to the “Details and Proofs” section of Scott’s sequence).

Other MIRI updates

DeepMind’s Rohin Shah has written his own introduction to finite factored sets.
Alex Appel extends the idea of finite factored sets to countable-dimensional factored spaces.
Open Philanthropy’s Joe Carlsmith has written what’s probably the best existing introduction to MIRI-cluster work on decision theory: Can You Control the Past? See also Carlsmith’s decision theory conversation with MIRI’s Abram Demski and Scott Garrabrant.
From social media: Eliezer Yudkowsky discusses paths to AGI and the ignorance argument for long timelines, and talks with Vitalik Buterin about GPT-3 and pivotal acts.

News and links

A solid new introductory resource: Holden Karnofsky has written a series of essays on his new blog (Cold Takes) arguing that “the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement”. See also Holden’s conversation with Rob Wiblin on the 80,000 Hours Podcast.
The Future of Life Institute announces the Vitalik Buterin PhD Fellowship in AI Existential Safety, “targeted at students applying to start their PhD in 2022”. You can apply at https://grants.futureoflife.org/; the deadline is Nov. 5.
OpenAI releases Codex, “a GPT language model fine-tuned on publicly available code from GitHub”.

August 2021 Newsletter

August 31, 2021 | Rob Bensinger | Newsletters

July 2021 Newsletter

August 3, 2021 | Rob Bensinger | Newsletters

June 2021 Newsletter

July 1, 2021 | Rob Bensinger | Newsletters

Our big news this month is Scott Garrabrant’s finite factored sets, one of MIRI’s largest results to date.

For most people, the best introductory resource on FFS is likely Scott’s Topos talk/transcript. Scott is also in the process of posting a longer, more mathematically dense introduction in multiple parts: part 1, part 2.

Scott has also discussed factored sets with Daniel Filan on the AI X-Risk Podcast, and in a LessWrong talk/transcript.

Other MIRI updates

On MIRI researcher Abram Demski’s view, the core inner alignment problem is the absence of robust safety arguments “in a case where we might naively expect it. We don't know how to rule out the presence of (misaligned) mesa-optimizers.” Abram advocates a more formal approach to the problem:

Most of the work on inner alignment so far has been informal or semi-formal (with the notable exception of a little work on minimal circuits). I feel this has resulted in some misconceptions about the problem. I want to write up a large document clearly defining the formal problem and detailing some formal directions for research. Here, I outline my intentions, inviting the reader to provide feedback and point me to any formal work or areas of potential formal work which should be covered in such a document.
Mark Xu writes An Intuitive Guide to Garrabrant Induction (a.k.a. logical induction).
MIRI research associate Ramana Kumar has formalized the ideas in Scott Garrabrant’s Cartesian Frames sequence in higher-order logic, “including machine verified proofs of all the theorems”.
Independent researcher Alex Flint writes on probability theory and logical induction as lenses and on gradations of inner alignment obstacles.
I (Rob) asked 44 people working on long-term AI risk about the level of existential risk from AI (EA Forum link, LW link). Responses were all over the map (with MIRI more pessimistic than most organizations). The mean respondent’s probability of existential catastrophe from “AI systems not doing/optimizing what the people deploying them wanted/intended” was ~40%, median 30%. (See also the independent survey by Clarke, Carlier, and Schuett.)
MIRI recently spent some time seriously evaluating whether to move out of the Bay Area. We’ve now decided to stay in the Bay. For more details, see MIRI board member Blake Borgeson’s update.

News and links

Dario and Daniela Amodei, formerly at OpenAI, have launched a new organization, Anthropic, with a goal of doing “computationally-intensive research to develop large-scale AI systems that are steerable, interpretable, and robust”.
Jonas Vollmer writes that the Long-Term Future Fund and the Effective Altruism Infrastructure Fund are now looking for grant applications: "We fund student scholarships, career exploration, local groups, entrepreneurial projects, academic teaching buy-outs, top-up funding for poorly paid academics, and many other things. We can make anonymous grants without public reporting. We will consider grants as low as $1,000 or as high as $500,000 (or more in some cases). As a reminder, EA Funds is more flexible than you might think." Going forward, these two funds will accept applications at any time, rather than having distinct grant rounds. You can apply here.

Finite Factored Sets

May 23, 2021 | Scott Garrabrant | Papers

This is the edited transcript of a talk introducing finite factored sets. For most readers, it will probably be the best starting point for learning about factored sets.

(Lightly edited) slides: https://intelligence.org/files/Factored-Set-Slides.pdf

(Part 1, Title Slides) · · · Finite Factored Sets

(Part 1, Motivation) · · · Some Context

Scott: So I want to start with some context. For people who are not already familiar with my work:

My main motivation is to reduce existential risk.
I try to do this by trying to figure out how to align advanced artificial intelligence.
I try to do this by trying to become less confused about intelligence and optimization and agency and various things in that cluster.
My main strategy here is to develop a theory of agents that are embedded in the environment that they’re optimizing. I think there are a lot of open hard problems around doing this.
This leads me to do a bunch of weird math and philosophy. This talk is going to be an example of some weird math and philosophy.

For people who are already familiar with my work, I just want to say that according to my personal aesthetics, the subject of this talk is about as exciting as Logical Induction, which is to say I’m really excited about it. And I’m really excited about this audience; I’m excited to give this talk right now.

May 2021 Newsletter

May 18, 2021 | Rob Bensinger | Newsletters

MIRI senior researcher Scott Garrabrant has a major new result, “Finite Factored Sets,” that he’ll be unveiling in an online talk this Sunday at noon Pacific time. (Zoom link.) For context on the result, see Scott’s new post “Saving Time.”

In other big news, MIRI has just received its two largest individual donations of all time! Ethereum inventor Vitalik Buterin has donated ~$4.3 million worth of ETH to our research program, while an anonymous long-time supporter has donated MKR tokens we liquidated for an astounding ~$15.6 million. The latter donation is restricted so that we can spend a maximum of $2.5 million of it per year until 2025, like a multi-year grant.

Both donors have our massive thanks for these incredible gifts to support our work!

Other MIRI updates

Mark Xu and Evan Hubinger use “Cartesian world models” to distinguish “consequential agents” (which assign utility to environment states, internal states, observations, and/or actions), “structural agents” (which optimize “over the set of possible decide functions instead of the set of possible actions”), and “conditional agents” (which map e.g. environmental states to utility functions, rather than mapping them to utility).
In Gradations of Inner Alignment Obstacles, Abram Demski makes three “contentious claims”:

The most useful definition of “mesa-optimizer” doesn’t require them to perform explicit search, contrary to the current standard.

Success at aligning narrowly superhuman models might be bad news.

Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.

Eliezer Yudkowsky comments on the relationship between early AGI systems’ alignability and capabilities.

News and links

John Wentworth announces a project to test the natural abstraction hypothesis, which asserts that “most high-level abstract concepts used by humans are ‘natural’” and therefore “a wide range of architectures will reliably learn similar high-level concepts”.
Open Philanthropy’s Joe Carlsmith asks “Is Power-Seeking AI an Existential Risk?”, and Luke Muehlhauser asks for examples of treacherous turns in the wild (also on LessWrong).
From DeepMind’s safety researchers: What Mechanisms Drive Agent Behavior?, Alignment of Language Agents, and An EPIC Way to Evaluate Reward Functions. Also, Rohin Shah provides his advice on entering the field.
Owen Shen and Peter Hase summarize 70 recent papers on model transparency, interpretability, and explainability.
Eli Tyre asks: How do we prepare for final crunch time? (I would add some caveats: Some roles and scenarios imply that you’ll have less impact on the eve of AGI, and can have far more impact today. For some people, “final crunch time” may be now, and marginal efforts matter less later. Further, some forms of “preparing for crunch time” will fail if there aren’t clear warning shots or fire alarms.)
Paul Christiano launches a new organization that will be his focus going forward: the Alignment Research Center. Learn more about Christiano’s research approach in My Research Methodology and in his recent AMA.

Saving Time

May 18, 2021 | Scott Garrabrant | Analysis

Note: This is a preamble to Finite Factored Sets, a sequence I’ll be posting over the next few weeks. This Sunday at noon Pacific time, I’ll be giving a Zoom talk (link) introducing Finite Factored Sets, a framework which I find roughly as technically interesting as logical induction.

(Update May 25: A video and blog post introducing Finite Factored Sets is now available here.)

For the last few years, a large part of my research motivation has been directed at trying to save the concept of time—save it, for example, from all the weird causal loops created by decision theory problems. This post will hopefully explain why I care so much about time, and what I think needs to be fixed.

Why Time?

My best attempt at a short description of time is that time is causality. For example, in a Pearlian Bayes net, you draw edges from earlier nodes to later nodes. To the extent that we want to think about causality, then, we will need to understand time.
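A minimal sketch of the "edges go from earlier to later" idea, assuming a hypothetical three-node network (the node names `rain`, `sprinkler`, and `wet_grass` are illustrative and not from this post). Any acyclic graph admits at least one ordering in which every parent precedes its children, and that ordering can be read as a notion of time:

```python
from graphlib import TopologicalSorter

# Hypothetical toy Bayes net: each node maps to the list of its parents.
parents = {
    "rain": [],
    "sprinkler": ["rain"],
    "wet_grass": ["rain", "sprinkler"],
}

def temporal_order(parents):
    """Return an ordering in which every parent comes before its children."""
    return list(TopologicalSorter(parents).static_order())

order = temporal_order(parents)
print(order)
```

In this particular graph only one such ordering exists; in general a Bayes net may admit several, which is one reason the structure pins down "before" and "after" only partially.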

Importantly, time is the substrate in which learning and commitments take place. When agents learn, they learn over time. The passage of time is like a ritual in which opportunities are destroyed and knowledge is created. And I think that many models of learning are subtly confused, because they are based on confused notions of time.

Time is also crucial for thinking about agency. My best short-phrase definition of agency is that agency is time travel. An agent is a mechanism through which the future is able to affect the past. An agent models the future consequences of its actions, and chooses actions on the basis of those consequences. In that sense, the consequence causes the action, in spite of the fact that the action comes earlier in the standard physical sense.

Problem: Time is Loopy

The main thing going wrong with time is that it is “loopy.”

The primary confusing thing about Newcomb’s problem is that we want to think of our decision as coming “before” the filling of the boxes, in spite of the fact that it physically comes after. This is hinting that maybe we want to understand some other “logical” time in addition to the time of physics.

However, when we attempt to do this, we run into two problems: first, we don’t understand where this logical time might come from, or how to learn it; and second, we run into some apparent temporal loops.

I am going to set aside the first problem and focus on the second.

The easiest way to see why we run into temporal loops is to notice that it seems like physical time is at least a little bit entangled with logical time.

Imagine the point of view of someone running a physics simulation of Newcomb’s problem, and tracking all of the details of all of the atoms. From that point of view, it seems like there is a useful sense in which the filling of the boxes comes before an agent’s decision to one-box or two-box. At the same time, however, those atoms compose an agent that shouldn’t make decisions as though it were helpless to change anything.

Maybe the solution here is to think of there being many different types of “before” and “after,” “cause” and “effect,” etc. For example, we could say that X is before Y from an agent-first perspective, but Y is before X from a physics-first perspective.

I think this is right, and we want to think of there as being many different systems of time (hopefully predictably interconnected). But I don’t think this resolves the whole problem.

Consider a pair of FairBot agents that successfully execute a Löbian handshake to cooperate in an open-source prisoner’s dilemma. I want to say that each agent’s cooperation causes the other agent’s cooperation in some sense. I could say that relative to each agent the causal/temporal ordering goes a different way, but I think the loop is an important part of the structure in this case. (I also am not even sure which direction of time I would want to associate with which agent.)

We also are tempted to put loops in our time/causality for other reasons. For example, when modeling a feedback loop in a system that persists over time, we might draw structures that look a lot like a Bayes net, but are not acyclic (e.g., a POMDP). We could think of this as a projection of another system that has an extra dimension of time, but it is a useful projection nonetheless.
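One way to make the loopiness concrete is to check whether a causal graph admits any consistent ordering at all. The sketch below is only an illustration under assumed names: `physics_view` and `handshake` are hypothetical graphs, and a real Löbian handshake involves proofs rather than a two-node diagram.

```python
from graphlib import CycleError, TopologicalSorter

def has_consistent_time(parents):
    """True if some ordering puts every cause before its effects."""
    try:
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Hypothetical physics-first view of Newcomb's problem: boxes, then decision.
physics_view = {"boxes_filled": [], "decision": ["boxes_filled"]}

# Hypothetical picture of two FairBots, each one's cooperation "causing" the other's.
handshake = {
    "a_cooperates": ["b_cooperates"],
    "b_cooperates": ["a_cooperates"],
}

print(has_consistent_time(physics_view))
print(has_consistent_time(handshake))
```

The acyclic graph can be ordered, while the mutual-cooperation graph cannot, so it does not fit the acyclic structure that a Bayes net requires.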

Solution: Abstraction

My main hope for recovering a coherent notion of time and unraveling these temporal loops is via abstraction.

In the example where the agent chooses actions based on their consequences, I think that there is an abstract model of the consequences that comes causally before the choice of action, which comes before the actual physical consequences.

In Newcomb’s problem, I want to say that there is an abstract model of the action that comes causally before the filling of the boxes.

In the open-source prisoner’s dilemma, I want to say that there is an abstract proof of cooperation that comes causally before the actual program traces of the agents.

All of this is pointing in the same direction: We need to have coarse abstract versions of structures come at a different time than more refined versions of the same structure. Maybe when we correctly allow for different levels of description having different links in the causal chain, we can unravel all of the time loops.

But How?

Unfortunately, our best understanding of time is Pearlian causality, and Pearlian causality does not do great with abstraction.

Pearl has Bayes nets with a bunch of variables, but when some of those variables are coarse abstract versions of other variables, then we have to allow for determinism, since some of our variables will be deterministic functions of each other; and the best parts of Pearl do not do well with determinism.

But the problem runs deeper than that. If we draw an arrow in the direction of the deterministic function, we will be drawing an arrow of time from the more refined version of the structure to the coarser version of that structure, which is in the opposite direction of all of our examples.

Maybe we could avoid drawing this arrow from the more refined node to the coarser node, and instead have a path from the coarser node to the refined node. But then we could just make another copy of the coarser node that is deterministically downstream of the more refined node, adding no new degrees of freedom. What is then stopping us from swapping the two copies of the coarser node?
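The swapping worry can be illustrated with a small enumeration. This is a sketch under assumed specifics: the refined variable takes values 0 to 3, the coarse variable is its integer half, and the names `c_up` and `c_down` for the two copies of the coarse node are hypothetical.

```python
from fractions import Fraction

def joint():
    """Enumerate P(c_up, x, c_down) for a chain coarse -> refined -> coarse copy."""
    dist = {}
    for c_up in (0, 1):                      # coarse node, sampled first
        for x in (2 * c_up, 2 * c_up + 1):   # refined node, consistent with c_up
            c_down = x // 2                  # deterministic copy of the coarse node
            dist[(c_up, x, c_down)] = Fraction(1, 4)
    return dist

def swap_copies(dist):
    """Exchange the roles of the upstream and downstream coarse nodes."""
    return {(c_down, x, c_up): p for (c_up, x, c_down), p in dist.items()}

d = joint()
print(all(c_up == c_down for (c_up, _, c_down) in d))
print(swap_copies(d) == d)
```

Because the two copies agree in every outcome, the joint distribution is unchanged by the swap, so nothing in the probabilities alone says which copy should sit upstream of the refined node.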

Overall, it seems to me that Pearl is not ready for some of the nodes to be abstract versions of other nodes, which I think needs to be fixed in order to save time.

Discussion on: LessWrong

Our all-time largest donation, and major crypto support from Vitalik Buterin

May 13, 2021 | Colm Ó Riain | News

I’m thrilled to announce two major donations to MIRI!

First, a long-time supporter has given MIRI by far our largest donation ever: $2.5 million per year over the next four years, and an additional ~$5.6 million in 2025.

This anonymous donation comes from a cryptocurrency investor who previously donated $1.01M in ETH to MIRI in 2017. Their amazingly generous new donation comes in the form of 3001 MKR, governance tokens used in MakerDAO, a stablecoin project on the Ethereum blockchain. MIRI liquidated the donated MKR for $15,592,829 after receiving it. With this donation, the anonymous donor becomes our largest all-time supporter.

This donation is subject to a time restriction whereby MIRI can spend a maximum of $2.5M of the gift in each of the next four calendar years, 2021–2024. The remaining $5,592,829 becomes available in 2025.

Second, in other amazing news, the inventor and co-founder of Ethereum, Vitalik Buterin, yesterday gave us a surprise donation of 1050 ETH, worth $4,378,159.

This is the third-largest contribution to MIRI’s research program to date, after Open Philanthropy’s ~$7.7M grant in 2020 and the anonymous donation above.

Vitalik has previously donated over $1M to MIRI, including major support in our 2017 fundraiser.

We’re beyond grateful for these two unprecedented individual gifts! Both donors have our heartfelt thanks.
