# MIRI’s 2018 Fundraiser

|   |  News

Update January 2019: MIRI’s 2018 fundraiser is now concluded.

MIRI is a math/CS research nonprofit with a mission of maximizing the potential humanitarian benefit of smarter-than-human artificial intelligence. You can learn more about the kind of work we do in “Ensuring Smarter-Than-Human Intelligence Has A Positive Outcome” and “Embedded Agency.”

Our funding targets this year are based on a goal of raising enough in 2018 to match our “business-as-usual” budget next year. We view “make enough each year to pay for the next year” as a good heuristic for MIRI, given that we’re a quickly growing nonprofit with a healthy level of reserves and a budget dominated by researcher salaries.

# 2018 Update: Our New Research Directions

|   |  MIRI Strategy, News

For many years, MIRI’s goal has been to resolve enough fundamental confusions around alignment and intelligence to enable humanity to think clearly about technical AI safety risks, and to do this before this technology advances to the point of potential catastrophe. This goal has always seemed to us to be difficult, but possible.[1]

Last year, we said that we were beginning a new research program aimed at this goal.[2] Here, we’re going to provide background on how we’re thinking about this new set of research directions, lay out some of the thinking behind our recent decision to do less default sharing of our research, and make the case for interested software engineers to join our team and help push our understanding forward.

1. This post is an amalgam put together by a variety of MIRI staff. The byline saying “Nate” means that I (Nate) endorse the post, and that many of the concepts and themes come in large part from me, and I wrote a decent number of the words. However, I did not write all of the words, and the concepts and themes were built in collaboration with a bunch of other MIRI staff. (This is roughly what bylines have meant on the MIRI blog for a while now, and it’s worth noting explicitly.)
2. See our 2017 strategic update and fundraiser posts for more details.

# Embedded Curiosities

|   |  Analysis

This is the conclusion of the Embedded Agency series.

A final word on curiosity, and intellectual puzzles:

I described an embedded agent, Emmy, and said that I don’t understand how she evaluates her options, models the world, models herself, or decomposes and solves problems.

In the past, when researchers have talked about motivations for working on problems like these, they’ve generally focused on the motivation from AI risk. AI researchers want to build machines that can solve problems in the general-purpose fashion of a human, and dualism is not a realistic framework for thinking about such systems. In particular, it’s an approximation that’s especially prone to breaking down as AI systems get smarter. When people figure out how to build general AI systems, we want those researchers to be in a better position to understand their systems, analyze their internal properties, and be confident in their future behavior.

This is the motivation for most researchers today who are working on things like updateless decision theory and subsystem alignment. We care about basic conceptual puzzles which we think we need to figure out in order to achieve confidence in future AI systems, and not have to rely quite so much on brute-force search or trial and error.

But the arguments for why we may or may not need particular conceptual insights in AI are pretty long. I haven’t tried to wade into the details of that debate here. Instead, I’ve been discussing a particular set of research directions as an intellectual puzzle, and not as an instrumental strategy.

One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern. But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.

What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.

If AI developers in the future are still working with these confused and incomplete basic concepts as they try to actually build powerful real-world optimizers, that seems like a bad position to be in. And it seems like the research community is unlikely to figure most of this out by default in the course of just trying to develop more capable systems. Evolution certainly figured out how to build human brains without “understanding” any of this, via brute-force search.

Embedded agency is my way of trying to point at what I think is a very important and central place where I feel confused, and where I think future researchers risk running into confusions too.

There’s also a lot of excellent AI alignment research that’s being done with an eye toward more direct applications; but I think of that safety research as having a different type signature than the puzzles I’ve talked about here.

Intellectual curiosity isn’t the ultimate reason we privilege these research directions. But there are some practical advantages to orienting toward research questions from a place of curiosity at times, as opposed to only applying the “practical impact” lens to how we think about the world.

When we apply the curiosity lens to the world, we orient toward the sources of confusion preventing us from seeing clearly; the blank spots in our map, the flaws in our lens. It encourages re-checking assumptions and attending to blind spots, which is helpful as a psychological counterpoint to our “instrumental strategy” lens—the latter being more vulnerable to the urge to lean on whatever shaky premises we have on hand so we can get to more solidity and closure in our early thinking.

Embedded agency is an organizing theme behind most, if not all, of our big curiosities. It seems like a central mystery underlying many concrete difficulties.

# Subsystem Alignment

|   |  Analysis

You want to figure something out, but you don’t know how to do that yet.

You have to somehow break up the task into sub-computations. There is no atomic act of “thinking”; intelligence must be built up of non-intelligent parts.

The agent being made of parts is part of what made counterfactuals hard, since the agent may have to reason about impossible configurations of those parts.

Being made of parts is what makes self-reasoning and self-modification even possible.

What we’re primarily going to discuss in this section, though, is another problem: when the agent is made of parts, there could be adversaries not just in the external environment, but inside the agent as well.

This cluster of problems is Subsystem Alignment: ensuring that subsystems are not working at cross purposes; avoiding subprocesses optimizing for unintended goals.

• benign induction
• benign optimization
• transparency
• inner optimizers

# Robust Delegation

|   |  Analysis

Because the world is big, the agent as it is may be inadequate to accomplish its goals, including in its ability to think.

Because the agent is made of parts, it can improve itself and become more capable.

Improvements can take many forms: The agent can make tools, the agent can make successor agents, or the agent can just learn and grow over time. However, the successors or tools need to be more capable for this to be worthwhile.

This gives rise to a special type of principal/agent problem:

You have an initial agent, and a successor agent. The initial agent gets to decide exactly what the successor agent looks like. The successor agent, however, is much more intelligent and powerful than the initial agent. We want to know how to have the successor agent robustly optimize the initial agent’s goals.

The problem is not (just) that the successor agent might be malicious. The problem is that we don’t even know what it means not to be.

This problem seems hard from both points of view.

The initial agent needs to figure out how reliable and trustworthy a more powerful agent is, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.

At first, this may look like a less fundamental problem than “make decisions” or “have models”. But the view on which there are multiple forms of the “build a successor” problem is a dualistic view.

To an embedded agent, the future self is not privileged; it is just another part of the environment. There isn’t a deep difference between building a successor that shares your goals, and just making sure your own goals stay the same over time.

So, although I talk about “initial” and “successor” agents, remember that this isn’t just about the narrow problem humans currently face of aiming a successor. This is about the fundamental problem of being an agent that persists and learns over time.

We call this cluster of problems Robust Delegation.

# Embedded World-Models

|   |  Analysis

An agent which is larger than its environment can:

• Hold an exact model of the environment in its head.
• Think through the consequences of every potential course of action.
• If it doesn’t know the environment perfectly, hold in its head every possible way the environment could be, as is the case with Bayesian uncertainty.

All of these are typical of notions of rational agency.

An embedded agent can’t do any of those things, at least not in any straightforward way.

One difficulty is that, since the agent is part of the environment, modeling the environment in every detail would require the agent to model itself in every detail, which would require the agent’s self-model to be as “big” as the whole agent. An agent can’t fit inside its own head.

The lack of a crisp agent/environment boundary forces us to grapple with paradoxes of self-reference. As if representing the rest of the world weren’t already hard enough.

Embedded World-Models have to represent the world in a way more appropriate for embedded agents. Problems in this cluster include:

• the “realizability” / “grain of truth” problem: the real world isn’t in the agent’s hypothesis space
• logical uncertainty
• high-level models
• multi-level models
• ontological crises
• naturalized induction, the problem that the agent must incorporate its model of itself into its world-model
• anthropic reasoning, the problem of reasoning about how many copies of yourself exist
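The realizability problem in the list above can be made concrete with a toy sketch. It assumes a coin-flipping environment and a Bayesian agent whose hypothesis class is a short list of candidate biases; the names `bayes_update` and `posterior_after`, and all the numbers, are hypothetical choices for illustration. The true bias is deliberately left out of the hypothesis class, so the posterior stays a valid distribution but can never place weight on the truth.

```python
import random


def bayes_update(prior, likelihoods):
    """One step of Bayes' rule over a finite hypothesis class."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def posterior_after(flips, biases):
    """Posterior over candidate coin biases after observing `flips`.

    `flips` is a sequence of booleans (True means heads); `biases` is the
    agent's whole hypothesis space, starting from a uniform prior.
    """
    posterior = [1.0 / len(biases)] * len(biases)
    for heads in flips:
        likelihoods = [b if heads else 1.0 - b for b in biases]
        posterior = bayes_update(posterior, likelihoods)
    return posterior


# Assumed toy setup: the environment's true bias is NOT among the hypotheses.
rng = random.Random(0)
true_bias = 0.5
hypotheses = [0.2, 0.8]
flips = [rng.random() < true_bias for _ in range(200)]

post = posterior_after(flips, hypotheses)
for bias, weight in zip(hypotheses, post):
    print(f"P(bias = {bias}) = {weight:.4f}")
```

However the posterior ends up split, it is split between two wrong hypotheses. This is a far milder version of the situation an embedded agent faces, where the real world is too large to appear in its hypothesis space at all.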

# Decision Theory

|   |  Analysis

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \ \in \ Actions}{\mathrm{argmax}} \ \ f(a).$$

That is, maximize some function of the action. This tends to assume that we can disentangle things enough to see outcomes as a function of actions.

For example, AIXI represents the agent and the environment as separate units which interact over time through clearly defined i/o channels, so that it can then choose actions maximizing reward.
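As a minimal sketch of the argmax expression above, the following assumes a small finite action set and a scoring function `f` handed to the agent from outside. The helper `argmax_action` and the toy payoff table are hypothetical, not taken from AIXI or any particular system. The point is how much the dualistic picture takes for granted: the actions are enumerable, and `f` is a well-defined function of the action alone.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")


def argmax_action(actions: Iterable[A], f: Callable[[A], float]) -> A:
    """Return the action that maximizes f; ties go to the first maximizer."""
    return max(actions, key=f)


# Assumed toy payoffs: each action maps directly to an outcome value.
payoff = {"left": 1.0, "right": 3.0, "wait": 2.0}

best = argmax_action(payoff, lambda a: payoff[a])
print(best)
```

Everything difficult is hidden in the assumption that `payoff` exists as a clean table from actions to outcomes.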

When the agent model is a part of the environment model, it can be significantly less clear how to consider taking alternative actions.

For example, because the agent is smaller than the environment, there can be other copies of the agent, or things very similar to the agent. This leads to contentious decision-theory problems such as the Twin Prisoner’s Dilemma and Newcomb’s problem.

If Emmy Model 1 and Emmy Model 2 have had the same experiences and are running the same source code, should Emmy Model 1 act like her decisions are steering both robots at once? Depending on how you draw the boundary around “yourself”, you might think you control the action of both copies, or only your own.
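The boundary question can be illustrated with a toy Twin Prisoner’s Dilemma. The payoff numbers below are hypothetical, and the two functions are just two ways of filling in f(a), not an endorsement of either. In the first, Emmy treats her twin’s action as a fixed fact about the environment; in the second, she treats the twin’s action as moving together with her own.

```python
# Assumed payoffs to "me", keyed by (my_action, twin_action).
# "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def best_if_twin_fixed(twin_action):
    """Argmax when the twin's action is held fixed, independent of mine."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, twin_action)])


def best_if_twin_mirrors():
    """Argmax when the twin, running the same code, takes the same action."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


for twin in ACTIONS:
    print(f"twin fixed at {twin}: choose {best_if_twin_fixed(twin)}")
print(f"twin mirrors me: choose {best_if_twin_mirrors()}")
```

The two evaluations recommend different actions from the same payoff table, so the disagreement is entirely about which counterfactual to feed into the argmax.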

Problems of adapting decision theory to embedded agents include:

• counterfactuals
• Newcomblike reasoning, in which the agent interacts with copies of itself
• reasoning about other agents more broadly
• extortion problems
• coordination problems
• logical counterfactuals
• logical updatelessness