# Robust Delegation

|   |  Analysis

Because the world is big, the agent as it is may be inadequate to accomplish its goals, including in its ability to think.

Because the agent is made of parts, it can improve itself and become more capable.

Improvements can take many forms: The agent can make tools, the agent can make successor agents, or the agent can just learn and grow over time. However, the successors or tools need to be more capable for this to be worthwhile.

This gives rise to a special type of principal/agent problem:

You have an initial agent, and a successor agent. The initial agent gets to decide exactly what the successor agent looks like. The successor agent, however, is much more intelligent and powerful than the initial agent. We want to know how to have the successor agent robustly optimize the initial agent’s goals.

The problem is not (just) that the successor agent might be malicious. The problem is that we don’t even know what it means not to be.

This problem seems hard from both points of view.

The initial agent needs to figure out how reliable and trustworthy something more powerful than itself is, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.

At first, this may look like a less fundamental problem than “make decisions” or “have models”. But the view on which there are multiple forms of the “build a successor” problem is a dualistic view.

To an embedded agent, the future self is not privileged; it is just another part of the environment. There isn’t a deep difference between building a successor that shares your goals, and just making sure your own goals stay the same over time.

So, although I talk about “initial” and “successor” agents, remember that this isn’t just about the narrow problem humans currently face of aiming a successor. This is about the fundamental problem of being an agent that persists and learns over time.

We call this cluster of problems Robust Delegation. Examples include:

# Embedded World-Models

|   |  Analysis

An agent which is larger than its environment can:

• Hold an exact model of the environment in its head.
• Think through the consequences of every potential course of action.
• If it doesn’t know the environment perfectly, hold every possible way the environment could be in its head, as is the case with Bayesian uncertainty.

All of these are typical of notions of rational agency.

An embedded agent can’t do any of those things, at least not in any straightforward way.

One difficulty is that, since the agent is part of the environment, modeling the environment in every detail would require the agent to model itself in every detail, which would require the agent’s self-model to be as “big” as the whole agent. An agent can’t fit inside its own head.

The lack of a crisp agent/environment boundary forces us to grapple with paradoxes of self-reference. As if representing the rest of the world weren’t already hard enough.

Embedded World-Models have to represent the world in a way more appropriate for embedded agents. Problems in this cluster include:

• the “realizability” / “grain of truth” problem: the real world isn’t in the agent’s hypothesis space
• logical uncertainty
• high-level models
• multi-level models
• ontological crises
• naturalized induction, the problem that the agent must incorporate its model of itself into its world-model
• anthropic reasoning, the problem of reasoning about how many copies of yourself exist

# Decision Theory

|   |  Analysis

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \ \in \ Actions}{\mathrm{argmax}} \ \ f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.
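As a rough illustration of the expression above, here is a minimal Python sketch. It assumes a finite action set and a function `f` that can be evaluated directly on each action, which is exactly the kind of clean separation the text goes on to question. The names `best_action` and `payoff`, and the payoff values, are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")


def best_action(actions: Iterable[A], f: Callable[[A], float]) -> A:
    """Return an action a in `actions` that maximizes f(a).

    Assumes outcomes really are a function of the action alone.
    Ties are broken in favor of the first maximizer encountered.
    """
    return max(actions, key=f)


# Hypothetical toy payoff table: each action maps straight to a score.
payoff = {"left": 1.0, "right": 3.0, "wait": 2.0}

choice = best_action(payoff, lambda a: payoff[a])
print(choice)  # right
```

The whole difficulty discussed in this section is hidden inside `f`: the sketch only works once something has already told us what each action leads to.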

For example, AIXI represents the agent and the environment as separate units which interact over time through clearly defined i/o channels, so that it can then choose actions maximizing reward.

When the agent model is a part of the environment model, it can be significantly less clear how to consider taking alternative actions.

For example, because the agent is smaller than the environment, there can be other copies of the agent, or things very similar to the agent. This leads to contentious decision-theory problems such as the Twin Prisoner’s Dilemma and Newcomb’s problem.

If Emmy Model 1 and Emmy Model 2 have had the same experiences and are running the same source code, should Emmy Model 1 act like her decisions are steering both robots at once? Depending on how you draw the boundary around “yourself”, you might think you control the action of both copies, or only your own.
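The two ways of drawing the boundary can be made concrete with a small Python sketch of the Twin Prisoner’s Dilemma. The payoff numbers and function names below are assumptions chosen for illustration (a conventional Prisoner’s Dilemma ordering), not something taken from the post, and the sketch does not settle which way of reasoning is correct.

```python
# Hypothetical payoffs to "me", indexed by (my_action, twin_action).
# "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): 2,
    ("C", "D"): 0,
    ("D", "C"): 3,
    ("D", "D"): 1,
}
ACTIONS = ("C", "D")


def choose_controlling_only_self(twin_action: str) -> str:
    """Treat the twin's action as a fixed fact that my choice cannot affect."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, twin_action)])


def choose_controlling_both() -> str:
    """Treat my choice as also fixing the twin's (identical) choice."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


# Holding the twin fixed, defecting is better whatever the twin does.
print([choose_controlling_only_self(t) for t in ACTIONS])  # ['D', 'D']

# Steering both copies at once, cooperating is better.
print(choose_controlling_both())  # C
```

Both functions compute an argmax; they differ only in which parts of the world are taken to vary with the action.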

Problems of adapting decision theory to embedded agents include:

• counterfactuals
• Newcomblike reasoning, in which the agent interacts with copies of itself
• extortion problems
• coordination problems
• logical counterfactuals
• logical updatelessness

# Announcing the new AI Alignment Forum

|   |  Guest Posts, News

This is a guest post by Oliver Habryka, lead developer for LessWrong. Our gratitude to the LessWrong team for the hard work they’ve put into developing this resource, and our congratulations on today’s launch!

I am happy to announce that after two months of open beta, the AI Alignment Forum is launching today. The AI Alignment Forum is a new website built by the team behind LessWrong 2.0, to help create a new hub for technical AI alignment research and discussion.

One of our core goals when we designed the forum was to make it easier for new people to get started on doing technical AI alignment research. This effort was split into two major parts:

# Embedded Agents

|   |  Analysis

Suppose you want to build a robot to achieve some real-world goal for you, a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know.[1]

There’s a complicated engineering problem here. But there’s also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work?

In this series of posts, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out.

This is Alexei, and Alexei is playing a video game.

Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller.

The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen.

Alexei is also very smart, and capable of holding the entire video game inside his mind. If Alexei has any uncertainty, it is only over empirical facts like what game he is playing, and not over logical facts like which inputs (for a given deterministic game) will yield which outputs. This means that Alexei must also store inside his mind every possible game he could be playing.
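Alexei’s situation can be sketched in Python under strong simplifying assumptions: games are deterministic functions from a list of button presses to a list of pixels, there are only three candidate games, and buttons and pixels are single bits. The game names, the uniform prior, and the `posterior` helper are all hypothetical.

```python
from fractions import Fraction
from typing import Callable, Dict, List

Game = Callable[[List[int]], List[int]]

# Three hypothetical deterministic games: button presses in, pixels out.
GAMES: Dict[str, Game] = {
    "echo": lambda presses: list(presses),
    "invert": lambda presses: [1 - p for p in presses],
    "constant": lambda presses: [0 for _ in presses],
}


def posterior(prior: Dict[str, Fraction],
              presses: List[int],
              pixels: List[int]) -> Dict[str, Fraction]:
    """Bayesian update over which game is being played.

    Because each game is deterministic, the likelihood is 1 when the
    game's output matches the observed pixels and 0 otherwise.
    """
    weights = {
        name: p * (1 if GAMES[name](presses) == pixels else 0)
        for name, p in prior.items()
    }
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no candidate game matches the observation")
    return {name: w / total for name, w in weights.items()}


prior = {name: Fraction(1, 3) for name in GAMES}

# Pressing only 0s cannot tell "echo" apart from "constant".
after_first = posterior(prior, [0, 0], [0, 0])

# Pressing a 1 separates them.
after_second = posterior(after_first, [1, 0], [1, 0])
print(after_second["echo"])  # 1
```

All of Alexei’s uncertainty lives in the weights over `GAMES`; once a game is fixed, its outputs are simply computed, which is the sense in which he has empirical but not logical uncertainty.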

Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based on value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks.

In fact, Alexei can treat himself as an unchanging indivisible atom. Since he doesn’t exist in the environment he’s thinking about, Alexei doesn’t worry about whether he’ll change over time, or about any subroutines he might have to run.

Notice that all the properties I talked about are partially made possible by the fact that Alexei is cleanly separated from the environment that he is optimizing.

1. This is part 1 of the Embedded Agency series, by Abram Demski and Scott Garrabrant.

# The Rocket Alignment Problem

|   |  Analysis

The following is a fictional dialogue building off of AI Alignment: Why It’s Hard, and Where to Start.

(Somewhere in a not-very-near neighboring world, where science took a very different course…)

ALFONSO:  Hello, Beth. I’ve noticed a lot of speculations lately about “spaceplanes” being used to attack cities, or possibly becoming infused with malevolent spirits that inhabit the celestial realms so that they turn on their own engineers.

I’m rather skeptical of these speculations. Indeed, I’m a bit skeptical that airplanes will be able to even rise as high as stratospheric weather balloons anytime in the next century. But I understand that your institute wants to address the potential problem of malevolent or dangerous spaceplanes, and that you think this is an important present-day cause.

BETH:  That’s… really not how we at the Mathematics of Intentional Rocketry Institute would phrase things.

The problem of malevolent celestial spirits is what all the news articles are focusing on, but we think the real problem is something entirely different. We’re worried that there’s a difficult, theoretically challenging problem which modern-day rocket punditry is mostly overlooking. We’re worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon.

ALFONSO:  I understand that it’s very important to design fins that can stabilize a spaceplane’s flight in heavy winds. That’s important spaceplane safety research and someone needs to do it.

But if you were working on that sort of safety research, I’d expect you to be collaborating tightly with modern airplane engineers to test out your fin designs, to demonstrate that they are actually useful.

BETH:  Aerodynamic designs are important features of any safe rocket, and we’re quite glad that rocket scientists are working on these problems and taking safety seriously. That’s not the sort of problem that we at MIRI focus on, though.

ALFONSO:  What’s the concern, then? Do you fear that spaceplanes may be developed by ill-intentioned people?

BETH:  That’s not the failure mode we’re worried about right now. We’re more worried that right now, nobody can tell you how to point your rocket’s nose such that it goes to the Moon, nor indeed any prespecified celestial destination. Whether Google or the US Government or North Korea is the one to launch the rocket won’t make a pragmatic difference to the probability of a successful Moon landing from our perspective, because right now nobody knows how to aim any kind of rocket anywhere.