Embedded Curiosities

November 8, 2018 | Abram Demski | Analysis

This is the conclusion of the Embedded Agency series. Previous posts:

Embedded Agents — Decision Theory — Embedded World-Models
Robust Delegation — Subsystem Alignment

A final word on curiosity, and intellectual puzzles:

I described an embedded agent, Emmy, and said that I don’t understand how she evaluates her options, models the world, models herself, or decomposes and solves problems.

In the past, when researchers have talked about motivations for working on problems like these, they’ve generally focused on the motivation from AI risk. AI researchers want to build machines that can solve problems in the general-purpose fashion of a human, and dualism is not a realistic framework for thinking about such systems. In particular, it’s an approximation that’s especially prone to breaking down as AI systems get smarter. When people figure out how to build general AI systems, we want those researchers to be in a better position to understand their systems, analyze their internal properties, and be confident in their future behavior.

This is the motivation for most researchers today who are working on things like updateless decision theory and subsystem alignment. We care about basic conceptual puzzles which we think we need to figure out in order to achieve confidence in future AI systems, and not have to rely quite so much on brute-force search or trial and error.

But the arguments for why we may or may not need particular conceptual insights in AI are pretty long. I haven’t tried to wade into the details of that debate here. Instead, I’ve been discussing a particular set of research directions as an intellectual puzzle, and not as an instrumental strategy.

One downside of discussing these problems as instrumental strategies is that it can lead to some misunderstandings about why we think this kind of work is so important. With the “instrumental strategies” lens, it’s tempting to draw a direct line from a given research problem to a given safety concern. But it’s not that I’m imagining real-world embedded systems being “too Bayesian” and this somehow causing problems, if we don’t figure out what’s wrong with current models of rational agency. It’s certainly not that I’m imagining future AI systems being written in second-order logic! In most cases, I’m not trying at all to draw direct lines between research problems and specific AI failure modes.

What I’m instead thinking about is this: We sure do seem to be working with the wrong basic concepts today when we try to think about what agency is, as seen by the fact that these concepts don’t transfer well to the more realistic embedded framework.

If AI developers in the future are still working with these confused and incomplete basic concepts as they try to actually build powerful real-world optimizers, that seems like a bad position to be in. And it seems like the research community is unlikely to figure most of this out by default in the course of just trying to develop more capable systems. Evolution certainly figured out how to build human brains without “understanding” any of this, via brute-force search.

Embedded agency is my way of trying to point at what I think is a very important and central place where I feel confused, and where I think future researchers risk running into confusions too.

There’s also a lot of excellent AI alignment research that’s being done with an eye toward more direct applications; but I think of that safety research as having a different type signature than the puzzles I’ve talked about here.

Intellectual curiosity isn’t the ultimate reason we privilege these research directions. But there are some practical advantages to orienting toward research questions from a place of curiosity at times, as opposed to only applying the “practical impact” lens to how we think about the world.

When we apply the curiosity lens to the world, we orient toward the sources of confusion preventing us from seeing clearly; the blank spots in our map, the flaws in our lens. It encourages re-checking assumptions and attending to blind spots, which is helpful as a psychological counterpoint to our “instrumental strategy” lens—the latter being more vulnerable to the urge to lean on whatever shaky premises we have on hand so we can get to more solidity and closure in our early thinking.

Embedded agency is an organizing theme behind most, if not all, of our big curiosities. It seems like a central mystery underlying many concrete difficulties.

Subsystem Alignment

November 6, 2018 | Scott Garrabrant | Analysis

You want to figure something out, but you don’t know how to do that yet.

You have to somehow break up the task into sub-computations. There is no atomic act of “thinking”; intelligence must be built up of non-intelligent parts.

The agent being made of parts is part of what made counterfactuals hard, since the agent may have to reason about impossible configurations of those parts.

Being made of parts is what makes self-reasoning and self-modification even possible.

What we’re primarily going to discuss in this section, though, is another problem: when the agent is made of parts, there could be adversaries not just in the external environment, but inside the agent as well.

This cluster of problems is Subsystem Alignment: ensuring that subsystems are not working at cross purposes; avoiding subprocesses optimizing for unintended goals.

benign induction

benign optimization

transparency

mesa-optimizers

Robust Delegation

November 4, 2018 | Abram Demski | Analysis

Self-improvement

Because the world is big, the agent as it is may be inadequate to accomplish its goals, including in its ability to think.

Because the agent is made of parts, it can improve itself and become more capable.

Improvements can take many forms: The agent can make tools, the agent can make successor agents, or the agent can just learn and grow over time. However, the successors or tools need to be more capable for this to be worthwhile.

This gives rise to a special type of principal/agent problem:

You have an initial agent, and a successor agent. The initial agent gets to decide exactly what the successor agent looks like. The successor agent, however, is much more intelligent and powerful than the initial agent. We want to know how to have the successor agent robustly optimize the initial agent’s goals.

Here are three examples of forms this principal/agent problem can take:

Three principal-agent problems in robust delegation

In the AI alignment problem, a human is trying to build an AI system which can be trusted to help with the human’s goals.

In the tiling agents problem, an agent is trying to make sure it can trust its future selves to help with its own goals.

Or we can consider a harder version of the tiling problem—stable self-improvement—where an AI system has to build a successor which is more intelligent than itself, while still being trustworthy and helpful.

For a human analogy which involves no AI, you can think about the problem of succession in royalty, or more generally the problem of setting up organizations to achieve desired goals without losing sight of their purpose over time.

The difficulty seems to be twofold:

First, a human or AI agent may not fully understand itself and its own goals. If an agent can’t write out what it wants in exact detail, that makes it hard for it to guarantee that its successor will robustly help with the goal.

Second, the idea behind delegating work is that you not have to do all the work yourself. You want the successor to be able to act with some degree of autonomy, including learning new things that you don’t know, and wielding new skills and capabilities.

In the limit, a really good formal account of robust delegation should be able to handle arbitrarily capable successors without throwing up any errors—like a human or AI building an unbelievably smart AI, or like an agent that just keeps learning and growing for so many years that it ends up much smarter than its past self.

The problem is not (just) that the successor agent might be malicious. The problem is that we don’t even know what it means not to be.

This problem seems hard from both points of view.

Successors

The initial agent needs to figure out how reliable and trustworthy something more powerful than it is, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.

At first, this may look like a less fundamental problem than “make decisions” or “have models”. But the view on which there are multiple forms of the “build a successor” problem is itself a dualistic view.

To an embedded agent, the future self is not privileged; it is just another part of the environment. There isn’t a deep difference between building a successor that shares your goals, and just making sure your own goals stay the same over time.

So, although I talk about “initial” and “successor” agents, remember that this isn’t just about the narrow problem humans currently face of aiming a successor. This is about the fundamental problem of being an agent that persists and learns over time.

We call this cluster of problems Robust Delegation. Examples include:

Vingean reflection

the tiling problem

averting Goodhart’s law

value loading

corrigibility

informed oversight

Embedded World-Models

November 2, 2018 | Scott Garrabrant | Analysis

An agent which is larger than its environment can:

Hold an exact model of the environment in its head.
Think through the consequences of every potential course of action.
If it doesn’t know the environment perfectly, hold every possible way the environment could be in its head, as is the case with Bayesian uncertainty.

All of these are typical of notions of rational agency.

An embedded agent can’t do any of those things, at least not in any straightforward way.

Emmy the embedded agent

One difficulty is that, since the agent is part of the environment, modeling the environment in every detail would require the agent to model itself in every detail, which would require the agent’s self-model to be as “big” as the whole agent. An agent can’t fit inside its own head.

The lack of a crisp agent/environment boundary forces us to grapple with paradoxes of self-reference. As if representing the rest of the world weren’t already hard enough.

Embedded World-Models have to represent the world in a way more appropriate for embedded agents. Problems in this cluster include:

the “realizability” / “grain of truth” problem: the real world isn’t in the agent’s hypothesis space

logical uncertainty

high-level models

multi-level models

ontological crises

naturalized induction, the problem that the agent must incorporate its model of itself into its world-model

anthropic reasoning, the problem of reasoning with how many copies of yourself exist

Decision Theory

October 31, 2018 | Abram Demski | Analysis

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \ \in \ Actions}{\mathrm{argmax}} \ \ f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.

For example, AIXI represents the agent and the environment as separate units which interact over time through clearly defined i/o channels, so that it can then choose actions maximizing reward.

AIXI

When the agent model is a part of the environment model, it can be significantly less clear how to consider taking alternative actions.

Embedded agent

For example, because the agent is smaller than the environment, there can be other copies of the agent, or things very similar to the agent. This leads to contentious decision-theory problems such as the Twin Prisoner’s Dilemma and Newcomb’s problem.

If Emmy Model 1 and Emmy Model 2 have had the same experiences and are running the same source code, should Emmy Model 1 act like her decisions are steering both robots at once? Depending on how you draw the boundary around “yourself”, you might think you control the action of both copies, or only your own.

This is an instance of the problem of counterfactual reasoning: how do we evaluate hypotheticals like “What if the sun suddenly went out”?

Problems of adapting decision theory to embedded agents include:

counterfactuals

Newcomblike reasoning, in which the agent interacts with copies of itself

reasoning about other agents more broadly

extortion problems

coordination problems

logical counterfactuals

logical updatelessness

October 2018 Newsletter

October 29, 2018 | Rob Bensinger | Newsletters

Announcing the new AI Alignment Forum

October 29, 2018 | Guest | Guest Posts, News

This is a guest post by Oliver Habryka, lead developer for LessWrong. Our gratitude to the LessWrong team for the hard work they’ve put into developing this resource, and our congratulations on today’s launch!

I am happy to announce that after two months of open beta, the AI Alignment Forum is launching today. The AI Alignment Forum is a new website built by the team behind LessWrong 2.0, to help create a new hub for technical AI alignment research and discussion.

One of our core goals when we designed the forum was to make it easier for new people to get started on doing technical AI alignment research. This effort was split into two major parts:

Embedded Agents

October 29, 2018 | Scott Garrabrant | Analysis

Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don’t already know.¹

There’s a complicated engineering problem here. But there’s also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work?

In this series of posts, I’ll point to four ways we don’t currently know how it works, and four areas of active research aimed at figuring it out.

This is Alexei, and Alexei is playing a video game.

Alexei the dualistic agent

Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller.

The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen.

Alexei is also very smart, and capable of holding the entire video game inside his mind. If Alexei has any uncertainty, it is only over empirical facts like what game he is playing, and not over logical facts like which inputs (for a given deterministic game) will yield which outputs. This means that Alexei must also store inside his mind every possible game he could be playing.

Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based off of value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks.

In fact, Alexei can treat himself as an unchanging indivisible atom. Since he doesn’t exist in the environment he’s thinking about, Alexei doesn’t worry about whether he’ll change over time, or about any subroutines he might have to run.

Notice that all the properties I talked about are partially made possible by the fact that Alexei is cleanly separated from the environment that he is optimizing.
Read more »

This is part 1 of the Embedded Agency series, by Abram Demski and Scott Garrabrant. ↩

Embedded Curiosities

Subsystem Alignment

Robust Delegation

Embedded World-Models

Decision Theory

October 2018 Newsletter

Announcing the new AI Alignment Forum

Embedded Agents

Search

Browse

Subscribe

Other updates

News and links

Search

Browse

Subscribe