Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \ \in \ Actions}{\mathrm{argmax}} \ \ f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.

For example, AIXI represents the agent and the environment as separate units which interact over time through clearly defined i/o channels, so that it can then choose actions maximizing reward.

When the agent model is a part of the environment model, it can be significantly less clear how to consider taking alternative actions.

For example, because the agent is smaller than the environment, there can be other copies of the agent, or things very similar to the agent. This leads to contentious decision-theory problems such as the Twin Prisoner’s Dilemma and Newcomb’s problem.

If Emmy Model 1 and Emmy Model 2 have had the same experiences and are running the same source code, should Emmy Model 1 act like her decisions are steering both robots at once? Depending on how you draw the boundary around “yourself”, you might think you control the action of both copies, or only your own.

Problems of adapting **decision theory** to embedded agents include:

- counterfactuals
- Newcomblike reasoning, in which the agent interacts with copies of itself
- reasoning about other agents more broadly
- extortion problems
- coordination problems
- logical counterfactuals
- logical updatelessness

The difficulty with counterfactuals can be illustrated by the five-and-ten problem. Suppose we have the option of taking a five dollar bill or a ten dollar bill, and all we care about in the situation is how much money we get. Obviously, we should take the $10.

However, it is not so easy as it seems to reliably take the $10 when the agent knows its own behavior. If you reason about yourself as just another part of the environment, then you can know your own action. If you can know your own action, then it becomes difficult to reason about what would happen if you took different actions. This means an agent can stably take the $5 because it believes “If I take the $10, I get $0”!

This error is coming from a confusion where we replace the intuitive counterfactual “if” with logical implication. This may seem like a silly confusion, but there is not much else we can do, because we don’t know how to formalize the counterfactual “if” correctly.

We could instead try to use probability to formalize counterfactuals, but this won’t work either. If we try to calculate the expected utility of our actions by Bayesian conditioning, as is common, knowing our own behavior leads to a divide-by-zero error when we try to calculate the expected utility of actions we know we don’t take: \(\lnot A\) implies \(P(A)=0\), which implies \(P(B \& A)=0\), which implies

$$P(B|A) = \frac{P(B \& A)}{P(A)} = \frac{0}{0}.$$

Because the agent doesn’t know how to separate itself from the environment, it gets gnashing internal gears when it tries to imagine taking different actions.

This is an instance of the problem of **counterfactual reasoning**: how do we evaluate hypotheticals like “What if the sun suddenly went out”?

The most central example of why agents need to think about counterfactuals comes from counterfactuals about their own actions.

This is especially tricky if you already know what you’re going to do, the same way “what if the sun suddenly went out” is especially tricky if you know that it won’t, or “what if 2+2=3” is especially tricky if you know 2+2=4. When the agent is part of the environment, it becomes difficult to distinguish reasoning about yourself from reasoning in general, so you run the risk of knowing your own action.

Why might an agent come to know its own action before it has acted?

Perhaps the agent is trying to plan ahead, or reason about a game-theoretic situation in which its action has an intricate role to play.

But the biggest complication comes from Löb’s Theorem. This can be illustrated more clearly by looking at the behavior of simple logic-based agents reasoning about the five-and-ten problem.

Consider this example:

We have the source code for an agent and the universe. They can refer to each other through the use of quining. The universe is simple; the universe just outputs whatever the agent outputs.

The agent spends a long time searching for proofs about what happens if it takes various actions. If for some \(x\) and \(y\) equal to \(0\), \(5\), or \(10\), it finds a proof that taking the \(5\) leads to \(x\) utility, that taking the \(10\) leads to \(y\) utility, and that \(x>y\), it will naturally take the \(5\). We expect that it won’t find such a proof, and will instead pick the default action of taking the \(10\).

It seems easy when you just imagine an agent trying to reason about the universe. Yet it turns out that if the amount of time spent searching for proofs is enough, the agent will always choose \(5\)!

The proof that this is so is by Löb’s theorem. Löb’s theorem says that, for any proposition \(P\), if you can prove that a *proof* of \(P\) would imply the *truth* of \(P\), then you can prove \(P\). In symbols, with

“\(□X\)” meaning “\(X\) is provable”:

$$□(□P \to P) \to □P.$$

In the version of the five-and-ten problem I gave, “\(P\)” is the proposition “if the agent outputs \(5\) the universe outputs \(5\), and if the agent outputs \(10\) the universe outputs \(0\)”.

Supposing it is provable, the agent will eventually find the proof, and return \(5\) in fact. This makes the sentence *true*, since the agent outputs \(5\) and the universe outputs \(5\), and since it’s false that the agent outputs \(10\). This is because false propositions like “the agent outputs \(10\)” imply everything, *including* the universe outputting \(5\).

The agent can (given enough time) prove all of this, in which case the agent in fact proves the proposition “if the agent outputs \(5\) the universe outputs \(5\), and if the agent outputs \(10\) the universe outputs \(0\)”. And as a result, the agent takes the $5.

We call this a “spurious proof”: the agent takes the $5 because it can prove that *if* it takes the $10 it has low value, *because* it takes the $5. More generally, when working in less proof-based settings, we refer to this as a problem of spurious counterfactuals.

The general pattern is: counterfactuals may spuriously mark an action as not being very good. This makes the AI not take the action. Depending on how the counterfactuals work, this may remove any feedback which would “correct” the problematic counterfactual; or, as we saw with proof-based reasoning, it may actively help the spurious counterfactual be “true”.

Note that because the proof-based examples are of significant interest to us, “counterfactuals” actually have to be **counter logicals**; we sometimes need to reason about logically impossible “possibilities”. This rules out most existing accounts of counterfactual reasoning.

You may have noticed that I slightly cheated. The only thing that broke the symmetry and caused the agent to take the $5 was the fact that “\(5\)” was the action that was taken when a proof was found, and “\(10\)” was the default. We could instead consider an agent that looks for any proof at all about what actions lead to what utilities, and then takes the action that is better. This way, which action is taken is dependent on what order we search for proofs.

Let’s assume we search for short proofs first. In this case, we will take the $10, since it is very easy to show that \(A()=5\) leads to \(U()=5\) and \(A()=10\) leads to \(U()=10\).

The problem is that spurious proofs can be short too, and don’t get much longer when the universe gets harder to predict. If we replace the universe with one that is provably functionally the same, but is harder to predict, the shortest proof will short-circuit the complicated universe and be spurious.

People often try to solve the problem of counterfactuals by suggesting that there will always be some uncertainty. An AI may know its source code perfectly, but it can’t perfectly know the hardware it is running on.

Does adding a little uncertainty solve the problem? Often not:

- The proof of the spurious counterfactual often still goes through; if you think you are in a five-and-ten problem with a 95% certainty, you can have the usual problem within that 95%.
- Adding uncertainty to make counterfactuals well-defined doesn’t get you any guarantee that the counterfactuals will be
*reasonable*. Hardware failures aren’t often what you want to expect when considering alternate actions.

Consider this scenario: You are confident that you almost always take the left path. However, it is possible (though unlikely) for a cosmic ray to damage your circuits, in which case you could go right—but you would then be insane, which would have many other bad consequences.

If *this reasoning in itself* is why you always go left, you’ve gone wrong.

So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions—a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.

Maybe we can force exploration actions, so that we learn what happens when we do things? This proposal runs into two problems:

- A bad prior can think that exploring is dangerous.
- Forcing it to take exploratory actions doesn’t teach it what the world would look like if it took those actions deliberately.

But writing down examples of “correct” counterfactual reasoning doesn’t seem hard from the outside!

Maybe that’s because from “outside” we always have a dualistic perspective. We are in fact sitting outside of the problem, and we’ve defined it as a function of an agent.

However, an agent can’t solve the problem in the same way from inside. From its perspective, its functional relationship with the environment isn’t an observable fact. This is why “counterfactuals” are called what they are called, after all.

When I told you about the 5 and 10 problem, I first told you about the problem, and then gave you an agent. When one agent doesn’t work well, we could consider a different agent.

Finding a way to succeed at a decision problem involves finding an agent that when plugged into the problem takes the right action. The fact that we can even consider putting in different agents means that we have already carved the universe into an “agent” part, plus the rest of the universe with a hole for the agent—which is most of the work!

Are we just fooling ourselves due to the way we set up decision problems, then? Are there no “correct” counterfactuals?

Well, maybe we *are* fooling ourselves. But there is still something we are confused about! “Counterfactuals are subjective, invented by the agent” doesn’t dissolve the mystery. There is *something* intelligent agents do, in the real world, to make decisions.

**Updateless decision theory** (UDT) views the problem from “closer to the outside”. It does this by picking the action which the agent would have wanted to commit to *before* getting into the situation.

Consider the following game: Alice receives a card at random which is either High or Low. She may reveal the card if she wishes. Bob then gives his probability \(p\) that Alice has a high card. Alice always loses \(p^2\) dollars. Bob loses \(p^2\) if the card is low, and \((1-p)^2\) if the card is high.

Bob has a proper scoring rule, so does best by giving his true belief. Alice just wants Bob’s belief to be as much toward “low” as possible.

Suppose Alice will play only this one time. She sees a low card. Bob is good at reasoning about Alice, but is in the next room and so can’t read any tells. Should Alice reveal her card?

Since Alice’s card is low, if she shows it to Bob, she will lose no money, which is the best possible outcome. However, this means that in the counterfactual world where Alice sees a high card, she wouldn’t be able to keep the secret—she might as well show her card in that case too, since her reluctance to show it would be as reliable a sign of “high”.

On the other hand, if Alice doesn’t show her card, she loses 25¢—but then she can use the same strategy in the other world, rather than losing $1. So, before playing the game, Alice would want to visibly commit to not reveal; this makes expected loss 25¢, whereas the other strategy has expected loss 50¢.

This game is equivalent to the decision problem called counterfactual mugging. UDT solves such problems by recommending that the agent do whatever would have seemed wisest before—whatever your earlier self would have committed to do.

UDT is an elegant solution to a fairly broad class of decision problems. However, it only makes sense if the earlier self can foresee all possible situations.

This works fine in a Bayesian setting where the prior already contains all possibilities within itself. However, there may be no way to do this in a realistic embedded setting. An agent has to be able to think of *new possibilities*—meaning that its earlier self doesn’t know enough to make all the decisions.

And with that, we find ourselves squarely facing the problem of **embedded world-models**.

This is part of Abram Demski and Scott Garrabrant’s Embedded Agency sequence. **Continued here!**

**Did you like this post?** You may enjoy our other Analysis posts, including: