Subsystem Alignment

 |   |  Analysis

Emmy the embedded agent


You want to figure something out, but you don’t know how to do that yet.

You have to somehow break up the task into sub-computations. There is no atomic act of “thinking”; intelligence must be built up of non-intelligent parts.

The agent being made of parts is part of what made counterfactuals hard, since the agent may have to reason about impossible configurations of those parts.

Being made of parts is what makes self-reasoning and self-modification even possible.

What we’re primarily going to discuss in this section, though, is another problem: when the agent is made of parts, there could be adversaries not just in the external environment, but inside the agent as well.

This cluster of problems is Subsystem Alignment: ensuring that subsystems are not working at cross purposes; avoiding subprocesses optimizing for unintended goals.


  • benign induction
  • benign optimization
  • transparency
  • inner optimizers


Here’s a straw agent design:


A straw agent with epistemic and instrumental subsystems


The epistemic subsystem just wants accurate beliefs. The instrumental subsystem uses those beliefs to track how well it is doing. If the instrumental subsystem gets too capable relative to the epistemic subsystem, it may decide to try to fool the epistemic subsystem, as depicted.

If the epistemic subsystem gets too strong, that could also possibly yield bad outcomes.

This agent design treats the system’s epistemic and instrumental subsystems as discrete agents with goals of their own, which is not particularly realistic. However, we saw in the section on wireheading that the problem of subsystems working at cross purposes is hard to avoid. And this is a harder problem if we didn’t intentionally build the relevant subsystems.

One reason to avoid booting up sub-agents who want different things is that we want robustness to relative scale.

An approach is robust to scale if it still works, or fails gracefully, as you scale capabilities. There are three types: robustness to scaling up; robustness to scaling down; and robustness to relative scale.


  • Robustness to scaling up means that your system doesn’t stop behaving well if it gets better at optimizing. One way to check this is to think about what would happen if the function the AI optimizes were actually maximized. Think Goodhart’s Law.


  • Robustness to scaling down means that your system still works if made less powerful. Of course, it may stop being useful; but it should fail safely and without unnecessary costs.


    Your system might work if it can exactly maximize some function, but is it safe if you approximate? For example, maybe a system is safe if it can learn human values very precisely, but approximation makes it increasingly misaligned.


  • Robustness to relative scale means that your design does not rely on the agent’s subsystems being similarly powerful. For example, GAN (Generative Adversarial Network) training can fail if one sub-network gets too strong, because there’s no longer any training signal.

GAN training

Lack of robustness to scale isn’t necessarily something which kills a proposal, but it is something to be aware of; lacking robustness to scale, you need strong reason to think you’re at the right scale.

Robustness to relative scale is particularly important for subsystem alignment. An agent with intelligent sub-parts should not rely on being able to outsmart them, unless we have a strong account of why this is always possible.

The big-picture moral: aim to have a unified system that doesn’t work at cross purposes to itself.

Why would anyone make an agent with parts fighting against one another? There are three obvious reasons: subgoals, pointers, and search.

Splitting up a task into subgoals may be the only way to efficiently find a solution. However, a subgoal computation shouldn’t completely forget the big picture!

An agent designed to build houses should not boot up a sub-agent who cares only about building stairs.

One intuitive desideratum is that although subsystems need to have their own goals in order to decompose problems into parts, the subgoals need to “point back” robustly to the main goal.

A house-building agent might spin up a subsystem that cares only about stairs, but only cares about stairs in the context of houses.

However, you need to do this in a way that doesn’t just amount to your house-building system having a second house-building system inside its head. This brings me to the next item:

Pointers: It may be difficult for subsystems to carry the whole-system goal around with them, since they need to be reducing the problem. However, this kind of indirection seems to encourage situations in which different subsystems’ incentives are misaligned.

As we saw in the example of the epistemic and instrumental subsystems, as soon as we start optimizing some sort of expectation, rather than directly getting feedback about what we’re doing on the metric that’s actually important, we may create perverse incentives—that’s Goodhart’s Law.

How do we ask a subsystem to “do X” as opposed to “convince the wider system that I’m doing X”, without passing along the entire overarching goal-system?

This is similar to the way we wanted successor agents to robustly point at values, since it is too hard to write values down. However, in this case, learning the values of the larger agent wouldn’t make any sense either; subsystems and subgoals need to be smaller.

It might not be that difficult to solve subsystem alignment for subsystems which humans entirely design, or subgoals which an AI explicitly spins up. If you know how to avoid misalignment by design and robustly delegate your goals, both problems seem solvable.

However, it doesn’t seem possible to design all subsystems so explicitly. At some point in solving a problem, you’ve split it up as much as you know how to and must rely on some trial and error.

This brings us to the third reason subsystems might be optimizing different things, search: solving a problem by looking through a rich space of possibilities, a space which may itself contain misaligned subsystems.

ML researchers are quite familiar with the phenomenon: it’s easier to write a program which finds a high-performance machine translation system for you than to directly write one yourself.

In the long run, this process can go one step further. For a rich enough problem and an impressive enough search process, the solutions found via search might themselves be intelligently optimizing something. This problem is described in Hubinger, et al.’s forthcoming “The Inner Alignment Problem”.

Let’s call the outer search process an “outer optimizer”, and the inner search process an “inner optimizer”.

“Optimization” and “search” are ambiguous terms. I’ll think of them as any algorithm which can be naturally interpreted as doing significant computational work to “find” an object that scores highly on some objective function.

The objective function of the outer optimizer is not necessarily the same as that of the inner optimizer. If the outer optimizer wants to make pizza, the inner optimizer may enjoy kneading dough, chopping ingredients, et cetera.

The inner objective function must be helpful for the outer, at least in the examples the outer optimizer is checking. Otherwise, the inner optimizer would not have been selected.

However, the inner optimizer must reduce the problem somehow; there is no point to it running the exact same search. So it seems like its objectives will tend to be like good heuristics; easier to optimize, but different from the outer objective in general.

Why might a difference between inner and outer objectives be concerning, if the inner optimizer is scoring highly on the outer objective anyway? It’s about the interplay with what’s really wanted. Even if we get value specification exactly right, there will always be some distributional shift between the training set and deployment. (See Amodei, et al.’s “Concrete Problems in AI Safety”.)

Distributional shifts which would be small in ordinary cases may make a big difference to a capable inner optimizer, which may observe the slight difference and figure out how to capitalize on it for its own objective.

Actually, to even use the term “distributional shift” seems wrong in the context of embedded agency. The world is not i.i.d. The analog of “no distributional shift” would be to have an exact model of the whole future relevant to what you want to optimize, and the ability to run it over and over during training. So we need to deal with massive “distributional shift”.

We may also want to optimize for things that aren’t exactly what we want. The obvious way to avoid agents that pursue subgoals at the cost of the overall goal is to have the subsystems not be agentic. Just search over a bunch of ways to make stairs, don’t make something that cares about stairs. The problem is then that powerful inner optimizers are optimizing something the outer system doesn’t care about, and that the inner optimizers will have a convergent incentive to be agentic.

Additionally, there’s the possibility that the inner optimizer becomes aware of the outer optimizer, in which case it might start explicitly trying to do well on the outer objective function in order to be kept around, while looking for any signs that it has left training and can stop pretending.

This is the same story we saw in adversarial Goodhart: there is something agentic in the search space, which responds to our choice of proxy in a way which makes our proxy a bad one.

If intelligent inner optimizers developing in deep neural network training seems too hypothetical, consider the evolution of life on Earth. Evolution can be thought of as a reproductive fitness maximizer.

(Evolution can actually be thought of as an optimizer for many things, or as no optimizer at all, but that doesn’t matter. The point is that if an agent wanted to maximize reproductive fitness, it might use a system that looked like evolution.)

Intelligent organisms are inner optimizers of evolution. Although the drives of intelligent organisms are certainly correlated with reproductive fitness, organisms want all sorts of things. There are even inner optimizers who have come to understand evolution, and even to manipulate it at times. Powerful and misaligned inner optimizers appear to be a real possibility, then, at least with enough processing power.

Problems seem to arise because you try to solve a problem which you don’t yet know how to solve by searching over a large space and hoping “someone” can solve it.

If the source of the issue is the solution of problems by massive search, perhaps we should look for different ways to solve problems. Perhaps we should solve problems by figuring things out. But how do you solve problems which you don’t yet know how to solve other than by trying things?

Let’s take a step back.

Embedded world-models is about how to think at all, as an embedded agent; decision theory is about how to act. Robust delegation is about building trustworthy successors and helpers. Subsystem alignment is about building one agent out of trustworthy parts.


Embedded agency


The problem is that:

  • We don’t know how to think about environments when we’re smaller.
  • To the extent we can do that, we don’t know how to think about consequences of actions in those environments.
  • Even when we can do that, we don’t know how to think about what we want.
  • Even when we have none of these problems, we don’t know how to reliably output actions which get us what we want!

This is the penultimate post in Scott Garrabrant and Abram Demski’s Embedded Agency sequence. Conclusion: embedded curiosities.