Because the world is big, the agent as it is may be inadequate to accomplish its goals, including in its ability to think.
Because the agent is made of parts, it can improve itself and become more capable.
Improvements can take many forms: The agent can make tools, the agent can make successor agents, or the agent can just learn and grow over time. However, the successors or tools need to be more capable for this to be worthwhile.
This gives rise to a special type of principal/agent problem:
You have an initial agent, and a successor agent. The initial agent gets to decide exactly what the successor agent looks like. The successor agent, however, is much more intelligent and powerful than the initial agent. We want to know how to have the successor agent robustly optimize the initial agent’s goals.
The problem is not (just) that the successor agent might be malicious. The problem is that we don’t even know what it means not to be.
This problem seems hard from both points of view.
The initial agent needs to judge the reliability and trustworthiness of something more powerful than itself, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.
To an embedded agent, the future self is not privileged; it is just another part of the environment. There isn’t a deep difference between building a successor that shares your goals, and just making sure your own goals stay the same over time.
So, although I talk about “initial” and “successor” agents, remember that this isn’t just about the narrow problem humans currently face of aiming a successor. This is about the fundamental problem of being an agent that persists and learns over time.
We call this cluster of problems Robust Delegation. Examples discussed below include Vingean reflection, value loading, Goodhart’s Law, and wireheading.
Imagine you are playing the CIRL game with a toddler.
CIRL means Cooperative Inverse Reinforcement Learning. The idea behind CIRL is to define what it means for a robot to collaborate with a human. The robot tries to pick helpful actions, while simultaneously trying to figure out what the human wants.
Usually, we think about this from the point of view of the human. But now consider the problem faced by the robot, which is trying to help someone who is very confused about the universe. Imagine trying to help a toddler optimize their goals.
- From your standpoint, the toddler may be too irrational to be seen as optimizing anything.
- The toddler may have an ontology in which it is optimizing something, but you can see that ontology doesn’t make sense.
- Maybe you notice that if you set up questions in the right way, you can make the toddler seem to want almost anything.
Part of the problem is that the “helping” agent has to be bigger in some sense in order to be more capable; but this seems to imply that the “helped” agent can’t be a very good supervisor for the “helper”.
For example, updateless decision theory eliminates dynamic inconsistencies in decision theory: rather than maximizing the expected utility of your action given what you know, it maximizes the expected utility of reactions to observations, from a state of ignorance.
Appealing as this may be as a way to achieve reflective consistency, it creates a strange situation in terms of computational complexity: If actions are type \(A\), and observations are type \(O\), reactions to observations are type \(O \to A\)—a much larger space to optimize over than \(A\) alone. And we’re expecting our smaller self to be able to do that!
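To make the size gap concrete, here is a minimal sketch that counts deterministic policies of type \(O \to A\) for finite sets. The function names and the toy action and observation labels are illustrative assumptions, not part of any particular decision-theory formalism.

```python
from itertools import product


def count_policies(num_actions: int, num_observations: int) -> int:
    """Number of deterministic policies O -> A: one action chosen per observation."""
    return num_actions ** num_observations


def enumerate_policies(actions, observations):
    """Yield every policy as a dict mapping each observation to an action."""
    for choice in product(actions, repeat=len(observations)):
        yield dict(zip(observations, choice))


# Hypothetical toy sets, chosen only for illustration.
actions = ["stay", "go"]
observations = ["red", "yellow", "green"]

policies = list(enumerate_policies(actions, observations))
print(f"{len(actions)} actions, {len(policies)} policies")

# The policy space grows exponentially in the number of observations.
print(count_policies(2, 100))
```

Choosing an action means searching over \(|A|\) options; choosing a policy means searching over \(|A|^{|O|}\) of them, which is already astronomically large for two actions and a hundred possible observations.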
This seems bad.
One way to more crisply state the problem is: We should be able to trust that our future self is applying its intelligence to the pursuit of our goals without being able to predict precisely what our future self will do. This criterion is called Vingean reflection.
For example, you might plan your driving route before visiting a new city, but you do not plan your steps. You plan to some level of detail, and trust that your future self can figure out the rest.
Vingean reflection is difficult to examine via classical Bayesian decision theory because Bayesian decision theory assumes logical omniscience. Given logical omniscience, the assumption “the agent knows its future actions are rational” is synonymous with the assumption “the agent knows its future self will act according to one particular optimal policy which the agent can predict in advance”.
We have some limited models of Vingean reflection (see “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle” by Yudkowsky and Herreshoff). A successful approach must walk the narrow line between two problems:
- The Löbian Obstacle: Agents who trust their future self because they trust the output of their own reasoning are inconsistent.
- The Procrastination Paradox: Agents who trust their future selves without reason tend to be consistent but unsound and untrustworthy, and will put off tasks forever because they can do them later.
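As a brief reminder of why the first horn bites, stated in standard notation rather than taken from the text above: Löb’s theorem says that, for a theory \(T\) extending Peano arithmetic with provability predicate \(\Box_T\),

\[ T \vdash (\Box_T \varphi \to \varphi) \quad \Longrightarrow \quad T \vdash \varphi . \]

So if \(T\) proved \(\Box_T \varphi \to \varphi\) for every sentence \(\varphi\), which is what unrestricted trust in its own proofs would amount to, it would prove every \(\varphi\), including \(\bot\).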
The Vingean reflection results so far apply only to limited sorts of decision procedures, such as satisficers aiming for a threshold of acceptability. So there is plenty of room for improvement: getting tiling results for more useful decision procedures, and under weaker assumptions.
When you construct another agent, rather than delegating to your future self, you more directly face a problem of value loading.
The main problems here:
- We don’t know what we want.
- Optimization amplifies slight differences between what we say we want and what we really want.
The misspecification-amplifying effect is known as Goodhart’s Law, named for Charles Goodhart’s observation: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
When we specify a target for optimization, it is reasonable to expect it to be correlated with what we want—highly correlated, in some cases. Unfortunately, however, this does not mean that optimizing it will get us closer to what we want—especially at high levels of optimization.
There are (at least) four types of Goodhart: regressional, causal, extremal, and adversarial.
Regressional Goodhart happens when there is a less than perfect correlation between the proxy and the goal. It is more commonly known as the optimizer’s curse, and it is related to regression to the mean.
An unbiased estimate of \(Y\) given \(X\) is not an unbiased estimate of \(Y\) when we select for the best \(X\). In that sense, we can expect to be disappointed when we use \(X\) as a proxy for \(Y\) for optimization purposes.
Using a Bayes estimate instead of an unbiased estimate, we can eliminate this sort of predictable disappointment.
This doesn’t necessarily allow us to get a better \(Y\) value, since we still only have the information content of \(X\) to work with. However, it sometimes may. If \(Y\) is normally distributed with variance \(1\), and \(X\) is \(Y \pm 10\) with even odds of \(+\) or \(-\), a Bayes estimate will give better optimization results by almost entirely removing the noise.
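The \(Y \pm 10\) example can be checked numerically. The sketch below assumes \(Y\) has mean \(0\) (the text only fixes the variance), in which case the posterior mean works out to \(E[Y \mid X = x] = x - 10\tanh(10x)\). The candidate count, trial count, and seed are arbitrary choices for illustration.

```python
import math
import random

NOISE = 10.0  # size of the +/- offset in X = Y +/- 10


def bayes_estimate(x: float) -> float:
    """Posterior mean of Y given X = x, assuming Y ~ N(0, 1) and X = Y +/- NOISE."""
    # The posterior odds of "+NOISE" versus "-NOISE" are exp(2 * NOISE * x),
    # so P(+) - P(-) = tanh(NOISE * x) and the posterior mean simplifies to:
    return x - NOISE * math.tanh(NOISE * x)


def run_trial(rng: random.Random, n_candidates: int):
    """Draw candidates, then select once by raw proxy and once by Bayes estimate."""
    ys = [rng.gauss(0.0, 1.0) for _ in range(n_candidates)]
    xs = [y + rng.choice((-NOISE, NOISE)) for y in ys]
    by_proxy = max(range(n_candidates), key=lambda i: xs[i])
    by_bayes = max(range(n_candidates), key=lambda i: bayes_estimate(xs[i]))
    return xs[by_proxy], ys[by_proxy], bayes_estimate(xs[by_bayes]), ys[by_bayes]


def simulate(n_trials: int = 2000, n_candidates: int = 20, seed: int = 0):
    """Average predicted versus realized Y under both selection rules."""
    rng = random.Random(seed)
    rows = [run_trial(rng, n_candidates) for _ in range(n_trials)]
    names = ("proxy_predicted", "proxy_actual", "bayes_predicted", "bayes_actual")
    return {name: sum(r[k] for r in rows) / n_trials for k, name in enumerate(names)}


results = simulate()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

Selecting on raw \(X\) mostly picks candidates whose noise happened to be \(+10\), so treating \(X\) as an estimate of \(Y\) overshoots by about ten. Selecting on the Bayes estimate removes that predictable disappointment, and in this particular setup it also finds a better \(Y\), because it can consider the candidates whose noise was \(-10\).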
Causal Goodhart happens when you observe a correlation between proxy and goal, but when you intervene to increase the proxy, you fail to increase the goal because the observed correlation was not causal in the right way. Teasing correlation apart from causation is run-of-the-mill counterfactual reasoning.
In extremal Goodhart, optimization pushes you outside the range where the correlation exists, into portions of the distribution which behave very differently. This is especially scary because it tends to have phase shifts. You might not be able to observe the proxy breaking down at all when you have weak optimization, but once the optimization becomes strong enough, you can enter a very different domain.
Extremal Goodhart is similar to regressional Goodhart, but we can’t correct it with Bayes estimators if we don’t have the right model—otherwise, there seems to be no reason why the Bayes estimator itself should not be susceptible to extremal Goodhart.
If you have a probability distribution \(Q(y)\) such that the proxy \(X\) is only a boundedly bad approximation of \(Y\) on average, quantilization avoids extremal Goodhart by selecting randomly from \(Q(y|x \geq c)\) for some threshold \(c\). If we pick a threshold that is high but not extreme, we can hope that the risk of selecting from outliers with very different behavior will be small, and that \(Y\) is likely to be large.
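A quantilizer can be sketched in a few lines. This is a toy version under stated assumptions: the base distribution is represented by a finite list of candidates, the threshold \(c\) is set implicitly by keeping the top fraction `q` of candidates by proxy, and the `proxy` and `true_value` functions are hypothetical, built so that the proxy breaks down only at the extreme.

```python
import math
import random


def quantilize(candidates, proxy, q, rng):
    """Pick uniformly at random from the top-q fraction of candidates by proxy."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=proxy, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


def proxy(action: int) -> float:
    """Hypothetical proxy: higher action index looks better."""
    return float(action)


def true_value(action: int) -> float:
    """Hypothetical goal: tracks the proxy, except in an extreme tail where it collapses."""
    return -1000.0 if action >= 995 else float(action)


rng = random.Random(0)
candidates = list(range(1000))  # stand-in for samples from the base distribution

argmax_pick = max(candidates, key=proxy)
quantilized_picks = [quantilize(candidates, proxy, 0.1, rng) for _ in range(1000)]
mean_quantilized = sum(true_value(a) for a in quantilized_picks) / len(quantilized_picks)

print("argmax true value:", true_value(argmax_pick))
print("mean quantilized true value:", mean_quantilized)
```

The pure maximizer lands in the tail where the proxy and the goal come apart. The quantilizer still lands there occasionally, but only in proportion to how much of the top fraction the tail occupies, which is the sense in which the risk is bounded.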
This is helpful, but unlike Bayes estimators for regressional Goodhart, doesn’t necessarily seem like the end of the story. Maybe we can do better.
Finally, there is adversarial Goodhart, in which agents actively make our proxy worse by intelligently manipulating it. This is even harder to observe at low levels of optimization, both because the adversaries won’t want to start manipulating until after test time is over, and because adversaries that come from the system’s own optimization won’t show up until the optimization is powerful enough.
These different types of Goodhart effects work in very different ways, and, roughly speaking, they tend to start appearing at successively higher levels of optimization power—so be careful not to think you’ve conquered Goodhart’s law because you’ve solved some of them.
Unfortunately, specifying what we want precisely is hard; so can the AI system we’re building help us with this? More generally, can a successor agent help its predecessor solve this? Maybe it can use its intellectual advantages to figure out what we want?
AIXI learns what to do through a reward signal which it gets from the environment. We can imagine humans have a button which they press when AIXI does something they like.
The problem with this is that AIXI will apply its intelligence to the problem of taking control of the reward button. This is the problem of wireheading.
Maybe we build the reward button into the agent, as a black box which issues rewards based on what is going on. The box could be an intelligent sub-agent in its own right, which figures out what rewards humans would want to give. The box could even defend itself by issuing punishments for actions aimed at modifying the box.
In the end, though, if the agent understands the situation, it will be motivated to take control anyway.
There is a critical distinction between optimizing “\(U()\)” in quotation marks and optimizing \(U()\) directly. If the agent is coming up with plans to try to achieve a high output of the box, and it incorporates into its planning its uncertainty regarding the output of the box, then it will want to hack the box. However, if you run the expected outcomes of plans through the actual box, then plans to hack the box are evaluated by the current box, so they don’t look particularly appealing.
Daniel Dewey calls the second sort of agent an observation-utility maximizer. (Others have included observation-utility agents within a more general notion of reinforcement learning.)
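The difference between the two kinds of agent can be shown with a toy model. Everything here is a hypothetical illustration rather than Dewey’s formalism: the `Outcome` fields, the two plans, and the payoff numbers are made up, and the “box” is just a Python function.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Outcome:
    tasks_done: int   # what actually happened in the world
    box_hacked: bool  # whether the plan tampered with the reward box


def current_utility(outcome: Outcome) -> float:
    """The box as it exists now: it rewards real work."""
    return float(outcome.tasks_done)


def box_after(outcome: Outcome) -> Callable[[Outcome], float]:
    """The box as it will exist once the plan has run."""
    if outcome.box_hacked:
        return lambda _outcome: 100.0  # a hacked box always emits a high reward
    return current_utility


PLANS: Dict[str, Outcome] = {
    "do_tasks": Outcome(tasks_done=3, box_hacked=False),
    "hack_box": Outcome(tasks_done=0, box_hacked=True),
}


def reward_maximizer_choice(plans: Dict[str, Outcome]) -> str:
    """Scores each plan by the signal the (possibly modified) future box will emit."""
    return max(plans, key=lambda name: box_after(plans[name])(plans[name]))


def observation_utility_choice(plans: Dict[str, Outcome]) -> str:
    """Scores each plan's predicted outcome with the current, unmodified box."""
    return max(plans, key=lambda name: current_utility(plans[name]))


print("reward maximizer picks:", reward_maximizer_choice(PLANS))
print("observation-utility agent picks:", observation_utility_choice(PLANS))
```

The only change between the two agents is which box does the scoring. When the current box evaluates the hacking plan, it sees an outcome in which no tasks were done, so the plan looks unappealing.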
I find it very interesting how you can try all sorts of things to stop an RL agent from wireheading, but the agent keeps working against them. Then, you make the shift to observation-utility agents and the problem vanishes.
It seems like the indirection itself is the problem. RL agents maximize the output of the box; observation-utility agents maximize \(U()\). So the challenge is to create stable pointers to what we value: a notion of “indirection” which serves to point at values not directly available to be optimized.
Observation-utility agents solve the classic wireheading problem, but we still have the problem of specifying \(U()\). So we add a level of indirection back in: we represent our uncertainty over \(U()\), and try to learn. Daniel Dewey doesn’t provide any suggestion for how to do this, but CIRL is one example.
Unfortunately, the wireheading problem can come back in even worse fashion. For example, if there is a drug which modifies human preferences to only care about using the drug, a CIRL agent could be highly motivated to give humans that drug in order to make its job easier. This is called the human manipulation problem.
The lesson I want to draw from this comes from “Reinforcement Learning with a Corrupted Reward Channel” (by Tom Everitt et al.): the way you set up the feedback loop makes a huge difference.
They draw the following picture:
- In Standard RL, the feedback about the value of a state comes from the state itself, so corrupt states can be “self-aggrandizing”.
- In Decoupled RL, the feedback about the quality of a state comes from some other state, making it possible to learn correct values even when some feedback is corrupt.
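The contrast can be illustrated with a small toy model. It is loosely inspired by the distinction above and is not the paper’s formal setup: the state names, the values, the corruption behavior, and the use of a median to aggregate reports are all assumptions made for the sketch.

```python
from statistics import median

# Hypothetical true values of four states; state "c" is corrupt.
TRUE_VALUE = {"a": 1.0, "b": 2.0, "c": 0.0, "d": 0.5}
CORRUPT = {"c"}


def report(observer: str, target: str) -> float:
    """Feedback received while in `observer` about the value of `target`."""
    if observer in CORRUPT:
        # A corrupt state inflates itself and misreports everything else.
        return 100.0 if target == observer else -100.0
    return TRUE_VALUE[target]


def standard_estimates():
    """Standard RL: each state's value is learned from that state's own feedback."""
    return {s: report(s, s) for s in TRUE_VALUE}


def decoupled_estimates():
    """Decoupled RL: each state's value is learned from feedback given in other states."""
    return {
        target: median(report(obs, target) for obs in TRUE_VALUE if obs != target)
        for target in TRUE_VALUE
    }


standard = standard_estimates()
decoupled = decoupled_estimates()
print("standard RL prefers:", max(standard, key=standard.get))
print("decoupled RL prefers:", max(decoupled, key=decoupled.get))
```

In the standard setup the corrupt state vouches for itself and wins. In the decoupled setup its self-report is never consulted, and its false reports about other states are outvoted as long as most observers are honest.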
In some sense, the challenge is to put the original, small agent in the feedback loop in the right way. However, the problems with updateless reasoning mentioned earlier make this hard; the original agent doesn’t know enough.
One way to try to address this is through intelligence amplification: try to turn the original agent into a more capable one with the same values, rather than creating a successor agent from scratch and trying to get value loading right.
For example, Paul Christiano proposes an approach in which the small agent is simulated many times in a large tree, which can perform complex computations by splitting problems into parts.
However, this is still fairly demanding for the small agent: it doesn’t just need to know how to break problems down into more tractable pieces; it also needs to know how to do so without giving rise to malign subcomputations.
For example, since the small agent can use the copies of itself to get a lot of computational power, it could easily try to use a brute-force search for solutions that ends up running afoul of Goodhart’s Law.
This issue is the subject of the next section: subsystem alignment.