Subsystem Alignment

  You want to figure something out, but you don't know how to do that yet. You have to somehow break up the task into sub-computations. There is no atomic act of "thinking"; intelligence must be built up of non-intelligent parts. The agent being made of parts is part of what made counterfactuals hard, since…

Embedded World-Models

  An agent which is larger than its environment can: hold an exact model of the environment in its head; think through the consequences of every potential course of action; and, if it doesn't know the environment perfectly, hold every possible way the environment could be in its head, as is the case with Bayesian…

Embedded Agents

  Suppose you want to build a robot to achieve some real-world goal for you, a goal that requires the robot to learn for itself and figure out a lot of things that you don't already know. There's a complicated engineering problem here. But there's also a problem of figuring out what it even means to…

New paper: “Categorizing variants of Goodhart’s Law”

Goodhart's Law states that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." However, this is not a single phenomenon. In Goodhart Taxonomy, I proposed that there are (at least) four different mechanisms through which proxy measures break when you optimize for them: Regressional, Extremal, Causal, and…