Scott Garrabrant is taking over Nate Soares’ job of making predictions about how much progress we’ll make in different research areas this year. Scott divides MIRI’s alignment research into five categories:
naturalized world-models — Problems related to modeling large, complex physical environments that lack a sharp agent/environment boundary. Central examples of problems in this category include logical uncertainty, naturalized induction, multi-level world models, and ontological crises.
Introductory resources: “Formalizing Two Problems of Realistic World-Models,” “Questions of Reasoning Under Logical Uncertainty,” “Logical Induction,” “Reflective Oracles”
Examples of recent work: “Hyperreal Brouwer,” “An Untrollable Mathematician,” “Further Progress on a Bayesian Version of Logical Uncertainty”
decision theory — Problems related to modeling the consequences of different (actual and counterfactual) decision outputs, so that the decision-maker can choose the output with the best consequences. Central problems include counterfactuals, updatelessness, coordination, extortion, and reflective stability.
Introductory resources: “Cheating Death in Damascus,” “Decisions Are For Making Bad Outcomes Inconsistent,” “Functional Decision Theory”
Examples of recent work: “Cooperative Oracles,” “Smoking Lesion Steelman” (1, 2), “The Happy Dance Problem,” “Reflective Oracles as a Solution to the Converse Lawvere Problem”
robust delegation — Problems related to building highly capable agents that can be trusted to carry out some task on one’s behalf. Central problems include corrigibility, value learning, informed oversight, and Vingean reflection.
Introductory resources: “The Value Learning Problem,” “Corrigibility,” “Problem of Fully Updated Deference,” “Vingean Reflection,” “Using Machine Learning to Address AI Risk”
Examples of recent work: “Categorizing Variants of Goodhart’s Law,” “Stable Pointers to Value”
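One variant of Goodhart’s law (the “regressional” case) is easy to demonstrate numerically: when an overseer selects hard on a noisy proxy for the true objective, the proxy scores of the selected items systematically overstate their true value. This simulation is our own illustrative sketch, not taken from the paper above:

```python
import random
import statistics

random.seed(0)

# Regressional Goodhart: proxy = true value + independent noise.
# Selecting the top items by proxy score picks up both high true
# value and high noise, so the proxy overstates the selected items' value.
N = 100_000
true_vals = [random.gauss(0, 1) for _ in range(N)]
proxies = [v + random.gauss(0, 1) for v in true_vals]

# Select the top 1% by proxy score.
top = sorted(range(N), key=lambda i: proxies[i], reverse=True)[: N // 100]
mean_proxy = statistics.mean(proxies[i] for i in top)
mean_true = statistics.mean(true_vals[i] for i in top)
# mean_proxy substantially exceeds mean_true on the selected set.
```

The gap between `mean_proxy` and `mean_true` grows as selection pressure increases, which is one reason pointing a highly capable optimizer at an imperfect proxy for our values is dangerous.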
subsystem alignment — Problems related to ensuring that an AI system’s subsystems are not working at cross purposes, and in particular that the system avoids creating internal subprocesses that optimize for unintended goals. A central problem in this category is benign induction.
Introductory resources: “What Does the Universal Prior Actually Look Like?”, “Optimization Daemons,” “Modeling Distant Superintelligences”
Examples of recent work: “Some Problems with Making Induction Benign”
other — Alignment research that doesn’t fall into the above categories. If we make progress on the open problems described in “Alignment for Advanced ML Systems,” and the progress is less connected to our agent foundations work and more ML-oriented, then we’ll likely classify it here.