Embedded Agency is a write-up by Abram Demski and Scott Garrabrant, available on the AI Alignment Forum here. There’s also a shorter version of the post as a hand-drawn sequence, and a lightly rewritten version on arXiv.
Embedded Agency was first released in 2018, with the arXiv version following in early 2019. In August 2020, Demski and Garrabrant substantially updated all versions.
We’ve included links and references below, listed in the order in which they come up in the relevant section.
General
( Text Introduction — Illustrated Introduction — MIRI Blog Afterword — LessWrong Afterword )
- Marcus Hutter. 2012. “One Decade of Universal Artificial Intelligence.” In Theoretical Foundations of Artificial General Intelligence 4.
- Nate Soares. 2017. “Ensuring Smarter-Than-Human Intelligence Has A Positive Outcome.” MIRI Blog.
- Eliezer Yudkowsky. 2018. “The Rocket Alignment Problem.” MIRI Blog.
Decision Theory
( Text Version — Illustrated Version )
- Eliezer Yudkowsky and Nate Soares. 2017. “Functional Decision Theory: A New Theory of Instrumental Rationality.” arXiv:1710.05060 [cs.AI].
- Scott Garrabrant. 2017. “Two Major Obstacles for Logical Inductor Decision Theory.” Intelligent Agent Foundations Forum.
- Patrick LaVictoire. 2015. An Introduction to Löb’s Theorem in MIRI Research. MIRI technical report 2015–6.
- Rob Bensinger. 2017. “Decisions Are For Making Bad Outcomes Inconsistent.” MIRI Blog.
- Wei Dai. 2009. “Towards a New Decision Theory.” Less Wrong.
- Vladimir Nesov. 2009. “Counterfactual Mugging.” Less Wrong.
Embedded World-Models
( Text Version — Illustrated Version )
- Abram Demski. 2018. “Toward a New Technical Explanation of Technical Explanation.” Less Wrong.
- Nate Soares. 2015. Formalizing Two Problems of Realistic World-Models. MIRI technical report 2015–3.
- Jan Leike. 2016. Nonparametric General Reinforcement Learning. PhD thesis, Australian National University.
- Laurent Orseau and Mark Ring. 2012. “Space-Time Embedded Intelligence.” In Artificial General Intelligence, 5th International Conference. Springer.
- Benja Fallenstein, Jessica Taylor, and Paul Christiano. 2015. “Reflective Oracles: A Foundation for Classical Game Theory.” arXiv:1508.04145 [cs.AI].
- Jan Leike, Jessica Taylor, and Benja Fallenstein. 2016. “A Formal Solution to the Grain of Truth Problem.” Paper presented at the 32nd Conference on Uncertainty in Artificial Intelligence.
- Nate Soares and Benja Fallenstein. 2015. Questions of Reasoning under Logical Uncertainty. MIRI technical report 2015–1.
- Abram Demski. 2018. “An Untrollable Mathematician Illustrated.” Less Wrong.
- Eliezer Yudkowsky. 2017. “Coherent Decisions Imply Consistent Utilities.” Arbital.
- Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, Nate Soares, and Jessica Taylor. 2016. “Logical Induction.” arXiv:1609.03543 [cs.AI].
- Eliezer Yudkowsky. 2015. “Ontology Identification.” Arbital.
- Peter de Blanc. 2011. “Ontological Crises in Artificial Agents’ Value Systems.” arXiv:1105.3821 [cs.AI].
- Caspar Oesterheld. 2017. “Naturalized Induction – A Challenge for Evidential and Causal Decision Theory.” Less Wrong.
- Rob Bensinger. 2013. “Building Phenomenological Bridges.” Less Wrong.
- Thomas Nagel. 1986. The View from Nowhere. Oxford University Press.
Robust Delegation
( Text Version — Illustrated Version )
- Stuart Armstrong and Sören Mindermann. 2017. “Occam’s Razor is Insufficient to Infer the Preferences of Irrational Agents.” arXiv:1712.05812 [cs.AI].
- Benja Fallenstein and Nate Soares. 2015. Vingean Reflection: Reliable Reasoning for Self-Improving Agents. MIRI technical report 2015–2.
- Eliezer Yudkowsky and Marcello Herreshoff. 2013. “Tiling Agents for Self-Modifying AI, and the Löbian Obstacle.” Draft.
- David Manheim and Scott Garrabrant. 2018. “Categorizing Variants of Goodhart’s Law.” arXiv:1803.04585 [cs.AI].
- Nate Soares. 2015/2018. “The Value Learning Problem.” In Artificial Intelligence Safety and Security. Chapman and Hall.
- Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. 2014/2015. “Corrigibility.” Paper presented at the AAAI 2015 Ethics and Artificial Intelligence Workshop.
- Paul Christiano. 2016. “The Informed Oversight Problem.” AI Alignment.
- Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. 2016. “Cooperative Inverse Reinforcement Learning.” In Advances in Neural Information Processing Systems (NIPS) 29.
- Scott Garrabrant. 2017. “Logical Updatelessness as a Robust Delegation Problem.” Less Wrong.
- Eliezer Yudkowsky. 2015. “Complexity of Value.” Arbital.
- Scott Garrabrant. 2018. “Optimization Amplifies.” Less Wrong.
- Charles Goodhart. 1981. “Problems of Monetary Management: The UK Experience.” In Inflation, Depression, and Economic Policy in the West. Rowman & Littlefield.
- James Smith and Robert Winkler. 2006. “The Optimizer’s Curse: Skepticism and Postdecision Surprise in Decision Analysis.” In Management Science 52:3.
- Jessica Taylor. 2016. “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization.” Paper presented at the AAAI 2016 AI, Ethics and Society Workshop.
- Daniel Dewey. 2011. “Learning What to Value.” In Proceedings of AGI 2011. Springer.
- Abram Demski. 2017. “Stable Pointers to Value: An Agent Embedded in Its Own Utility Function.” Intelligent Agent Foundations Forum.
- Tom Everitt, Victoria Krakovna, Laurent Orseau, Marcus Hutter, and Shane Legg. 2017. “Reinforcement Learning with a Corrupted Reward Channel.” In Proceedings of the 26th International Joint Conference on Artificial Intelligence.
- Paul Christiano, Buck Shlegeris, and Dario Amodei. 2018. “Supervising Strong Learners by Amplifying Weak Experts.” arXiv:1810.08575 [cs.LG].
Subsystem Alignment
( Text Version — Illustrated Version )
- Eliezer Yudkowsky. 2017. “Non-Adversarial Principle.” Arbital.
- Scott Garrabrant. 2018. “Robustness to Scale.” Less Wrong.
- Eliezer Yudkowsky. 2015. “Omnipotence Test for AI Safety.” Arbital.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems (NIPS) 27.
- Eliezer Yudkowsky. 2016. “Optimization Daemons.” Arbital.
- Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2019. “Risks from Learned Optimization in Advanced Machine Learning Systems.” arXiv:1906.01820. Previously cited in draft form as “The Inner Alignment Problem.”
- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. “Concrete Problems in AI Safety.” arXiv:1606.06565 [cs.AI].
- Paul Christiano. 2016. “Learning with Catastrophes.” AI Alignment.
- Paul Christiano. 2018. “Techniques for Optimizing Worst-Case Performance.” AI Alignment.