Embedded Agency is a write-up by Abram Demski and Scott Garrabrant, available on the AI Alignment Forum. There’s also a shorter version of the post as a hand-drawn sequence, and a lightly rewritten version on arXiv.

We’ve included links and references below, listed in the order they come up in the relevant topic/section.




( Text Introduction  —  Illustrated Introduction  —  MIRI Blog Afterword  —  LessWrong Afterword )


Further reading: “Security Mindset and Ordinary Paranoia”; “Agent Foundations for Aligning Machine Intelligence with Human Interests”



Decision Theory

( Text Version  —  Illustrated Version )




Embedded World-Models

( Text Version  —  Illustrated Version )


Further reading: “The Problem with AIXI”



Robust Delegation

( Text Version  —  Illustrated Version )


Further reading: “Problem of Fully Updated Deference”



Subsystem Alignment

( Text Version  —  Illustrated Version )


  • Eliezer Yudkowsky. 2017. “Non-Adversarial Principle.” Arbital.
  • Scott Garrabrant. 2018. “Robustness to Scale.” LessWrong.
  • Eliezer Yudkowsky. 2015. “Omnipotence Test for AI Safety.” Arbital.
  • Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems (NIPS) 27.
  • Eliezer Yudkowsky. 2016. “Optimization Daemons.” Arbital.
  • Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Forthcoming. “The Inner Alignment Problem.” Draft.
  • Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. “Concrete Problems in AI Safety.” arXiv:1606.06565 [cs.AI].