Risks from Learned Optimization in Advanced ML Systems

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant

This paper is available on arXiv, the AI Alignment Forum, and LessWrong.


We analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? In this paper, we provide an in-depth analysis of these two primary questions and provide an overview of topics for future research.



Section 1 Glossary:

  • Base optimizer: A base optimizer is an optimizer that searches through algorithms according to some objective.
    • Base objective: A base objective is the objective of a base optimizer.
  • Behavioral objective: The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.
  • Inner alignment: The inner alignment problem is the problem of aligning the base objective and the mesa-objective of an advanced ML system.
  • Learned algorithm: The algorithms that a base optimizer is searching through are called learned algorithms.
  • Mesa-optimizer: A mesa-optimizer is a learned algorithm that is itself an optimizer.
    • Mesa-objective: A mesa-objective is the objective of a mesa-optimizer.
  • Meta-optimizer: A meta-optimizer is a system which is tasked with producing a base optimizer.
  • Optimizer: An optimizer is a system that internally searches through some space of possible outputs, policies, plans, strategies, etc., looking for those that do well according to some internally-represented objective function.
  • Outer alignment: The outer alignment problem is the problem of aligning the base objective of an advanced ML system with the desired goal of the programmers.
  • Pseudo-alignment: A mesa-optimizer is pseudo-aligned with the base objective if it appears aligned on the training data but is not robustly aligned.
  • Robust alignment: A mesa-optimizer is robustly aligned with the base objective if it robustly optimizes for the base objective across distributions.
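To make the base optimizer / learned algorithm distinction concrete, here is a toy sketch of our own (not from the paper): a crude base optimizer that searches a space of learned algorithms (linear functions `f(x) = w * x`) by random search according to a base objective (squared error on training data). All names and the setup are illustrative assumptions.

```python
import random

def base_objective(learned_algorithm, data):
    """Base objective: loss of a candidate learned algorithm on the training data."""
    return sum((learned_algorithm(x) - y) ** 2 for x, y in data)

def base_optimizer(data, steps=1000, seed=0):
    """A crude base optimizer: random search over the parameter w."""
    rng = random.Random(seed)
    best_w, best_loss = 0.0, float("inf")
    for _ in range(steps):
        w = rng.uniform(-10, 10)
        loss = base_objective(lambda x, w=w: w * x, data)
        if loss < best_loss:
            best_w, best_loss = w, loss
    # The output of the base optimizer is itself an algorithm:
    return lambda x: best_w * x

data = [(x, 3 * x) for x in range(5)]  # training data for the target f(x) = 3x
f = base_optimizer(data)               # f is the selected learned algorithm
```

The learned algorithm here is just a fixed function, so it is not a mesa-optimizer; a mesa-optimizer would be a learned algorithm that itself performs search over an internally-represented objective.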

Section 2 Glossary:

  • Algorithmic range: The algorithmic range of a machine learning system refers to how extensive the set of algorithms capable of being found by the base optimizer is.
  • Local optimization process: A local optimization process is an optimizer that uses local hill-climbing as its means of search.
  • Reachability: The reachability of a learned algorithm refers to how difficult it is for the base optimizer to find that learned algorithm.
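A local optimization process can be sketched with a minimal hill climber (a toy of our own construction, not taken from the paper): rather than sampling the whole search space, it repeatedly moves to the best nearby candidate under its objective.

```python
def hill_climb(objective, x0, step=0.1, iters=200):
    """Greedy local hill-climbing over a 1-D search space:
    at each step, move to whichever neighbor scores best."""
    x = x0
    for _ in range(iters):
        candidates = [x - step, x, x + step]
        x = max(candidates, key=objective)
    return x

# A single-peaked objective with its optimum at x = 3.
objective = lambda x: -(x - 3) ** 2
x_star = hill_climb(objective, x0=0.0)  # converges near 3
```

Because such a process only takes small local steps, which algorithms it can reach depends on the path through the search space, which is one reason reachability matters for what kinds of learned algorithms a local base optimizer like gradient descent actually finds.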

Section 3 Glossary:

  • Approximate alignment: An approximately aligned mesa-optimizer is a pseudo-aligned mesa-optimizer whose mesa-objective matches the base objective only up to some approximation error, due to the difficulty of representing the base objective in the mesa-optimizer.
  • Proxy alignment: A proxy aligned mesa-optimizer is a pseudo-aligned mesa-optimizer that has learned to optimize for some proxy of the base objective instead of the base objective itself.
    • Instrumental alignment: Instrumental alignment is a type of proxy alignment in which the mesa-optimizer optimizes the base objective as an instrumental goal of increasing the mesa-objective in the training distribution.
    • Side-effect alignment: Side-effect alignment is a type of proxy alignment in which optimizing for the mesa-objective has the direct causal result of increasing the base objective in the training distribution.
  • Suboptimality alignment: A suboptimality aligned mesa-optimizer is a pseudo-aligned mesa-optimizer in which some deficiency, error, or limitation causes it to exhibit aligned behavior.
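A toy numerical sketch of our own (all details are illustrative assumptions, not the paper's) may help: a proxy objective can agree with the base objective everywhere in the training distribution, so a mesa-optimizer selected for the proxy looks aligned there, yet the two diverge off-distribution.

```python
def base_objective(x):
    return -abs(x - 10)          # base goal: reach the state x = 10

def proxy_objective(x):
    return x                     # proxy: "move as far right as possible"

training_states = range(0, 10)   # proxy and base objectives agree here...
deployment_states = range(0, 20) # ...but diverge for states past x = 10

# The "mesa-optimizer": pick whichever available state maximizes the proxy.
chosen_train = max(training_states, key=proxy_objective)    # looks aligned
chosen_deploy = max(deployment_states, key=proxy_objective)  # overshoots
```

On the training distribution the proxy-optimal choice (x = 9) is also the base-optimal one, so the pseudo-alignment is invisible; in deployment the proxy-optimal choice (x = 19) scores badly under the base objective.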

Section 4 Glossary:

  • Corrigible alignment: A corrigibly aligned mesa-optimizer is a robustly aligned mesa-optimizer that has a mesa-objective that “points to” its epistemic model of the base objective.
  • Deceptive alignment: A deceptively aligned mesa-optimizer is a pseudo-aligned mesa-optimizer that has enough information about the base objective to seem more fit from the perspective of the base optimizer than it actually is.
  • Internal alignment: An internally aligned mesa-optimizer is a robustly aligned mesa-optimizer that has internalized the base objective in its mesa-objective.


