Reflection in Logic
How can we develop a rigorous basis for reflective reasoning?
A trustworthy machine intelligence will need to reason coherently about its own behavior and about the behavior of similar agents, but such reflection runs into fundamental difficulties articulated by Gödel, Tarski, Löb, and others. At MIRI, we aim to develop a rigorous (rather than ad-hoc) basis for reflective reasoning.
We model self-modification in AI by introducing “tiling” agents: agents whose decision systems will approve the construction of highly similar agents, creating a repeating pattern (including similarity of the offspring’s goals). Constructing a formalism in the most straightforward way produces a Gödelian difficulty, the “Löbian obstacle.” By technical methods we demonstrate the possibility of avoiding this obstacle, but the underlying puzzles of rational coherence are only partially addressed. We extend the formalism to partially unknown deterministic environments, and show a very crude extension to probabilistic environments and expected utility; the problem of finding a fundamental decision criterion for self-modifying probabilistic agents remains open.
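The obstacle can be sketched briefly. Writing \(\Box P\) for “\(P\) is provable in the agent’s theory \(T\),” Löb’s theorem states:

```latex
% Löb's theorem: for any sentence P,
% if T proves that provability of P implies P, then T proves P outright.
T \vdash \Box P \rightarrow P \quad \Longrightarrow \quad T \vdash P
```

So a parent agent reasoning in \(T\) cannot safely adopt the blanket schema \(\Box\ulcorner \mathrm{Safe}(a)\urcorner \rightarrow \mathrm{Safe}(a)\) for its successor’s proofs: by Löb’s theorem, \(T\) would then prove \(\mathrm{Safe}(a)\) for every action \(a\), including unsafe ones. A consistent theory cannot fully trust its own proof machinery, which is why the most straightforward tiling construction fails.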
Alfred Tarski showed that it is impossible for any expressive formal language to contain its own truth predicate. We show that an expressive formal language can nevertheless contain its own “subjective probability” function. The assigned probabilities can be reflectively consistent in the sense of an appropriate analog of the reflection property. In practice, most meaningful assertions must already be treated probabilistically, and very little is lost by allowing some sentences to have probabilities intermediate between 0 and 1.
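The analog of the reflection property can be stated, roughly, as follows: for every sentence \(\varphi\) and all rationals \(a < b\),

```latex
% Reflection (sketch): the probability assignment is certain of any
% true open-interval bound on its own values.
a < \mathbb{P}(\varphi) < b
\;\Longrightarrow\;
\mathbb{P}\bigl( a < \mathbb{P}(\ulcorner \varphi \urcorner) < b \bigr) = 1
```

The strict inequalities matter: demanding that the language know its own probabilities exactly would reintroduce a Tarski-style diagonalization, while open-interval bounds leave just enough slack for a consistent assignment to exist.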
How can we develop a better formal foundation for decision making under uncertainty?
Standard decision procedures are not well-specified enough (e.g., with regard to counterfactuals) to be instantiated as algorithms. These procedures also tend to be inconsistent under reflection: an agent that initially uses causal decision theory will regret doing so, and will attempt to change its own decision procedure. When building trustworthy AIs, we’ll need to develop a better formal foundation for decision making under uncertainty.
Classical game-theoretic agents defect in the Prisoner’s Dilemma even though mutual cooperation would yield higher utility for both agents. Moshe Tennenholtz showed that if each program is allowed to pass its playing strategy to all other players, some programs can then cooperate on the one-shot Prisoner’s Dilemma. Program equilibrium is Tennenholtz’s term for a Nash equilibrium in a context where programs can pass their playing strategies to the other players.
One weakness of this approach so far has been that any two programs which make different choices cannot “recognize” each other for mutual cooperation, even if they are functionally identical. In this paper, provability logic is used to enable a more flexible and secure form of mutual cooperation.
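The brittleness of syntactic self-recognition can be illustrated with a toy sketch (the program representation and names here are ours, for illustration, not from the paper). A “clique” strategy cooperates only with an exact textual copy of itself, so a functionally identical program with a cosmetic difference is treated as a stranger:

```python
# Toy one-shot Prisoner's Dilemma in which each program receives the
# other's source code, represented here simply as a string. CliqueBot
# cooperates only with an exact syntactic copy of itself -- illustrating
# the brittleness described above. All names are illustrative.

CLIQUEBOT = "cooperate if opponent_source == my_source else defect"

def cliquebot(my_source: str, opponent_source: str) -> str:
    """Return 'C' (cooperate) or 'D' (defect) given both source texts."""
    return "C" if opponent_source == my_source else "D"

# A functionally identical program with a cosmetic difference
# (a trailing space) fails the syntactic check:
VARIANT = CLIQUEBOT + " "

print(cliquebot(CLIQUEBOT, CLIQUEBOT))  # "C": exact copies cooperate
print(cliquebot(CLIQUEBOT, VARIANT))    # "D": cosmetic change breaks it
```

Reasoning about what the opponent’s program *provably does*, rather than what it textually *is*, is what the provability-logic approach buys: cooperation becomes robust to behavior-preserving rewrites.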
When formulated using Bayesian networks, two standard decision procedures (evidential decision theory and causal decision theory) can be shown to fail systematically when faced with aspects of the Prisoner’s Dilemma and so-called “Newcomblike” problems. We describe a new form of decision procedure, called Timeless Decision Theory, which consistently wins on these problems.
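The divergence between the two standard procedures can be seen in a back-of-the-envelope expected-utility calculation for Newcomb’s problem. The payoffs and predictor accuracy below are illustrative assumptions, not figures from the paper:

```python
# Toy expected-utility calculation for Newcomb's problem, contrasting
# evidential conditioning with causal intervention. The predictor
# accuracy and payoffs are illustrative assumptions.

ACCURACY = 0.99          # assumed accuracy of the predictor
BIG, SMALL = 1_000_000, 1_000  # opaque-box and transparent-box payoffs

# Evidential reasoning: condition the opaque box's contents on your
# own choice, since the predictor's guess correlates with it.
eu_one_box_edt = ACCURACY * BIG
eu_two_box_edt = (1 - ACCURACY) * BIG + SMALL

# Causal reasoning: the contents are fixed before you choose, so for
# any prior q that the opaque box is full, two-boxing dominates.
q = ACCURACY  # any value of q in [0, 1] gives the same ordering
eu_one_box_cdt = q * BIG
eu_two_box_cdt = q * BIG + SMALL

print(eu_one_box_edt > eu_two_box_edt)   # True: EDT one-boxes
print(eu_two_box_cdt > eu_one_box_cdt)   # True: CDT two-boxes
```

The two procedures thus prescribe opposite actions from the same problem description, which is the kind of systematic failure the Bayesian-network formulation makes precise.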
How can we formally specify an AI’s goals, such that the formalism matches our intentions? How can we ensure those intended goals are preserved even as an AI modifies itself?
AI systems too complex for predefined environment models and actions will need to learn environment models and to choose actions that optimize some criteria. Several authors have described mechanisms by which such complex systems may behave in unintended ways. For AI systems that may pose a threat to humans, we propose a two-stage agent architecture that avoids some known types of unintended behavior. For the first stage of the architecture we show that the most probable finite stochastic program to model a finite history is finitely computable, and that there is an agent that makes such a computation without unintended instrumental actions.
Decision-theoretic agents predict and evaluate the results of their actions using a model of their environment. An agent’s goal, or utility function, may also be specified in terms of the entities within its model. If the agent may upgrade or replace its model, it faces a crisis: the agent’s original goal may not be well-defined with respect to its new model. This crisis must be resolved before the agent can make plans towards achieving its goals. We show a solution to the problem for a limited class of models, and suggest a way forward for finding solutions for broader classes of models.
Evidence from the psychology of motivation, moral psychology, and neuroeconomics suggests that human values are complex and fragile, and thus it is difficult to formally specify an AI’s goals such that the formalism matches our intentions. After examining several approaches to solving this problem, we recommend “ideal preference” approaches as most promising for further development.
What can we predict about future AI? Which interventions appear to be the most beneficial?
In addition to our mathematical research, MIRI investigates important strategic questions, such as: What can (and can’t) we predict about future AI? How can we improve our forecasting ability? Which interventions available today appear to be the most beneficial, given what little we do know?
I.J. Good suggested that a sufficiently advanced machine intelligence could build a smarter version of itself, which could in turn build an even smarter version, and that this process could continue to the point of vastly surpassing human capability. How can we model and test this hypothesis? We identify the key issue as returns on cognitive reinvestment—the ability to invest more computing power, faster computers, or improved cognitive algorithms to yield cognitive labor which produces larger brains, faster brains, or better mind designs. Many phenomena have been claimed as evidence for various positions in this debate, from the observed course of hominid evolution to Moore’s Law to the competence over time of chess programs. This paper explores issues that arise when trying to interpret this evidence in light of Good’s hypothesis, and proposes that the next step in this research is to formalize return-on-investment curves, so that proponents of each position can formally state which models they hold to be falsified by historical observations.
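A minimal sketch of what such a return-on-investment curve might look like (a toy model of our own, not the paper’s formalism): let each generation invest its capability c to build the next, with the reinvestment exponent deciding whether growth accelerates or levels off.

```python
# Toy return-on-investment curve for cognitive reinvestment: each
# generation turns capability c into c' = c + k * c**alpha. An exponent
# alpha > 1 gives accelerating growth; alpha < 1 gives diminishing
# returns. All parameter values are illustrative assumptions.

def trajectory(alpha: float, k: float = 0.1, c0: float = 1.0,
               steps: int = 30) -> list[float]:
    """Capability over successive self-improvement generations."""
    c, out = c0, [c0]
    for _ in range(steps):
        c = c + k * c ** alpha
        out.append(c)
    return out

accelerating = trajectory(alpha=1.1)  # superlinear returns
diminishing = trajectory(alpha=0.9)   # sublinear returns

# Starting from the same point, superlinear returns pull ahead.
print(accelerating[-1] > diminishing[-1])  # True
```

Positions in the debate then become claims about which exponents and coefficients are consistent with the historical record, which is exactly what makes them falsifiable.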
There are strong theoretical reasons to expect timeline forecasts of human-level AI to be quite poor. Moreover, a database of 95 historical AI timeline forecasts demonstrates that these expectations are borne out in practice: expert predictions contradict each other considerably, and are indistinguishable from non-expert predictions and past failed predictions. Predictions that AI lies 15 to 25 years in the future are the most common, from experts and non-experts alike.
We review the evidence for and against three claims: (1) there is a substantial chance we will create human-level AI before 2100; (2) if human-level AI is created, there is a good chance vastly superhuman AI will follow via an “intelligence explosion”; and (3) an uncontrolled intelligence explosion could destroy everything we value, while a controlled intelligence explosion would benefit humanity enormously if we can achieve it. The paper concludes with recommendations for increasing the odds of a controlled intelligence explosion relative to an uncontrolled one.