Daniel Roy is an Assistant Professor of Statistics at the University of Toronto. Roy earned an S.B. and M.Eng. in Electrical Engineering and Computer Science, and a Ph.D. in Computer Science, from MIT. His dissertation on probabilistic programming received the department’s George M Sprowls Thesis Award. Subsequently, he held a Newton International Fellowship of the Royal Society, hosted by the Machine Learning Group at the University of Cambridge, and then held a Research Fellowship at Emmanuel College. Roy’s research focuses on theoretical questions that mix computer science, statistics, and probability.
Luke Muehlhauser: The abstract of Ackerman, Freer, and Roy (2010) begins:
As inductive inference and machine learning methods in computer science see continued success, researchers are aiming to describe even more complex probabilistic models and inference algorithms. What are the limits of mechanizing probabilistic inference? We investigate the computability of conditional probability… and show that there are computable joint distributions with noncomputable conditional distributions, ruling out the prospect of general inference algorithms.
In what sense does your result (with Ackerman & Freer) rule out the prospect of general inference algorithms?
Daniel Roy: First, it’s important to highlight that when we say “probabilistic inference” we are referring to the problem of computing conditional probabilities, while highlighting the role of conditioning in Bayesian statistical analysis.
Bayesian inference centers around so-called posterior distributions. From a subjectivist standpoint, the posterior represents one’s updated beliefs after seeing (i.e., conditioning on) the data. Mathematically, a posterior distribution is simply a conditional distribution (and every conditional distribution can be interpreted as a posterior distribution in some statistical model), and so our study of the computability of conditioning also bears on the problem of computing posterior distributions, which is arguably one of the core computational problems in Bayesian analyses.
Second, it’s important to clarify what we mean by “general inference”. In machine learning and artificial intelligence (AI), there is a long tradition of defining formal languages in which one can specify probabilistic models over a collection of variables. Defining distributions can be difficult, but these languages can make it much more straightforward.
The goal is then to design algorithms that can use these representations to support important operations, like computing conditional distributions. Bayesian networks can be thought of as such a language: You specify a distribution over a collection of variables by specifying a graph over these variables, which breaks down the entire distribution into “local” conditional distributions corresponding with each node, which are themselves often represented as tables of probabilities (at least in the case where all variables take on only a finite set of values). Together, the graph and the local conditional distributions determine a unique distribution over all the variables.
An inference algorithms that support the entire class of all finite, discrete, Bayesian networks might be called general, but as a class of distributions, those having finite, discrete Bayesian networks is a rather small one.
In this work, we are interested in the prospect of algorithms that work on very large classes of distributions. Namely, we are considering the class of samplable distributions, i.e., the class of distributions for which there exists a probabilistic program that can generate a sample using, e.g., uniformly distributed random numbers or independent coin flips as a source of randomness. The class of samplable distributions is a natural one: indeed it is equivalent to the class of computable distributions, i.e., those for which we can devise algorithms to compute lower bounds on probabilities from descriptions of open sets. The class of samplable distributions is also equivalent to the class of distributions for which we can compute expectations from descriptions of bounded continuous functions.
The class of samplable distributions is, in a sense, the richest class you might hope to deal with. The question we asked was: is there an algorithm that, given a samplable distribution on two variables X and Y, represented by a program that samples values for both variables, can compute the conditional distribution of, say, Y given X=x, for almost all values for X? When X takes values in a finite, discrete set, e.g., when X is binary valued, there is a general algorithm, although it is inefficient. But when X is continuous, e.g., when it can take on every value in the unit interval [0,1], then problems can arise. In particular, there exists a distribution on a pair of numbers in [0,1] from which one can generate perfect samples, but for which it is impossible to compute conditional probabilities for one of the variables given the other. As one might expect, the proof reduces the halting problem to that of conditioning a specially crafted distribution.
This pathological distribution rules out the possibility of a general algorithm for conditioning (equivalently, for probabilistic inference). The paper ends by giving some further conditions that, when present, allow one to devise general inference algorithms. Those familiar with computing conditional distributions for finite-dimensional statistical models will not be surprised that conditions necessary for Bayes’ theorem are one example.
Read more »