Daniel Roy on probabilistic programming and AI

 |   |  Conversations

Daniel Roy portrait Daniel Roy is an Assistant Professor of Statistics at the University of Toronto. Roy earned an S.B. and M.Eng. in Electrical Engineering and Computer Science, and a Ph.D. in Computer Science, from MIT.  His dissertation on probabilistic programming received the department’s George M Sprowls Thesis Award.  Subsequently, he held a Newton International Fellowship of the Royal Society, hosted by the Machine Learning Group at the University of Cambridge, and then held a Research Fellowship at Emmanuel College. Roy’s research focuses on theoretical questions that mix computer science, statistics, and probability.

Luke Muehlhauser: The abstract of Ackerman, Freer, and Roy (2010) begins:

As inductive inference and machine learning methods in computer science see continued success, researchers are aiming to describe even more complex probabilistic models and inference algorithms. What are the limits of mechanizing probabilistic inference? We investigate the computability of conditional probability… and show that there are computable joint distributions with noncomputable conditional distributions, ruling out the prospect of general inference algorithms.

In what sense does your result (with Ackerman & Freer) rule out the prospect of general inference algorithms?

Daniel Roy: First, it’s important to highlight that when we say “probabilistic inference” we are referring to the problem of computing conditional probabilities, while highlighting the role of conditioning in Bayesian statistical analysis.

Bayesian inference centers around so-called posterior distributions. From a subjectivist standpoint, the posterior represents one’s updated beliefs after seeing (i.e., conditioning on) the data. Mathematically, a posterior distribution is simply a conditional distribution (and every conditional distribution can be interpreted as a posterior distribution in some statistical model), and so our study of the computability of conditioning also bears on the problem of computing posterior distributions, which is arguably one of the core computational problems in Bayesian analyses.

Second, it’s important to clarify what we mean by “general inference”. In machine learning and artificial intelligence (AI), there is a long tradition of defining formal languages in which one can specify probabilistic models over a collection of variables. Defining distributions can be difficult, but these languages can make it much more straightforward.

The goal is then to design algorithms that can use these representations to support important operations, like computing conditional distributions. Bayesian networks can be thought of as such a language: You specify a distribution over a collection of variables by specifying a graph over these variables, which breaks down the entire distribution into “local” conditional distributions corresponding with each node, which are themselves often represented as tables of probabilities (at least in the case where all variables take on only a finite set of values). Together, the graph and the local conditional distributions determine a unique distribution over all the variables.

An inference algorithms that support the entire class of all finite, discrete, Bayesian networks might be called general, but as a class of distributions, those having finite, discrete Bayesian networks is a rather small one.

In this work, we are interested in the prospect of algorithms that work on very large classes of distributions. Namely, we are considering the class of samplable distributions, i.e., the class of distributions for which there exists a probabilistic program that can generate a sample using, e.g., uniformly distributed random numbers or independent coin flips as a source of randomness. The class of samplable distributions is a natural one: indeed it is equivalent to the class of computable distributions, i.e., those for which we can devise algorithms to compute lower bounds on probabilities from descriptions of open sets. The class of samplable distributions is also equivalent to the class of distributions for which we can compute expectations from descriptions of bounded continuous functions.

The class of samplable distributions is, in a sense, the richest class you might hope to deal with. The question we asked was: is there an algorithm that, given a samplable distribution on two variables X and Y, represented by a program that samples values for both variables, can compute the conditional distribution of, say, Y given X=x, for almost all values for X? When X takes values in a finite, discrete set, e.g., when X is binary valued, there is a general algorithm, although it is inefficient. But when X is continuous, e.g., when it can take on every value in the unit interval [0,1], then problems can arise. In particular, there exists a distribution on a pair of numbers in [0,1] from which one can generate perfect samples, but for which it is impossible to compute conditional probabilities for one of the variables given the other. As one might expect, the proof reduces the halting problem to that of conditioning a specially crafted distribution.

This pathological distribution rules out the possibility of a general algorithm for conditioning (equivalently, for probabilistic inference). The paper ends by giving some further conditions that, when present, allow one to devise general inference algorithms. Those familiar with computing conditional distributions for finite-dimensional statistical models will not be surprised that conditions necessary for Bayes’ theorem are one example.

Luke: In your dissertation (and perhaps elsewhere) you express a particular interest in the relevance of probabilistic programming to AI, including the original aim of AI to build machines which rival the general intelligence of a human. How would you describe the relevance of probabilistic programming to the long-term dream of AI?

Daniel: If you look at early probabilistic programming systems, they were built by AI researchers: De Raedt, Koller, McAllester, Muggleton, Pfeffer, Poole, Sato, to name a few. The Church language, which was introduced in joint work with Bonawitz, Mansinghka, Goodman, and Tenenbaum while I was a graduate student at MIT, was conceived inside a cognitive science laboratory, foremost to give us a language rich enough to express the range of models that people were inventing all around us. So, for me, there’s always been a deep connection. On the other hand, the machine learning community as a whole is somewhat allergic to AI and so the pitch to that community has more often been pragmatic: these systems may someday allow experts to conceive, prototype, and deploy much larger probabilistic systems, and at the same time, empower a much larger community of nonexperts to use probabilistic modeling techniques to understand their data. This is the basis for the DARPA PPAML program, which is funding 8 or so teams to engineer scalable systems over the next 4 years.

From an AI perspective, probabilistic programs are an extremely general representation of knowledge, and one that identifies uncertainty with stochastic computation. Freer, Tenenbaum, and I recently wrote a book chapter for the Turing centennial that uses a classical medical diagnosis example to showcase the flexibility of probabilistic programs and a general QUERY operator for performing probabilistic conditioning. Admittedly, the book chapter ignores the computational complexity of the QUERY operator, and any serious proposal towards AI cannot do this indefinitely. Understanding when we can hope to efficiently update our knowledge in light of new observations is a rich source of research questions, both applied and theoretical, spanning not only AI and machine learning, but also statistics, probability, physics, theoretical computer science, etc.

Luke: Is it fair to think of QUERY as a “toy model” that we can work with in concrete ways to gain more general insights into certain parts of the long-term AI research agenda, even though QUERY is unlikely to be directly implemented in advanced AI systems? (E.g. that’s how I think of AIXI.)

Daniel: I would hesitate to call QUERY a toy model. Conditional probability is a difficult concept to master, but, for those adept at reasoning about the execution of programs, QUERY demystifies the concept considerably. QUERY is an important conceptual model of probabilistic conditioning.

That said, the simple guess-and-check algorithm we present in our Turing article runs in time inversely proportional to the probability of the event/data on which one is conditioning. In most statistical settings, the probability of a data set decays exponentially towards 0 as a function of the number of data points, and so guess-and-check is only useful for reasoning with toy data sets in these settings. It should come as no surprise to hear that state-of-the-art probabilistic programming systems work nothing like this.

On the other hand, QUERY, whether implemented in a rudimentary fashion or not, can be used to represent and reason probabilistically about arbitrary computational processes, whether they are models of the arrival time of spam, the spread of disease through networks, or the light hitting our retinas. Computer scientists, especially those who might have had a narrow view of the purview of probability and statistics, will see a much greater overlap between these fields and their own once they understand QUERY.

To those familiar with AIXI, the difference is hopefully clear: QUERY performs probabilistic reasoning in a model given as input. AIXI, on the other hand, is itself a “universal” model that, although not computable, would likely predict (hyper)intelligent behavior, were we (counterfactually) able to perform the requisite probabilistic inferences (and feed it enough data). Hutter gives an algorithm implementing an approximation to AIXI, but its computational complexity still scales exponentially in space. AIXI is fascinating in many ways: If we ignore computational realities, we get a complete proposal for AI. On the other hand, AIXI and its approximations take maximal advantage of this computational leeway and are, therefore, ultimately unsatisfying. For me, AIXI and related ideas highlight that AI must be as much a study of the particular as it of the universal. Which potentially unverifiable, but useful, assumptions will enable us to efficiently represent, update, and act upon knowledge under uncertainty?

Luke: You write that “AI must be as much a study of the particular as it is of the universal.” Naturally, most AI scientists are working on the particular, the near term, the applied. In your view, what are some other examples of work on the universal, in AI? Schmidhuber’s Gödel machine comes to mind, and also some work that is as likely to be done in a logic or formal philosophy department as a computer science department — e.g. perhaps work on logical priors — but I’d love to hear what kinds of work you’re thinking of.

Daniel: I wouldn’t equate any two of the particular, near-term, or applied. By the word particular, I am referring to, e.g., the way that our environment affects, but is also affected by, our minds, especially through society. More concretely, both the physical spaces in which most of us spend our days and the mental concepts we regularly use to think about our daily activities are products of the human mind. But more importantly, these physical and mental spaces are necessarily ones that are easily navigated by our minds. The coevolution by which this interaction plays out is not well studied in the context of AI. And to the extent that this cycle dominates, we would expect a universal AI to be truly alien. On the other hand, exploiting the constraints of human constructs may allow us to build more effective AIs.

As for the universal, I have an interest in the way that noise can render idealized operations computable or even efficiently computable. In our work on the computability of conditioning that came up earlier in the discussion, we show that adding sufficiently smooth independent noise to a random variable allows us to perform conditioning in situations where we would not have been able to otherwise. There are examples of this idea elsewhere. For example, Braverman, Grigo, and Rojas study noise and intractability in dynamical systems. Specifically, they show that computing the invariant measure characterizing the long-term statistical behavior of dynamical systems is not possible. The road block is the computational power of the dynamical system itself. The addition of a small amount of noise to the dynamics, however, decreases the computational power of the dynamical system, and suffices to make the invariant measure computable. In a world subject to noise (or, at least, well modeled as such), it seems that many theoretical obstructions melt away.

Luke: Thanks, Daniel!