Josef Urban is a postdoctoral researcher at the Institute for Computing and Information Sciences of the Radboud University in Nijmegen, the Netherlands. His main interest is development of combined inductive and deductive AI methods over large formal (fully semantically specified) knowledge bases, such as large corpora of formally stated mathematical definitions, theorems and proofs. This involves deductive AI fields such as Automated Theorem Proving (ATP) and inductive AI fields such as Machine Learning. In 2003, he made the largest corpus of formal mathematics – the Mizar library – available to ATP and AI methods, and since then has been developing the first approaches, systems and benchmarks for AI and automated reasoning over such corpora. His Machine Learner for Automated Reasoning (MaLARea) has in 2013 solved by a large margin most problems in the mathematically oriented Large Theory division of the CADE Automated Systems Competition (CASC). The system is based on several positive-feedback loops between deducing new proofs and learning how to guide the reasoning process based on successful proofs. Such AI methods are also being deployed as systems that assist interactive theorem proving and formal verification in systems like Mizar, Isabelle and HOL Light.

He received his PhD in Computers Science from the Charles University in Prague in 2004, and MSc in Mathematics at the same university in 1998. In 2003, he co-founded the Prague Automated Reasoning Group, and from 2006 to 2008 he was a Marie-Curie fellow at the University of Miami.

**Luke Muehlhauser**: In Urban & Vyskocil (2013) and other articles, you write about the state of the art in both the formalization of mathematics and automated theorem proving. Before I ask about machine learning in this context, could you quickly sum up for us what we can and can’t do at this stage, when it comes to automated theorem proving?

**Josef Urban**: The best-known result is the proof of the Robbins conjecture found automatically by Bill McCune’s system EQP in 1996. Automated theorem proving (ATP) can be used today for solving some open equational algebraic problems, for example in the quasigroup and loop theory. Proofs of such open problems may be thousands of inferences long, and quite different from the proofs that mathematicians produce. Solving such problems with ATP is often not a “push-button” matter. It may require specific problem reformulation (usually involving only a few axioms) suitable for ATP, exploration of suitable parameters for the ATP search, and sometimes also guiding the final proof attempt by lemmas (“hints”) that were useful for solving simpler related problems.

ATP is today also used for proving small lemmas in large formal libraries developed with interactive theorem provers (ITPs) such as Mizar, Isabelle and HOL Light. This is called “large-theory” ATP. In the past ten years, we have found that even with the large mathematical libraries (thousands of definitions and theorems) available in these ITPs, it is often possible to automatically select the relevant previous knowledge from such libraries and find proofs of small lemmas automatically without any manual pre-processing. Such proofs are usually much shorter than the long proofs of the equational algebraic problems, but even that already helps a lot to the people who use ITPs to formally prove more advanced theorems.

An illustrative example is the Flyspeck (“Formal Proof of Kepler”) project led by Tom Hales, who proved the Kepler conjecture in 1998. The proof is so involved that the only way how to ensure its correctness is formal verification with an ITP. Tom Hales has recently written a 300-page book about the proof, with the remarkable property that most of the concepts, lemmas and proofs have a computer-understandable “Flyspeck” counterpart that is stated and verified in HOL Light. Hales estimates that this has taken about 20 man-years. Recently, we have shown that the proofs of about 47% of the 14000 small Flyspeck lemmas could be found automatically by large-theory ATP methods. This means that the only required mental effort is to formally state these lemmas in HOL Light. Again, the caveat is that the lemmas have to be quite easy to prove from other lemmas that are already in the library. On the other hand, this means that once we have such large corpus of formally-stated everyday mathematics, it is possible to use the corpus together with the large-theory ATP methods to solve many simple formally-stated mathematical problems.

Another interesting recent development are ATP methods that efficiently combine “reasoning” and “computing”. This has been a long-time challenge for ATP systems, particularly when used for software and hardware verification. Recent SMT (Satisfiability Modulo Theories) systems such as Z3 integrate a number of decision procedures, typically about linear and nonlinear arithmetics, bitvectors, etc. I am not an expert in this area, but I get the impression that the growing strength of such systems is making the full formal verification of complex hardware and software systems more and more realistic.

**Luke**: As you see it, what are the most important factors limiting the success and scope of ATP algorithms?

**Josef**: (1) Their low reasoning power, particularly in large and advanced theories, and (2) lack of computer understanding of current human-level (math) knowledge.

The main cause of (1) is that so far the general (“universal”) reasoning algorithms often use too much brute force, without much smart guidance, specific methods for specific subproblems, and methods for self-improvement. Even on mathematical problems that are considered quite easy, such brute-force algorithms then often explode. (2) is a major obstacle because even if the strength of ATP algorithms could be already useful for simple query-answering applications in general mathematics, the systems still do not understand problems posed using the human-level mathematical language and concepts. (1) and (2) are connected: human math prose can be arbitrarily high-level, requiring the discovery of long reasoning chains just for parsing and filling the justification gaps.

One path to improving (1) leads through better (and preferably automated) analysis of how mathematicians think. In other words, through mining and understanding what they know, how they use and produce it, and how they improve their knowledge and reasoning methods. If that is correct, then it should be useful to have large corpora of mathematical knowledge in a (possibly “annotated”) format that can be at least correctly parsed, and then further analysed and perhaps used for building various self-improving ATP algorithms. To some extent, this has been started in the last decade, even though manual (“theory-driven”) engineering of ATP algorithms still largely prevails. Unfortunately, large “ATP-parsable” (or annotated) corpora are still quite rare and expensive to produce today, exactly because of (2), i.e., the gaps between human-level and computer-level understanding and reasoning. Needless to say that dually annotated human-computer corpora of mathematics should be also (together with decent ATP) useful for a “bootstrapped” solving of (2), i.e., for learning more and more computer understanding of human-level mathematical texts. Thanks to the fact that mathematics is the only field with deep formal semantics, study of such corpora might also eventually lead us to better semantic understanding of natural language and strong AI for other sciences.

In software and hardware verification, (1) is probably the main problem, although, for example, semantic annotation of large code libraries is also a major effort. Solving (1) is still arbitrarily hard, but it seems that “code” is generally not so diverse as “mathematics”, and it might be possible to get quite far with a smaller number of manually programmed techniques.

**Luke**: How much of human mathematical knowledge has been formalized such that automated theorem provers can use it? At what pace is that base of formalized mathematics growing?

**Josef**: One metric that we have is the list of 100 theorems created in 1999 and tracked for formal systems by Freek Wiedijk. Currently, 88% of these 100 theorems have been proved in at least one proof assistant, 87% in HOL Light alone. A lot of the undergraduate curriculum is probably already covered to a good extent: the basics of real/complex calculus, measure theory, algebra and linear algebra, general topology, set theory, category theory, combinatorics and graph theory, logic, etc. But I think the PhD-level coverage is still quite far.

There are some advanced mathematical formalizations like the recent formal proof of the Feit-Thompson theorem in Coq, the already mentioned Flyspeck project done in HOL Light, formalization of more than a half of the Compendium of Continuous Lattices in Mizar, and many smaller interesting projects. Some overview pages of the theorems and theories formalized in the systems are the MML Query (Mizar), the Archive of Formal Proofs (Isabelle), the Coq contribs, and the 100-theorems project in HOL Light.

The rate of growth of formal math is not very high, but also the number of authors is quite low. My very rough estimate is about 10k-20k top-level lemmas per year with some 100-300 more important theorems among them. But there is quite a lot of repetition and duplication between various systems, their various libraries, and sometimes even inside one library. People are trying various approaches to formalization, and sometimes prefer to design their own formalization from scratch rather than trying to re-use the work of somebody else. This is quite similar to code libraries. Making the libraries as re-usable as possible is a nontrivial effort.

Strictly speaking, only Mizar, Isabelle and HOL Light are currently accessible to large-theory ATP systems that really try to use the whole libraries at once to prove new conjectures. Coq’s logic is more different from the logics used by the strongest ATP systems and there is so far no sufficiently complete large-theory link between them. But there is also a lot of smaller-scale programmable proof automation inside Coq (and also other interactive provers) already.

**Luke**: The prospect of having large databases of formalized mathematical knowledge naturally leads to the question: “Might we be able to use machine learning algorithms for improved automated theorem proving?” What is the current state of machine learning in the context of automated theorem proving?

**Josef**: Most of the machine learning applications are today in the context of large-theory ATP. The task that has the most developed learning methods is selection of the most relevant theorems and definitions from a large library when proving a new conjecture. This is important for two reasons: (i) ATPs can today often get quite far when given the right axioms that are sufficient, not very redundant, and reasonably close to the conjecture, and (ii) adding irrelevant axioms to the ATP search space often quickly diminishes the chances of finding the proof within reasonable time. This selection problem became known as “premise selection” and a number of learning and non-learning algorithms have been tried for it already and combined in various ways. The oldest one tried since 2003 is naive Bayes, but more recently we tried also kernel methods, various versions of distance-weighted k-nearest neighbor, random forests, and some basic ensemble methods.

A major issue is the choice of good characterization of mathematical objects such as formulas and proofs for the learning algorithms. In a large diverse library, just using the set of symbols as formula features is already useful. This has been extended by methods ranging from purely syntactic n-gram analogs (terms and subterms and their parts, normalized in various ways), to more semantic but still quite syntactic features such as addition of type information and type hierarchies, to strongly semantic features such as the validity of formulas in a large pool of diverse models. Feature preprocessing methods such as TF-IDF and latent semantic analysis help a lot, and quite likely more can be done still.

Another major issue are the proofs used for training. Some proofs are easy for humans, but hard for today’s ATPs, so it is often better to find an ATP proof if possible and learn from it, instead of trying to learn from formally correct but ATP-infeasible proofs. This gives rise to exploratory AI systems that interleave deductive proof-finding and learning from proofs. Positive feedback loops appear: the more ATP proofs (and also counter-examples) we find, the better the next learning iteration, and the more ATP proofs we usually find in the next iteration of learning-advised proof finding. In some sense, one cannot easily “just do inductive AI (learning)” or “just do deductive AI (ATP)”: the strongest approach to large-theory ATP is to really interleave theorem proving with learning from the proofs. This should not be too surprising, because neither humans do science just by pure induction or just by pure deduction. But to me it seems really useful to have today at least one AI domain that allows such combined techniques and pragmatic experiments with them.

A bit similar ideas apply when we stop treating ATPs as black boxes and start to look inside them. Instead of evolving an inductive/deductive system that is good at selecting relevant premises for conjectures (MaLARea is an example), we may try to evolve a system that has a number of specialized problem-solving techniques (ATP strategies) for common classes of problems, and some intuition about which techniques to use for which problem classes. An example is the Blind Strategymaker, which on a large library of different problems co-evolves a population of ATP strategies (treated as sets of search-directing parameters) with a population of solvable training problems, with the ultimate goal of having a reasonably small set of strategies that together solve as many of the problems as possible. The initial expert strategies (“predators”) are mutated and quickly evaluated on their easy prey (the easy training problems they specialize in), and if the mutations show promise (faster solutions) on such training subset, they undergo a more expensive (more time allowed) evaluation on a much wider set of problems, possibly solving some previously unsolvable ones and making them into further training targets. Again, just random mutating on the so-far-unsolved problems is quite inefficient, so one really needs the intuition about which training data the mutations should be grown on. This again yields a fast “inductive” training phase, followed (if successful) by a slower “hard thinking” phase, in which the newly trained strategies attempt to solve some more problems, making them into further training data. The intuition and the deductive capability co-evolve again; doing just one of them does not work so well.

And going again from black-box to white-box, one can also ask what are really the ATP strategies and deductive techniques, and how to steer/program them more than just by finding good combinations of predefined search parameters for problem classes. There are several learning methods that go beyond such parameter settings, the most prominent and successful is probably the “hints” method by Bob Veroff (a bit differently done also by Stephan Schulz), that directs the proof search towards the lemmas (hints) that were found useful in proofs of related problems. I believe that learning such internal guidance is still quite an unexplored territory that we should be working on. One might not only direct the search towards previously useful lemmas, but also look at methods that suggest completely new lemmas and concepts, based on analogies with other problems, deeper knowledge about the problem state, etc. Depending on one’s idea about what is and what is not machine learning, one could also relate here to Douglas Lenat’s seminal work on concept and lemma-finding systems such as the Automated Mathematician and Eurisko.

**Luke**: What trends in this area do you expect in the next 20 years? Do you expect that in 20 years a much greater share of research mathematicians will use ATPs, at least in a highly interactive way (*ala* Lenat with Eurisko)? Do you expect learning (rather than non-learning) ATPs to predominate in the future? Do you expect things to advance so far that the jobs of *some* research mathematicians will be effectively replaced, just as some paralegal jobs have (supposedly) been replaced by better software for searching through legal databases?

**Josef**: First, AI predictions usually happen, but often take longer than expected. 20 years is not really much in these low-profile fields: Flyspeck itself has taken about 20 person-years and Mizar took 40 years from a visionary talk by Andrzej Trybulec in 1973 to the current 1200 articles in the Mizar library. There are not many large projects in these areas and often they are pushed only by extreme determination of single researchers.

One AI breakthrough that I believe is quite reachable within 20 years is deep *automated* semantic understanding (formalization, “machine reading”) of most of LaTeX-written mathematical textbooks. This has been for long time blocked by three factors: (i) lack of annotated formal/informal corpora to train such semantic parsing on, (ii) lack of sufficiently large repository of background mathematical knowledge needed for “obvious-knowledge gap-filling”, and (iii) lack of sufficiently strong large-theory ATP that could fill the reasoning gaps using the large repository of background knowledge. This has changed a lot recently and unless we are really lazy I believe it will mostly happen. It might not be immediately perfect and on the level of manual formalization, but it will gradually improve both by using more data and by better algorithms, a bit similarly to what has been happening with Google Translate between languages like English, Spanish and German. Once it starts happening automatically, it might snowball faster and faster, again thanks to the positive feedback loop between more “reasoning data” becoming accessible and the strength of the data-driven large-theory ATP and AI methods trained on such data.

I don’t think that mathematicians have any reason to “fear AI” (at least at the moment), quite the contrary: more and more are wondering about the current low level of computer understanding and assistance of math. Mathematicians like von Neumann, Turing and Goedel are the reason that we have computers and AI as a discipline in the first place. One of their motivations was Hilbert’s program and the Leibniz-style dreams of machines assisting mathematics and scientific reasoning. It is even a bit of a shame for computer science and AI that in the hundred years since Turing’s birth they have done so little for their “parent” science and that the progress is so slow. At this moment, anybody who tried writing formal mathematics prays for better automation. Without it, formalization will remain a slow, costly and marginal effort. So it is more the opposite: in order for deep semantic treatment of math and sciences to take off, automation has to improve. But once such deep semantic treatment of sciences starts to develop, it means also easier access for mathematicians. Law is just one example; I really like John McCarthy’s futuristic note on formal proof becoming a strong criterion for all kinds of policy making. Complex mathematical proofs that require formal checking are just a tip of the iceberg: today we are building and relying on more and more complex machinery, hardware, software, legal, political and economical systems, and they are all buggy and easy to hack and break. Fixing this all and allowing us to correctly implement much more complex designs is a great future opportunity and market for mathematically thinking people equipped with automation tools that allow them to make fast progress without sacrificing correctness.

I do expect that ATPs will in general have better pattern-detection capabilities in 20 years than they have now, and that they will be able to better accumulate, process and efficiently use previous knowledge and better combine brute-force search with all kinds of guidance on various levels. Specifically in more advanced mathematics, high-level heuristic human-inspired proving/thinking methods might start to be more developed. One way to try to get them automatically is again through basic computer understanding of LaTeX-written mathematical texts, and learning what high-level concepts like “by analogy” and “using diagonalization” exactly semantically mean in various contexts. This is also related to the ability to reformulate problems and map them to a setting (for current ATPs, the best is purely equational) where the theorem proving becomes more easy. And another related work that needs to be done is “explaining back” the long ATP proofs using an understandable mathematical presentation.

The word “learning” has itself many meanings. A recent breakthrough in SAT solving (propositional ATP) is called “conflict-driven clause learning” (CDCL), but it is just a correct (resolution) inference of a new lemma implied by the axioms. The generalization aspect of “learning” is quite limited there. I don’t really know how good the ATPs will be in twenty years, improvements may come from various sides, not just by using “learning”. The main ATP calculi are so far so simple and brute-force, that it is also possible that somebody will come up with a completely new approach which will considerably improve the strength of ATP systems. For example, instantiation-based ATPs like iProver relying on the CDCL-extended SAT solvers have been improved quite a lot in the recent years.

Another trend I expect is more “manual” work on all kinds of targeted decision procedures (dealing with numbers, lists, bitvectors, etc.), particularly when useful for software and hardware verification, and their integration with ATP calculi. I also hope that we will start to be able to detect such targeted algorithms automatically from the successful reasoning patterns done by ATPs. In some sense, we need better automated compilation of “search” into “computing” (we could call it “learning of decision procedures”). There is no rigid distinction between these two: in ATP-inspired programming languages like Prolog, the two things largely coincide. The better we know how to direct the proof search in certain (e.g. numerical) problem classes, the more the proof search turns into efficient computation.

Finally, I might venture one concrete and refutable prediction about the strength of large-theory ATP in 20 years: on the Flyspeck resp. Mizar libraries, learning-assisted large-theory ATP can today prove 47% resp. 40% of the toplevel lemmas. In 20 years, using the same hardware and resources (i.e., not relying on Moore’s Law), we will be able to prove automatically 80% of both (measured on the same versions of the libraries).

**Luke:** Thanks, Josef!

**Did you like this post?** You may enjoy our other Conversations posts, including: