Dr. Lyle Ungar is a Professor of Computer and Information Science at the University of Pennsylvania, where he also holds appointments in multiple departments in the schools of Engineering, Arts and Sciences, Medicine, and Business. He has published over 200 articles and is co-inventor on eleven patents. His research areas include machine learning, data and text mining, and psychology, with a current focus on statistical natural language processing, spectral methods, and the use of social media to understand the psychology of individuals and communities.
Luke Muehlhauser: One of your interests (among many) is forecasting. Some of your current work is funded by IARPA’s ACE program — one of the most exciting research programs happening anywhere in the world, if you ask me.
One of your recent papers, co-authored with Barbara Mellers, Jonathan Baron, and several others, is “Psychological Strategies for Winning a Geopolitical Forecasting Tournament.” The abstract is:
Five university-based research groups competed to assign the most accurate probabilities to events in two geopolitical forecasting tournaments. Our group tested and found support for three psychological drivers of accuracy: training, teaming, and tracking. Training corrected cognitive biases, encouraged forecasters to use reference classes, and provided them with heuristics, such as averaging when multiple estimates were available. Teaming allowed forecasters to share information and discuss the rationales behind their beliefs. Tracking placed the highest performers (top 2% from Year 1) in elite teams that worked together. Results showed that probability training improved calibration. Team collaboration and tracking enhanced both calibration and resolution. Forecasting is often viewed as a statistical problem; but it is also a deep psychological problem. Behavioral interventions improved the accuracy of forecasts, and statistical algorithms improved the accuracy of aggregations. Our group produced the best forecasts two years in a row by putting statistics and psychology to work.
In these experiments, some groups were given scenario training or probability training, which “took approximately 45 minutes, and could be examined throughout the tournament.”
Are these modules available to the public online? If not, can you give us a sense of what they were like? And, do you suspect that significant additional probability or scenario training would further reduce forecasting errors, e.g. if new probability training content was administered to subjects for 30 minutes every two weeks?
Lyle Ungar: I’m sorry, but the modules are not publicly available.
Our main probability training module teaches participants not Bayesian probability rules, but approaches to forecasting. It begins with a discussion of two elements of good probability judgment, calibration and resolution, giving definitions and examples. Our module then provides a number of tips for good forecasting judgment, including 1) consider relevant base rates, 2) average over multiple estimates (if available), 3) use historical data, 4) use statistical predictions whenever possible, and 5) consider simple models based on key variables.
I was surprised how much benefit we got from a 45-minute online training, especially given that many people have found that taking a full course in probability has no benefit for people’s probability estimation. I think the key is the specific approaches to forecasting.
We are developing follow-up training, but I think that the key is not giving forecasters more frequent training. What I think is important to our forecasters’ performance is the fact that our forecasters use their skills every week, and get very concrete feedback about how good their forecasts were, including comparisons with other people’s accuracy. This feedback allows and encourages people to keep learning.
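To make the terms calibration and resolution concrete, here is a minimal sketch of one standard way to measure them: the Murphy decomposition of the Brier score for yes/no events. It is an illustration only, not the training module or the project's scoring code. The function name `brier_decomposition` and the assumption that forecasts fall on a small set of distinct probability values (so they can be grouped exactly) are ours.

```python
from collections import defaultdict


def brier_decomposition(forecasts, outcomes):
    """Split the mean Brier score of binary forecasts into three parts.

    brier = reliability - resolution + uncertainty

    reliability: calibration error (0 means perfectly calibrated).
    resolution:  how far the observed frequencies in each forecast group
                 sit from the overall base rate (larger is better).
    uncertainty: base_rate * (1 - base_rate), fixed by the events themselves.

    Assumption: forecasts take a small number of distinct values, so
    grouping by exact value is meaningful and the identity holds exactly.
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Collect the outcomes observed for each distinct forecast value.
    groups = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        groups[p].append(o)

    reliability = 0.0
    resolution = 0.0
    for p, observed in groups.items():
        freq = sum(observed) / len(observed)  # how often the event happened
        reliability += len(observed) * (p - freq) ** 2
        resolution += len(observed) * (freq - base_rate) ** 2
    reliability /= n
    resolution /= n

    uncertainty = base_rate * (1 - base_rate)
    brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / n
    return {
        "brier": brier,
        "reliability": reliability,
        "resolution": resolution,
        "uncertainty": uncertainty,
    }
```

A forecaster who says 80% for events that happen 80% of the time has a reliability term of zero; the resolution term rewards forecasts that also separate likely events from unlikely ones, which always answering with the base rate would not do.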
Luke: I haven’t personally signed up for any of the ACE forecasting tournaments because I saw that the questions drew on very narrow domain knowledge: for example, one SciCast question is “When will an operating graphene-based nano-antenna be demonstrated?” My sense was that even with 10 minutes of research into such a question, I wouldn’t be able to do much better than distributing my probability mass evenly across all available non-crazy answers. Or, if I was allowed to see others’ estimates, I wouldn’t be able to do better than just copying the median response, even with 10 minutes of research on a single question.
For that reason, I’ve been thinking that the low-hanging fruit in mass calibration training would be to develop an app that feeds people questions for which many players could be expected to outperform random guessing (or exactly copying others) with mere seconds of thought — e.g. questions (with no Googling allowed) about basic science, or about which kinds of things tend to happen in normal human relationships, or about what happened in a famous historical event from the last 100 years.
Of course, that’s “retrodiction” rather than forecasting, but I suspect it would be useful calibration training nonetheless, and it could be more rewarding to engage with because it takes less time per question and participants could learn from their mistakes more quickly. This is the approach taken by many of the questions on the Center for Applied Rationality’s credence calibration game, though unfortunately that game currently has too few questions in its database (~1000, I think?), and too many of them are questions about historical sports outcomes, which are as obscure to non-sports-fans as the SciCast question about nano-antennas is to most people. (I had to tap “50% confident” for all those questions.)
One could even imagine it being gamified in various ways, taking lessons from games like DragonBox, which feels like a game but is actually teaching kids algebra.
What do you think of my impressions about that? If regular practice is what likely makes the difference for people’s calibration, how could one plausibly create a scalable tool for calibration training (either retrodiction or forecasting) that people would actually want to use?
Lyle: First, let me clarify that the best performers on our Team Good Judgement competition are not people with specialized expertise, but people who work hard, collect lots of information and think carefully about it.
I like your idea of calibration training. I’m not sure how well performance on problems like sports betting or guessing the height of Mount Everest generalizes to real prediction problems. That’s a good question, and one that someone should test. My intuition is that many of the skills needed for good performance on problems like geo-political forecasting (e.g. picking a good reference class of events and using base rates from those as a starting point for a forecast) are quite different from the skills needed for retrodiction “guessing games”, but perhaps calibration would generalize. Or perhaps not.
Luke: How much calendar time is there between when the forecasts are made and when the forecasted events occur?
Lyle: We forecast events that range from one week to one year in the future. Predicting events months in the future is a good time frame, since one can start with situations where the outcome is unclear, observe how probability estimates change as the world evolves, and also see what the actual outcome is.
An important aspect of our forecasting competition is that we make estimates every day about the probabilities of the future events. Individual forecasters, of course, update less frequently (they all have day jobs), but we evaluate people on their average daily accuracy, and we combine their individual forecasts to get a daily update on our aggregate estimate of how likely each future event is.
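The daily scoring described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual method: we assume a forecaster's most recent probability is carried forward on days without an update, that accuracy is measured with a squared-error (Brier-style) score on a yes/no question, and that the aggregate is a plain unweighted mean, which stands in for whatever aggregation algorithm the team really used. All function names are hypothetical.

```python
def carry_forward(updates, n_days):
    """Expand sparse updates {day_index: probability} into one value per day.

    Days before the forecaster's first update are None (not yet forecasting).
    """
    daily = []
    current = None
    for day in range(n_days):
        if day in updates:
            current = updates[day]
        daily.append(current)
    return daily


def mean_daily_brier(daily, outcome):
    """Average squared error over the days on which a forecast was standing.

    outcome is 1 if the event happened, 0 otherwise. Returns None if the
    forecaster never made a forecast.
    """
    scores = [(p - outcome) ** 2 for p in daily if p is not None]
    return sum(scores) / len(scores) if scores else None


def daily_aggregate(all_daily):
    """Unweighted mean of the standing forecasts on each day.

    A day on which nobody has forecast yet yields None.
    """
    n_days = len(all_daily[0])
    aggregate = []
    for day in range(n_days):
        active = [d[day] for d in all_daily if d[day] is not None]
        aggregate.append(sum(active) / len(active) if active else None)
    return aggregate


# Example: two forecasters on a four-day question that resolved "yes".
alice = carry_forward({0: 0.5, 2: 0.8}, n_days=4)
bob = carry_forward({1: 0.6}, n_days=4)
print(mean_daily_brier(alice, outcome=1), mean_daily_brier(bob, outcome=1))
print(daily_aggregate([alice, bob]))
```

Scoring every day rather than only the final forecast rewards people who move toward the right answer early, which fits the emphasis on regular updating and feedback.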
Luke: What are the prospects, do you think, for a similar research project investigating forecasts of events that range from 2-5 years in the future? Would you expect the “super forecasters” in the current project to show similar performance on forecasts with longer time horizons?
Lyle: In general, forecasting farther in the future is harder. (Think of predicting election outcomes; it’s much easier to predict an election outcome as the election date gets closer.) Our super-forecasters are super, but not magic, so they will tend to be less accurate about long-range predictions. What will life be like in a hundred years? That’s probably a job for a futurist or science fiction writer, not a forecaster.
I don’t think many funders will have the patience to wait five years to see how good our (or anyone’s) forecasting methods are. A more promising direction, which we are pursuing, is to create clusters of questions. Some will be longer term, or perhaps even poorly specified (“Is China getting more aggressive?”). Others will be shorter term, but correlated with the longer term outcomes. Then we can estimate changes in probabilities of long-term or vague questions based on shorter term, clearly resolvable ones.
Luke: Years ago, you also wrote a review article on forecasting with neural nets. If you were given a sufficient budget to forecast something, what heuristics would you use to decide which forecasting methods to use? When are neural nets vs. prediction markets vs. team-based forecasting vs. large computer models vs. other methods appropriate?
Lyle: Firstly, neural nets are just a very flexible class of equations used to fit data; i.e., they are a statistical estimation method. Modern versions of them (“deep neural nets”) are very popular now at companies like Google and Facebook, mostly for recognizing objects in images and for speech recognition, and they work great if one has lots of data on which to “train” them (that is, to estimate the model).
Which leads me to the answer to your question:
I think one can roughly characterize forecasting problems into categories — each requiring different forecasting methods — based, in part, on how much historical data is available.
Some problems, like the geo-political forecasting we are doing, require lots of information collection and human thought. Prediction markets and team-based forecasts both work well for sifting through the conflicting information about international events. Computer models mostly don’t work as well here: there isn’t a long enough track record of, say, elections or coups in Mali to fit a good statistical model, and it isn’t obvious what other countries are ‘similar.’
Other problems, like predicting energy usage in a given city on a given day, are well suited to statistical models (including neural nets). We know the factors that matter (day of the week, holiday or not, weather, and overall trends), and we have thousands of days of historical observation. Human intuition is not going to beat computers on that problem.
Yet other classes of problems, like economic forecasting (What will the GDP of Germany be next year? What will unemployment in California be in two years?), are somewhere in the middle. One can build big econometric models, but there is still human judgement about the factors that go into them. (What if Merkel changes her mind or Greece suddenly adopts austerity measures?) We don’t have enough historical data to accurately predict economic decisions of politicians.
The bottom line is that if you have lots of data and the world isn’t changing too much, you can use statistical methods. For questions with more uncertainty, human experts become more important. Who will win the US election tomorrow? Plug the results of polls into a statistical model. Who will win the US election in a year? Check the Iowa prediction markets. Who will win the US election in five years? No one knows, but a team of experts might be your best bet.
Luke: Thanks, Lyle!