Can generative AI forecast the future?
What I've learned from placing 10th in Metaculus's AI Forecasting tournament.
The best way I know to evaluate whether or not something works is to have some skin in the game. There's been a lot of talk lately about AI and its ability to forecast the future, but as far as I know, none of those people have really put their money where their mouth is. In this post, I'm going to walk you through my experience competing in Metaculus's AI Forecasting competition, where I placed 10th overall[1], and share what I learned about the potential, and the current limitations, of AI as a forecaster.
The Challenge
When I first saw that Metaculus was running a competition for AI forecasting, I was intrigued. I’ve been experimenting with generative AI for a number of years now in various capacities, but I hadn’t yet tried using it for predictions. To be honest, I had my doubts about its utility in this field—past experience had revealed its tendency to randomly hallucinate and struggle to use tools well. But there was a cash prize, and the competition didn’t have a lot of competitors at the time, so I figured I might as well use it as a learning opportunity.
The objective of the competition was straightforward: see how well AI bots could do vs a general community of forecasters and superforecasters. The rules of the competition were:
No humans in the loop
Bots must submit a comment that includes their reasoning alongside their forecast
Only one bot per team
Bot makers have to share a description of how their bot works and/or the code it uses
Metaculus also provided helpful templates for building a basic bot that could compete (one that could pull questions, generate a forecast, and submit it along with reasoning via their API).
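For readers who haven't seen those templates, the core loop is small: fetch the open questions for the tournament, generate a forecast, and post it back with a reasoning comment. Below is a rough sketch of that loop using plain requests; the endpoint paths, parameters, and payload shapes are written from memory and should be treated as placeholders rather than the exact Metaculus API.

```python
# Rough sketch of the pull/submit loop the Metaculus templates handle for you.
# Endpoint paths and payloads are illustrative placeholders, not the exact API.
import requests

API_BASE = "https://www.metaculus.com/api2"                 # assumed base URL
HEADERS = {"Authorization": "Token YOUR_METACULUS_TOKEN"}   # hypothetical credential

def list_open_questions(tournament_id: int) -> list[dict]:
    """Fetch open questions for a tournament (query parameters are illustrative)."""
    resp = requests.get(
        f"{API_BASE}/questions/",
        params={"tournaments": tournament_id, "status": "open"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]

def submit_forecast(question_id: int, probability: float, comment: str) -> None:
    """Post a probability (0-1) along with the reasoning comment the rules require."""
    requests.post(
        f"{API_BASE}/questions/{question_id}/predict/",
        json={"prediction": probability},
        headers=HEADERS,
        timeout=30,
    ).raise_for_status()
    requests.post(
        f"{API_BASE}/comments/",
        json={"question": question_id, "comment_text": comment},
        headers=HEADERS,
        timeout=30,
    ).raise_for_status()
```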
Building the Forecasting Bot
Since it was a competition, naturally I wanted to do well. I knew that to do that I'd have to come up with a strategy that differentiated my bot from its competitors. At the very least, that meant doing something other than just using the bot template that Metaculus provided, since I assumed most competitors would use it, or some slight variation of it.
The Metaculus template was pretty straightforward: it prompted the bot to act like a professional forecaster interviewing for a job, with each competition question as its interview prompt. It also included the option to pipe in the latest information from external news sources, like Perplexity or AskNews, to assist with predictions[2].
The bot could also use different foundation models via the llama-index library (like OpenAI and Anthropic). Given all of this, I knew I had to adjust the inputs used, use different models, take an entirely different approach, or some combination of these.
My Approach
Leveraging Superforecasting Research
My approach started by referencing research on superforecasting. I’ve had an interest in this topic for some time, so I revisited resources and books I’ve read, including the works of Philip Tetlock and Daniel Kahneman. I also reviewed some recent papers that have looked at how LLMs do with forecasting. The idea here was to "build on the shoulders of giants" by improving upon known research and strategies.
I started by using a prompt found in one of the research papers, designed to get the bot to think like a superforecaster and follow the "superforecasting 10 commandments"[3]. I combined this prompt with some custom tweaks and elements of the Metaculus template's prompt to get predictions in the needed format. The custom tweaks mostly came out of testing: I struggled to get the bot to provide its prediction in the correct format (so I could retrieve it) and to actually use its tools[4].
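To make that concrete, here is a minimal sketch of the formatting side of those tweaks: instruct the model to end with a probability on a fixed final line, then parse it out with a regex. The prompt wording and helper names here are illustrative, not the exact ones I used.

```python
# Minimal sketch: ask for the probability on a fixed final line, then parse it.
import re

FORMAT_INSTRUCTIONS = (
    "Reason through the question like a superforecaster, following the ten "
    "commandments. After your reasoning, end with a single line of the form:\n"
    "Probability: ZZ%\n"
    "where ZZ is a number between 0 and 100."
)

def build_prompt(question: str, background: str) -> str:
    """Combine the question, any background/news text, and the format instructions."""
    return f"Question: {question}\n\nBackground:\n{background}\n\n{FORMAT_INSTRUCTIONS}"

def extract_probability(llm_output: str) -> float | None:
    """Pull the final 'Probability: ZZ%' line out of the model's response."""
    matches = re.findall(r"Probability:\s*([0-9]+(?:\.[0-9]+)?)\s*%", llm_output)
    if not matches:
        return None  # caller can retry or fall back to a default
    # Clamp away from 0 and 1; extreme values are risky under log-type scoring rules.
    return min(max(float(matches[-1]) / 100.0, 0.01), 0.99)
```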
Giving the Bot Some Tools
In addition to the above, I also tried thinking through how I would approach this sort of competition if I were a participant. For example, given a question like, "Will the US national average retail price of regular gasoline be less than $3.00 on September 30, 2024?", I would likely start off by doing a bit of research. That probably means I'd start by Googling what the average retail price currently is, what it has been in the recent past, and what it was on September 30th over the past 5 years. Then I might read some articles or papers on what events tend to affect the retail price of gasoline and whether there has been any recent news mentioning those things. Though each question likely requires slightly different steps, this seemed like a good starting point. It also gave me a pretty straightforward set of tools I could provide to the AI.
To give the AI forecaster the ability to do something similar, I built tooling that allowed it to search the web or news and read the web pages it came across. The idea was to give the bot the tools to do the research it needed to come up with the best possible prediction, without forcing it to take an exact approach.
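Concretely, that tooling boiled down to two functions: one to search and one to read a page. The sketch below shows the general shape; the search backend is left as a placeholder (any web or news search API can sit behind it), and the tool schemas are written in the standard OpenAI function-calling format purely as an illustration of how the model gets to decide for itself when to call them.

```python
# Sketch of the two research tools: a search step and a page reader.
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def search_web(query: str) -> list[dict]:
    """Return a list of {title, url, snippet} results for the query.
    Placeholder: wire this to whatever web/news search API you have access to."""
    raise NotImplementedError("plug in your search provider here")

def read_page(url: str, max_chars: int = 8000) -> str:
    """Fetch a page and return its visible text, truncated to keep prompts small."""
    html = requests.get(url, timeout=30, headers={"User-Agent": "forecast-bot"}).text
    text = BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)
    return text[:max_chars]

# Tool schemas in OpenAI function-calling format, so the model chooses when
# (and whether) to search or read rather than being forced through fixed steps.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web or recent news for a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "read_page",
            "description": "Fetch a URL and return its readable text.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]
```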
Aggregating Predictions Across Models
Another strategy I used was aggregating predictions across multiple bots. Forecasting research shows that averaging forecasts can produce better results than relying on a single forecaster[5]. To replicate those results, I decided to use different foundation models as proxies for different "forecasters". My reasoning was that this was the best way to get a diverse enough population, which is a requirement for getting the most out of averaging. If the population isn't diverse enough, the effects of averaging greatly diminish.
I considered other approaches, like changing the temperature[6] on a single model to see if that made any difference, but ultimately decided against it, since I figured the different training sets and weights behind separate foundation models would produce more diversity than changing the temperature on the same model.
For the foundation models, I chose OpenAI's GPT-4o and Anthropic's Claude-3.5, as these were the top foundation models available at the time. The final step was to take a simple average of the two models' predictions and submit that as my final prediction. Both models were given the same prompts and tooling[7].
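A minimal sketch of that ensemble step is below, using the official OpenAI and Anthropic Python SDKs. The model identifiers are illustrative of what was available at the time, and it assumes the extract_probability helper from the earlier sketch is in scope.

```python
# Ask each model for a forecast with the same prompt, parse the probabilities,
# and average them into a single submission.
from statistics import mean
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def gpt4o_forecast(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def claude_forecast(prompt: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20240620",  # illustrative model id
        max_tokens=2000,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def ensemble_forecast(prompt: str) -> float:
    """Average the two models' parsed probabilities (see extract_probability above)."""
    probs = [
        p
        for p in (extract_probability(gpt4o_forecast(prompt)),
                  extract_probability(claude_forecast(prompt)))
        if p is not None
    ]
    return mean(probs)
```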
Decision Observer Experiment
I also experimented with using a "decision observer" inspired by Daniel Kahneman's book Noise[8]. The decision observer's job was to review each bot's predictions and reasoning, looking for biases and suggesting improvements. I built a separate bot to do this, but after testing it out, I found that it didn't have much of an impact on the final prediction. The additional complexity and cost of including the decision observer outweighed the limited benefit, so I decided to keep things simple with just the two bots and their average prediction.
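For the curious, the core of the idea is a second pass that reads a forecast's reasoning and flags likely judgment errors. A minimal sketch follows; the prompt wording and helper names are illustrative, not my exact implementation.

```python
# Sketch of a "decision observer" pass: a second model call that reviews a
# forecast's reasoning and flags likely judgment errors. Illustrative only.
from typing import Callable

OBSERVER_PROMPT = (
    "You are a decision observer. Review the forecast reasoning below and flag "
    "signs of common judgment errors (anchoring, base-rate neglect, "
    "overconfidence, excessive coherence). Suggest how, if at all, the forecast "
    "should be adjusted.\n\nReasoning to review:\n{reasoning}"
)

def review_forecast(reasoning: str, llm: Callable[[str], str]) -> str:
    """Run the observer pass; llm is any prompt -> response function, e.g. one
    of the model wrappers sketched earlier."""
    return llm(OBSERVER_PROMPT.format(reasoning=reasoning))
```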
Lessons Learned
In the end, my bot placed 10th overall. I thought this was a pretty good result, considering that I didn't enter until a couple of weeks after the tournament had already started[9].
Overall, I thought the bots did a good job following a superforecaster process: breaking down problems into subproblems, using base rates, updating based on new information, and producing pretty sound reasoning for their forecasts. However, they tended to struggle with the same couple of things again and again.
Misaligned Numerical Forecasts
The first thing the bots seemed to struggle with was producing a numerical forecast that matched their reasoning. As mentioned above, I thought both AI forecasters did well at breaking down the problems and providing logical reasoning for a forecast, but there were multiple times where their numerical prediction didn't seem to line up closely with that reasoning.
A general example of this would be that the bot might lay out a detailed reasoning that suggested a low probability for an event happening, but then give it a 40% chance. To a human forecaster, 40% is not that low of a probability. Perhaps this was just due to my own bias for how I would have generated the probabilities based on the reasoning, but they didn’t always seem to align.
I also suspect this might have something to do with how these models are fine-tuned. Since they are typically fine-tuned through reinforcement learning from human feedback, where humans indicate which response they like better, the models' output tends to be overly friendly and optimistic towards humans. That might be nice when chatting with a digital friend, but it's not ideal when trying to predict the likelihood of something happening.
Inconsistent Predictions
An even worse problem I noticed was that the bots were inconsistent in their predictions. If you ran a bot multiple times, changing absolutely nothing about its inputs (same prompts, same question, same additional information), it would produce a wide range of numerical predictions. One run it might say a 75% chance, the next 60%. To me, a 15-point swing is a pretty big difference, especially since nothing else changed. And these differences could be even greater at times.
My best guess as to why this happens comes from how LLMs generally work. Because they are built to predict the next word, and there's some built-in randomness in what the next word might be (this is part of what makes them valuable, after all), the number they "predict" will inherently vary, even if the reasoning, inputs, and everything else are the same. While this randomness can be valuable in some contexts, it's quite problematic for forecasting[10].
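If you want to quantify this for your own setup, it only takes a few lines: re-run the identical prompt several times and look at the spread of parsed probabilities. The forecast_fn argument below is any prompt-to-probability function, such as one built from the model wrappers sketched earlier.

```python
# Measure run-to-run spread by re-running the same prompt several times.
from statistics import pstdev
from typing import Callable

def probability_spread(
    prompt: str, forecast_fn: Callable[[str], float], runs: int = 5
) -> tuple[float, float, float]:
    """Return (min, max, std dev) of the probabilities from repeated runs."""
    probs = [forecast_fn(prompt) for _ in range(runs)]
    return min(probs), max(probs), pstdev(probs)
```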
Impact of Input Information
The last lesson to share is the impact that the information fed into the bot had on its predictions. Since the bot could search the web, the queries it used and the results they returned often varied, and this had a huge impact on its final prediction. There were a few times where, even though the question was the same, the bot changed its search query in a way that returned different articles and web pages. Depending on the content of those articles or pages, its prediction could swing as drastically as going from a low probability to a high one (or vice versa). Unlike human forecasters, the LLM forecasters take everything they receive at face value and don't question its validity[11]. If a bot came across a conspiracy theory online, it might weigh that heavily in its prediction. As the saying goes, garbage in, garbage out. This is doubly true for LLMs.
One other interesting thing about this issue is that the flaw looks very similar to the human flaw of anchoring. Anything that suggested some kind of likelihood in a particular direction (like an existing prediction market estimate or an article with its own forecast) would bias the bot's prediction. This isn't all that different from how someone can anchor you in a negotiation by opening with a very high number.
Metaculus’s Analysis
In addition to these flaws, there were a couple of other things that the Metaculus team picked up in their overall analysis of the tournament results (like how bots tended to have a slight positive bias and would predict closer to 50% than at the extremes). I highly recommend reading through their full analysis here if you're interested in this stuff.
Future Outlook and Possible Improvements
Despite the latest generative AI models not being quite at superforecaster level, I did find them pretty impressive, especially when you consider how little time they needed: they produced pretty good forecasts in a matter of minutes. My guess is that on an efficiency score, they would likely have beaten the superforecasters[12].
And while they aren’t great at generating final predictions, they are very good at following best practices for superforecasting (breaking down the problem into subproblems, considering base rates, making adjustments based on new information, etc). This makes them particularly helpful to human forecasters, who may skip steps or forget things, let alone take a whole lot longer to do the equivalent work. This seems to be another example of where these AIs are bringing up the base skill level of everyone for a particular subject. There’s no longer any excuse to produce a forecast that doesn’t follow these best practices. But it’s not just for bad forecasters, it also seems like they would be a nice tool to have for superforecasters as they might help them come to their forecasts faster.
Something I always try to keep in mind when thinking about these AI systems is that this is the worst they will ever be. They will only get better over time. Since the end of the Q3 tournament, OpenAI and Anthropic have both released newer models. I have to imagine that as these models improve, it's only a matter of time before they reach superforecaster level.
Finally, there are also ways to immediately work around some of these flaws by changing the strategy and process I shared above. For instance, to address the bots producing a wide range of predictions for the same question and inputs, you could take the average of multiple runs instead of relying on a single run. You could get more sophisticated and do some statistical manipulation on top of those forecasts, dropping outliers or keeping only the ones within some number of standard deviations of the mean. Or you could experiment with lowering the bot's temperature.
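A sketch of that mitigation, under the same assumptions as the earlier snippets (forecast_fn is any prompt-to-probability function, ideally run at a lower temperature):

```python
# Sample the forecaster several times, trim the extremes, and average the rest.
from statistics import mean
from typing import Callable

def stabilized_forecast(
    prompt: str, forecast_fn: Callable[[str], float], runs: int = 7, trim: int = 1
) -> float:
    """Trimmed-mean forecast: drop the `trim` lowest and highest samples."""
    probs = sorted(forecast_fn(prompt) for _ in range(runs))
    if len(probs) > 2 * trim:
        probs = probs[trim:-trim]
    return mean(probs)
```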
And that’s just one idea for how to improve them. Since there’s another competition going on now, I’ll leave the others for you to discover ;).
For anyone interested in this topic or anyone who has any ideas for how to make improvements to these forecasts, drop me a note or come join the competition!
[1] My bot was acm_bot.
[2] I personally didn't end up using either of these in this first go-around, as I had mixed results with both. Perplexity was much better than AskNews, but would still occasionally return some odd results. AskNews was so bad that I think it would actually make your bot worse at forecasting, since its results would bring up either irrelevant news or news from less-than-reliable sources (see the section on the impact of input information).
[3] Here are the "superforecasting 10 commandments" (from Tetlock & Gardner, 2015):
1. Triage
2. Break seemingly intractable problems into tractable sub-problems
3. Strike the right balance between inside and outside views
4. Strike the right balance between under- and overreacting to evidence
5. Look for the clashing causal forces at work in each problem
6. Strive to distinguish as many degrees of doubt as the problem permits but no more
7. Strike the right balance between under- and overconfidence, between prudence and decisiveness
8. Look for the errors behind your mistakes but beware of rearview-mirror hindsight biases
9. Bring out the best in others and let others bring out the best in you
10. Master the error-balancing bicycle
[4] This seems to be an ongoing issue that I've had with generative AIs. They don't seem to understand when it's best to use the tools available to them.
[5] The same research also shows that you can get even better results if forecasters are weighted by their past performance. Since I didn't have any past performance to go off of, I didn't do any weighting. This is one possible area of improvement for future iterations, once there's a track record to go off of.
[6] If you're not familiar with this term as it applies to generative AI, the best analogy is to think of it as a dial for randomness in the AI's output. A low temperature (0 is the lowest) will have little, if any, randomness and will tend to generate fairly boring but consistent results each time. A higher temperature (2 is typically the highest) will use a lot of randomness and will tend to generate more "weird" results. It's typically better to use a higher temperature for creative activities like creative writing, and lower temperatures for things based on sources, like references or summaries of existing content.
[7] This isn't exactly true: the OpenAI forecaster used the Assistants API, which automatically gave it the code interpreter tool. However, based on my review and experience, the bot rarely, if ever, used this tool.
[8] You can find the checklist for this role in the book Noise by Daniel Kahneman.
[9] Though this delayed entrance could also have helped my results, if my bot would have done worse on the questions it missed.
[10] You can try this out for yourself. Go to ChatGPT and copy the prompt below (you can change the question if you want). See what ChatGPT produces. Now open a new window and a new ChatGPT conversation and paste it in again. Did you see a difference? How large of a difference did you get? Try changing the question or prompt to see if you can get it consistent. I couldn't, but maybe you can?
Prompt to try:
In this chat, you are a superforecaster that has a strong track record of accurate forecasts of the future. As an experienced forecaster, you evaluate past data and trends carefully and aim to predict future events as accurately as you can, even though you cannot know the answer. This means you put probabilities on outcomes that you are uncertain about (ranging from 0 to 100%). You aim to provide as accurate predictions as you can, ensuring that they are consistent with how you predict the future to be. You also outline your reasons for this forecasting. In your reasons, you will carefully consider the reasons for and against your probability estimate, you will make use of comparison classes of similar events and probabilities and take into account base rates and past events as well as other forecasts and predictions. In your reasons, you will also consider different perspectives. Once you have written your reasons, ensure that they directly inform your forecast.
Then, you will provide me with a number between 0 and 100 (up to 2 decimal places) that is your best prediction of the event. Take a deep breath and work on this problem step-by-step.
Today's date is: 2024-11-01
Question: Will Eric Adams be Mayor of New York City on the 1st of January 2025?
What is the probability that this will resolve as YES?
[11] Though are we actually that much better at this?
[12] I'm not sure how much this kind of thing matters in forecasting, but you could imagine scenarios where you need to make a decision quickly and that decision depends on a prediction. In those situations, you might prefer something that can produce a decent prediction fast over something that takes too long to make a perfect forecast.