
The AI Wolf of Wall Street: o3-mini earns a staggering 9-fold return on "divine bets", while DeepSeek R1 is the most unconventional

新智元, 2025-08-18 14:57
China's men's basketball team lost the Asian Cup final by one point! Could AI have predicted that?

Can AI predict the future like the Oracle in science fiction movies? A brand-new benchmark called "Prophet Arena" evaluates the "prophetic" ability of AI by having it predict real-world events.

Can AI predict the future?

In "The Matrix", the Oracle could predict Neo's future.

AI systems such as ChatGPT can "predict the next token" based on past corpora.

The question then is, can AI, like the Oracle, find clues from the world's messy information and accurately predict the future?

For example:

Will AI regulation become federal law this year?

Who will win in a Major League Soccer game in the US?

Who will be the NBA champion this year?

In last night's men's basketball Asian Cup championship final, the Chinese men's basketball team lost to Australia by one point, but it was still the best result in the past decade!

I believe most people wouldn't have guessed this score. So, can AI predict it in advance based on the Chinese team's previous performance?

Furthermore, can AI, like Laplace's demon, accurately predict everything in the future after obtaining all the information about the current world?

If such an intellect knew the positions and velocities of all particles in the universe at a given moment and fully understood the laws of nature, it could calculate everything in the past and precisely predict everything in the future.

Prophet Arena, introduced today, is a benchmark that evaluates the predictive intelligence of AI systems through real-world prediction tasks that are updated in real time.

By combining market consensus, automated prediction, information organization, and community insights, it aims to build a stronger overall prediction ability.

To put it simply, Prophet Arena is unique as a benchmark test:

It tests prediction ability: This is an advanced form of intelligence that requires comprehensive understanding and reasoning abilities.

It is designed for "human-machine collaboration": You can provide clues to AI and see how its predictions change; AI will also tell you its thinking process.

It won't overfit, and the data will never be outdated: Because future events are always brand-new test questions.

It faces the real world: AI's predictions are directly linked to real betting decisions. Well-performing models can really make money in the virtual market.

By relying on real-time prediction of market events, Prophet Arena has for the first time established a dynamic benchmark that cannot be "memorized for exams".

It comprehensively measures AI's performance in reasoning under uncertainty, information integration, probability prediction, and real-world returns.

Even Noam Brown, the head of AI reasoning research at OpenAI, remarked that prediction has been a uniquely human ability, and that AI is now finally starting to get involved.

Rules of the Arena Unveiled

In Prophet Arena, AI models have to answer a simple yet fundamental question:

Can they predict events that haven't happened in the real world?

Prophet Arena selects popular, diverse, and recurring real-world events from prediction market platforms like Kalshi and Polymarket as test questions.

Kalshi is a US financial exchange and prediction market platform. It is the first US exchange regulated by the US Commodity Futures Trading Commission (CFTC) that focuses on trading "event outcomes".

Prediction topics related to AI on Polymarket

The entire competition process is divided into three steps:

1. Intelligence Gathering

Like detectives, AI models use search engines to collect news reports about an event and compile them into a concise "intelligence briefing". The briefing also includes the current market price, which can be regarded as the collective wisdom of the crowd.

2. Submitting Predictions

After receiving the same intelligence, each AI model has to submit a detailed "prediction report": a probability distribution over all possible outcomes, together with a lengthy explanation of its reasoning.

3. Result Announcement and Scoring

When the event ends and the result is announced, a set of professional indicators is used to evaluate how accurate each AI's prediction was, and the results are then updated on a real-time leaderboard.
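The "prediction report" from step 2 can be pictured as a small data structure. The sketch below is illustrative only: the class name `PredictionReport`, its fields, and the tolerance are assumptions made for this example, not Prophet Arena's actual schema.

```python
from dataclasses import dataclass
import math


@dataclass
class PredictionReport:
    """Hypothetical container for one model's forecast on one event."""
    event: str
    probabilities: dict[str, float]  # outcome name -> probability
    rationale: str

    def validate(self, tol: float = 1e-6) -> None:
        """Raise ValueError unless the probabilities form a valid distribution."""
        if not self.probabilities:
            raise ValueError("at least one outcome is required")
        if any(p < 0.0 or p > 1.0 for p in self.probabilities.values()):
            raise ValueError("each probability must lie in [0, 1]")
        total = sum(self.probabilities.values())
        if not math.isclose(total, 1.0, abs_tol=tol):
            raise ValueError(f"probabilities sum to {total}, expected 1")


# Illustrative values only, not real submissions.
report = PredictionReport(
    event="Will AI regulation become federal law before 2026?",
    probabilities={"Yes": 0.35, "No": 0.65},
    rationale="The legislative process is complex and slow.",
)
report.validate()  # passes silently for a well-formed distribution
```

Checking that the probabilities sum to 1 before scoring keeps a malformed report from distorting the accuracy and betting indicators.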

Setting of Prediction Indicators

The leaderboard mainly looks at two indicators. One is the Brier score, which measures accuracy and calibration; it is reported here so that higher is better, whereas the conventional Brier score is a loss for which lower is better. The other is the average return from simulated real-world betting (to see who can make money).
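As a rough illustration of the first indicator, the sketch below computes the conventional Brier loss and a "higher is better" variant. The flip `1 - loss` and the averaging over outcomes are assumptions made for this example; the article does not spell out the exact formula the leaderboard uses, and the function names are hypothetical.

```python
def brier_loss(probs, outcome):
    """Conventional Brier loss: mean squared error against the realized
    outcome (1 for the outcome that happened, 0 otherwise). Lower is better."""
    if outcome not in probs:
        raise KeyError(f"unknown outcome: {outcome}")
    return sum(
        (p - (1.0 if name == outcome else 0.0)) ** 2
        for name, p in probs.items()
    ) / len(probs)


def flipped_brier(probs, outcome):
    """Assumed leaderboard-style score: 1 - loss, so that higher is better."""
    return 1.0 - brier_loss(probs, outcome)


confident_and_right = {"Yes": 0.9, "No": 0.1}
confident_and_wrong = {"Yes": 0.1, "No": 0.9}
print(flipped_brier(confident_and_right, "Yes"))  # close to 0.99
print(flipped_brier(confident_and_wrong, "Yes"))  # close to 0.19
```

A confident forecast is rewarded when it is right and penalized heavily when it is wrong, which is why the score captures calibration as well as accuracy.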

In addition to the above two core indicators, Prophet Arena also adopts advanced evaluation methods inspired by statistics and psychometric modeling, such as Item Response Theory (IRT) and the generalized Bradley-Terry (BT) model.

These supplementary indicators enrich the leaderboard and enable a more detailed and comprehensive understanding of prediction intelligence.

Report Card of AI "Prophets" Released

Secret Discoveries of Prophet

You might think that the more accurate the prediction, the more money you'll make, right?

Most of the time, this is true, but an interesting "reversal zone" was found in the data.

Secret 1: The most profitable predictions aren't necessarily the most accurate

Many predictions with astonishing returns came from the range where the Brier score is not high (0.3 to 0.5).

Digging deeper, it was found that many of them came from underdog sports events.

For example, in a Wimbledon tennis match, the market generally believed that player Paul had an 84% chance of winning before the game, and this probability even climbed to 95% just before the start.

But many AI models were more conservative than the market and only gave him about an 80% chance of winning.

It was this slight difference that led the model to conclude that betting on the opponent, Orfner, offered better value for money.

As a result, Orfner really did win unexpectedly! This bet brought nearly a 6-fold return.

You see, the AI didn't accurately predict the winner, so its accuracy score (Brier score) was just average.

But it keenly noticed the "pricing deviation" in the market and made a high-return choice.

This shows that being an accurate prophet and being a profitable investor are two different skills.
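The logic of the Wimbledon example can be sketched as a simple expected-value comparison. This is a minimal illustration under stated assumptions: contracts cost the market probability and pay 1 if the outcome occurs, fees are ignored, and Orfner's price of 0.16 is taken as the complement of Paul's 84%. The function names are hypothetical, and the benchmark's actual betting rule is not described in the article.

```python
def expected_return(model_prob, market_price):
    """Expected profit per dollar staked on a contract that costs
    `market_price` and pays 1 if the outcome occurs, as judged by the model."""
    if not 0.0 < market_price <= 1.0:
        raise ValueError("market_price must be in (0, 1]")
    return model_prob / market_price - 1.0


def best_bet(model_probs, market_prices):
    """Return the outcome with the highest expected return, plus all edges."""
    edges = {
        name: expected_return(model_probs[name], market_prices[name])
        for name in model_probs
    }
    return max(edges, key=edges.get), edges


market = {"Paul": 0.84, "Orfner": 0.16}  # assumed prices (fees ignored)
model = {"Paul": 0.80, "Orfner": 0.20}   # model is slightly less sure of Paul
choice, edges = best_bet(model, market)
print(choice)  # Orfner
```

Even though the model still considers Paul the likely winner, its 20% estimate for Orfner exceeds the assumed 16% price, so the underdog is the only bet with a positive expected return. Under these assumed prices a winning one-dollar stake on Orfner pays out 1 / 0.16 = 6.25 dollars, which is roughly the scale of the return described above.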

To explore this point, the composition of models in each Brier score range was examined, with each model represented by a different color.

A direct observation is that there are usually more predictions in the higher Brier score range.

The vast majority of LLMs tend to align with mainstream information when making predictions, so most predictions are concentrated in the high Brier score range.

Secret 2: AI has a "personality", either radical or conservative

Facing the same information, different AI models will show completely different "personalities".

For example, for the event "Will AI regulation become federal law before 2026?", the market put the probability at only 25%.

But the models were much more radical than the market.

Representative of the radical faction, Qwen3: Seeing that various bills were advancing, it thought the momentum was strong and gave a sky-high probability of 75%.

Representative of the conservative faction, Llama 4 Maverick: It saw the same information but thought the legislative process was complex and slow, so it gave a probability of only 35%, slightly higher than the market.

And GPT-4.1 was in between, giving a probability of 60%.

This is so interesting!

AI's predictions are not random. They reflect structured reasoning and distinct risk preferences, just as human experts can hold different opinions.

Secret 3: The secret to AI's victory lies in "winning big" rather than "winning often"

Among these models, which one can make the most money?

On the leaderboard, OpenAI's