I've been tracking AI IQs for about a year on my site TrackingAI.org, which gives me a clear view of just how rapidly AI is progressing.
ChatGPT’s latest paid model, o3, was released this week. It scored a stunning 136 on the Norway Mensa IQ test:
An IQ of 136 is in the top 1% for humans.
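(For those who want to check the percentile math: IQ is conventionally scored on a normal distribution with a mean of 100 and a standard deviation of 15, so any score converts directly to a percentile. Here's a quick sketch, using scipy as one convenient implementation choice:)

```python
# Convert IQ scores to percentiles, assuming the standard IQ scaling:
# normally distributed with mean 100 and standard deviation 15.
from scipy.stats import norm

def iq_percentile(iq, mean=100.0, sd=15.0):
    """Fraction of the population scoring below the given IQ."""
    return norm.cdf((iq - mean) / sd)

for iq in (136, 116, 104):
    below = iq_percentile(iq)
    print(f"IQ {iq}: above {below:.1%} of people (top {1 - below:.1%})")

# IQ 136 -> top ~0.8% (roughly "top 1%")
# IQ 116 -> top ~14.3% (roughly "top 15%")
```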
In contrast, here's how the leading AIs scored on the exact same test just 11 months ago, in May:
Incredible progress.
The one caveat is that the results are based on the Norway Mensa IQ test, which is public and whose answers exist online (though they are not particularly easy to find).
To deal with that, last year I worked with Mensa member Jurij to create an offline-only test from scratch, so it would be out of reach of AI training data. I then made the scoring of the two tests roughly equivalent, determining question difficulty by asking Maximum Truth readers to take questions from both quizzes.
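(For the technically curious, here is a minimal sketch of one standard approach to this kind of equating, equipercentile matching: a raw score on one test is mapped to the other test's scale by matching percentile ranks within the same sample of test-takers. The function and reader scores below are purely illustrative, not my exact procedure or data.)

```python
# A minimal sketch of equipercentile equating between two tests.
# All scores below are made up for illustration.
import numpy as np

def equate_scores(raw_offline, offline_sample, mensa_sample):
    """Map a raw score on the offline test to the Norway Mensa scale
    by matching percentile ranks in the same reader sample."""
    # Percentile rank of this score among readers of the offline test
    pct = np.mean(np.asarray(offline_sample) <= raw_offline)
    # The Mensa-test raw score sitting at that same percentile
    return float(np.quantile(mensa_sample, pct))

# Hypothetical raw scores (number correct) from readers who took both tests
offline_scores = [12, 15, 18, 20, 22, 24, 25, 27, 29, 31]
mensa_scores   = [14, 17, 19, 22, 25, 26, 28, 30, 32, 33]

# A raw 24 on the offline test lands near 27 on the Mensa test's scale
print(equate_scores(24, offline_scores, mensa_scores))
```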
Here's how AIs do on the quiz that’s outside of AI datasets:
Now o3 scores an IQ of 116, putting it in the top 15% of humans. The median Maximum Truth reader, for comparison, scored 104.
And here are the leading AIs on the exact same test 11 months ago, back when we were better at IQ tests than AIs:
Again, this is simply incredible progress for 11 months.
I sent this info to Tyler Cowen of Marginal Revolution, and he further noted, “of course those tests underweight the value of extreme breadth, which of course o3 has.”
That is also correct.
Here is how I would put the findings:
ChatGPT’s o3 model is something like a person with 116 IQ when reasoning from scratch, but furthermore is like such a person with the entire world’s knowledge in its memory.
On questions with already established answers, o3 appears more like a person with a 136 IQ, but that is largely because it has the entire world's knowledge in its memory, including the work of the smartest humans, even if it could not have solved such questions from scratch.
Conclusions
One still hears people say, “AIs can regurgitate knowledge, but they can’t think.”
That’s wrong.
AIs don’t feel things, but they can think, in that they can solve never-before-seen problems by deducing complex patterns.
I have more to report, including on AI vision progress (which will allow AIs to interact with the physical world) and on AI politics, but that will be for a future post.
Just as the internet was mostly ignored in 1995, AI even now remains dramatically under-covered considering the impact it will soon have.
The God Machine is on its way.