The God Machine is on its way
Better not overfeed it with orichalcum. Remember Plato's tenfold error!
Really appreciate this kind of longitudinal benchmarking—especially when the same test is repeated over time. That consistency is rare and valuable.
That said, it’s increasingly important to ask what we’re actually measuring as models start outperforming humans on IQ tests.
We’re still rewarding fluency under constraint—token-level pattern extension—not cognitive flexibility or causal reasoning. Rising scores may reflect emergent capabilities, but they also reveal growing prompt sensitivity and structural scaffolding around task framing.
I’ve been exploring a modular prompt framework (Lexome) to help separate prompt design from true model capability. Benchmarks like these are great, but we also need tests that hold structure constant to see what’s really improving; a rough sketch of what I mean is appended below.
Would love to hear from others thinking about uncertainty-aware evaluation or cognitive framing.
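To make the "hold structure constant" idea concrete, here is a minimal sketch (my own illustration, not the Lexome framework itself): one fixed prompt template is reused for every model and every test item, so score differences reflect the model rather than the prompt wording. The `query_model` callable, `PROMPT_TEMPLATE`, and the sample item are hypothetical stand-ins rather than any real API.

```python
# Minimal sketch: hold the prompt template constant across models so that score
# differences reflect the model, not the prompt wording.
# `query_model` is a hypothetical stand-in for whatever API call you actually use.

from typing import Callable, Dict, List

# One fixed template, reused for every model and every item.
PROMPT_TEMPLATE = (
    "Answer with a single letter (A-D) and nothing else.\n\n"
    "Question: {question}\nOptions: {options}"
)

def evaluate(query_model: Callable[[str], str], items: List[Dict[str, str]]) -> float:
    """Return one model's accuracy on the item set, using the fixed template."""
    correct = 0
    for item in items:
        prompt = PROMPT_TEMPLATE.format(question=item["question"], options=item["options"])
        reply = query_model(prompt).strip().upper()
        correct += reply[:1] == item["answer"]
    return correct / len(items)

if __name__ == "__main__":
    # Toy item and dummy "model" so the sketch runs end to end; swap in real
    # API calls (one per model under test) without touching the template.
    items = [
        {"question": "2, 4, 8, 16, ...?", "options": "A) 18  B) 24  C) 32  D) 34", "answer": "C"},
    ]
    always_c = lambda prompt: "C"
    print(f"accuracy: {evaluate(always_c, items):.2f}")
```

Swapping only the `query_model` argument per model keeps the prompt scaffolding identical across evaluations, which is the point of the exercise.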
There's a medical reasoning case I came up with and have been giving to each new GPT (and to Claude and Gemini once in a while) to see how their reasoning abilities develop. o3 is clearly a big deal and blew the others out of the park. We're in for some interesting times.
I enjoy your site. One comment on its design: change the color of the text currently displayed in light yellow, which makes visualization difficult on most devices.
Thanks. Are you referring to Maximum Truth, or TrackingAI.org, or both?
Hi Maxim, another question - have you (further) updated your personal p(doom) recently?
Awesome article as usual. I'm always looking forward to your updates about the IQs of AI models, and I can't wait for your next post about AI vision progress. I think it's only a matter of time before tech like self-driving cars and smartglasses really take off. The fundamentals are already here.
I might argue with your conclusion. Are you still submitting the IQ tests as text descriptions? I would be interested in your take on this article:
https://adamkarvonen.github.io/machine_learning/2025/04/13/llm-manufacturing-eval.html
All his general points and detailed analysis seem reasonable to me. (One of my hobbies is machining small parts.) Like the author, I lean heavily on spatial reasoning and visualization rather than serial skills (verbal, coding, music, etc.).
I grant that AI is hugely successful with 1-D problems. But I think it's still in its infancy in the 3-D world.
So what now then?
> AIs don’t feel things
Citation needed
We know where human feelings come from (e.g., we know the sensations that dopamine produces), and we know we didn't program that into LLMs; feelings are specific artifacts of biological evolution and biological needs.
LLMs don't have feelings in the same way humans do, but does that rule out LLMs having evolved their own analogue of feelings under training pressure? I don't think we can rule it out.