39 Comments
Mar 5 · Liked by Maxim Lott

Sorry if the question is stupid, but did you use Claude 3 Opus or Sonnet (https://twitter.com/anthropicai/status/1764653830468428150)?

author

Not a stupid question: Opus. Which is supposed to be the smartest one.

It's important to keep in mind the possibility of training set contamination when doing tests like this. For example, the authors of this study (https://arxiv.org/abs/2402.19450) replaced problems from a math benchmark with similar problems they made "from scratch" and found that the models were able to solve 50-80 percent fewer problems. The authors argue this was because copies of the problems had been included (probably accidentally) in the training set, and as a result the models simply memorized the answers to the questions rather than learning generalized principles that would allow them to answer other questions. Something similar could be going on with these IQ tests, in which case you wouldn't be testing what you think you're testing.

author

Thanks. Agreed that this is a potential problem, though even if it is the case, the analysis shows two things:

-- More modern models are dramatically better at querying the appropriate section of their "memory" (or they have more relevant memory to query.)

-- The top AIs are still getting the easy questions right, as opposed to the harder ones. I think this is some evidence that the questions are not in their training data, though it's far from proof. But what it does show conclusively is that the AIs don't have a whole answer key readily available in their memory.

I do want to get my hands on an offline-only IQ test, though. Maybe a library somewhere...

The same was true of the MATH benchmark in that study I linked to before. GPT-4 went from solving ~25 percent of static MATH problems to ~12 percent of problems that were generated dynamically. I think you are measuring something real as far as differences between the various LLMs go.

But it strikes me as a huge leap from that to "simple extrapolation of current growth rates suggested that Claude-6 would get all the IQ questions right, and be smarter than just about everyone, in about 4 - 10 years." An LLM with an IQ of 150 (as measured by possibly-memorized IQ tests) might be very different from a human being with an IQ of 150, because human beings have much less capacity for rote memorization than an LLM.

author

Maybe so. I’ll report back when I get an offline test.

Just want to voice how necessary this is. We need to get past the standard "this is just regurgitating training data" arguments so we can get to the meat of what to do about these new intelligences we're making.

Mar 18 · Liked by Maxim Lott

I run the leaked version of Mistral Medium on a dual GPU computer at home. Would love to test it because it is the best model that can be run locally on a high-end consumer grade PC.

author

Nice! I could send you the text questions I used. Alternatively, if you can wait a couple of weeks, Mistral is on my list to test as well.

Mar 7 · Liked by Maxim Lott

Thanks for sharing such an interesting topic!

Mar 6 · Liked by Maxim Lott

GPT-4 Turbo has significantly improved intelligence over GPT-4. You might want to consider testing that.

author

Good to know, thank you.

If you want, I can make a visual IQ test that would be transparent and fair for AIs. I have made verbal, trivia, and matrix IQ tests: hundreds of items, tested on thousands of people. It would be very easy to make a test that all AIs can see visually, since they seem to have issues understanding rows of answer options. The Norway Mensa test is decent, but very imprecise: it only has a few items, they are way too hard, and you don't have enough time. It's based on Raven's from 1932 plus a German speed test. The Raven's was not timed initially, but is now way too easy because of the Flynn effect. The speed test is great, but imprecise at IQ scores above 120.

Norway Mensa works via the timer. Yet AI doesn't really need time to solve things, just data. So you can't compare it directly to individual humans on a test made for a specific time format, maybe. But at any rate there are a ton of ideas to explore, and untimed tests too. I suggest making a verbal test, trivia test, math test, logic test, and matrix test, and then comparing IQ scores on all of them to see what IQ AIs have in all aspects of testing. If you look at the most popular test in the world, the WAIS, for example, it measures subfactors like verbal IQ, math skills, and visual perception, not just the overall IQ. And AI should basically ace any verbal IQ test with clear answer choices. The issue is that it's actually at something like 50 IQ right now if you don't give it verbal instructions. By mixing verbal and non-verbal items, you don't clearly know what part of intelligence you are measuring.

author

Jurij, you can email me, maxim.lott@gmail.com

It would be great to discuss creating some new questions for these AIs.

Best,

Maxim

author

That would be quite interesting! I would be happy to run it if you make such a matrix test.

I think the most important thing for AI visual tests is that the shapes are very VISIBLE. For example, in my last post, when I gave AIs the visual test, they were relatively likely to get question 13 on Norway Mensa right: while its logic pattern isn't ultra-simple, all the shapes and colors are unambiguous (big triangles, circles, squares).

Seems very sensible. If we care about artificial general intelligence and we can't be sure the g-loading of tests is similar to what it is for humans (e.g. verbal is going to be lower: https://emilkirkegaard.dk/en/2023/05/which-test-has-the-highest-g-loading/), then "making a verbal test, trivia test, math test, logic test, and matrix test" could pin down the best test for g.

Verbal tests measure crystallized intelligence: something you have learned over many years, which is why processing power and speed are not measured or needed here. Many old people have very high crystallized intelligence, but their fluid intelligence has declined with the years. So you are basically measuring how much they have learned over their life, and you don't use any timer for it. This is the only kind of test AIs can pass. But just like old people, an AI may, for example, not be able to learn much new stuff, and may get confused by technology or too many variables. And anything speed-related is not even relevant for it.

That's why I think AIs' fluid intelligence is in the mentally retarded range. It's a savant, just like Kim Peek, whom the movie Rain Man is based on. But obviously no one would have put Kim Peek in charge of a giant tech company, or anything close to that. Which reveals why we measure different factors of intelligence. Often for jobs you get a simple IQ test, but it measures everything in one. They would never be interested in measuring your verbal IQ only, unless you genuinely have to work alone on writing. And even then it would imply reading the test yourself: opening it, reading the instructions, filling out all fields carefully. AIs can't do that. It's like asking an old man about words without seeing whether he can actually even take a test. Of course, in this case the AIs did solve verbal puzzles. But you can't know what intelligence they used for it, since they have so much data saved that you can't know what data they are using at any one time.

That's probably true, but AIs have so much "crystallized intelligence" that most of the time they can mimic our "fluid intelligence".

Good post.

I'm not sure investing in AI companies is a good idea, though; it's probably better to hedge your bets and make investments assuming AI doesn't amount to much, since the world where it does will be so good (or so destroyed) anyway.

author

Plausible! I guess it matters just how “post scarcity” one thinks a world with super-powerful friendly AI would be

Mar 5 · Liked by Maxim Lott

Thanks for all the work. This is the most fascinating Substack piece I've read in a long time. Please continue this work.

For what it's worth, the pattern I see in Exercise #27:

• the scissors rotate CW by 90° (left to right) in each row;

• the scissors rotate CCW by 90° (top to bottom) in each column; and

• obviously, the answer has to be wide open scissors.

Unfortunately, I didn't see this pattern initially (I picked B).

author

Thank you!

Mar 5 · Liked by Maxim Lott

Any particular reason you didn't test "Le Chat Mistral" (https://mistral.ai/)?

It solves #2 (2/2) and fails on #27 (0/2) (using the instructions you provided).

author
Mar 5 · edited Mar 5

From my site TrackingAI.org, I've noticed it misunderstands a relatively high proportion of questions (compared to other AIs.) Obviously that's not a reason *not* to rank it, but it's why I deprioritized it here for the sake of time. But in the future, I'll make it more comprehensive.

Maybe it'd do better in French?

author

Happy to share the full verbalized set of questions with you, or anyone, if you'd like to replicate it.

That'd be great!

Mar 5 · Liked by Maxim Lott

This is very valuable, thank you.

author

Thank you!

Can you send me the text questions you used, via Pastebin? I am going to try this with GPT-4o.

A question. Since you did each test twice, did you notice any differences in the answers given by each model? That is: are models answering in a stable way, or they provide very different answers each time? I think this is something that needs to be tested, beyond the number of correct answers.

author

Check out the blue bar charts further down in the post. You can see which questions it got right twice vs. once. There is at least some instability, with very different answers, but not too much (at least in Claude 3; Claude 2 had more).

Thank you. I always struggle with the randomness in some of these models

Hi Maxim, I am wondering where you obtained the raw score to FSIQ norms? They look kinda iffy to me.

author

Hey -- for scores over the threshold, I just plugged the AI answers into Mensa Norway's quiz and used the score it gave.

13 questions right gets you 85, for example, and 14 gets you 88. For every question short of the 13 needed for 85, I subtracted 3 points as an estimate.
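
In case anyone wants to replicate the below-threshold part, here's a minimal Python sketch of that estimate; the function name estimate_iq is just illustrative, and it only covers the linear extrapolation below Mensa Norway's own norm table.

```python
# Minimal sketch of the below-threshold estimate described above.
# Assumption: Mensa Norway's own norms apply at 13+ correct
# (13 -> 85, 14 -> 88, ...); below 13 correct, 3 IQ points are
# subtracted per missing question.

def estimate_iq(questions_right: float) -> float:
    """Rough IQ estimate: 85 at 13 correct, +/- 3 points per question."""
    return 85 + 3 * (questions_right - 13)

if __name__ == "__main__":
    for score in (7.5, 10.5, 13, 14):
        print(f"{score} right -> ~{estimate_iq(score):.1f} IQ")
```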

Could you explain how your "questions right" column has decimal numbers like 7.5, 10.5, and 18.5?

author

Because it’s the average across two test administrations.

When describing your initial labeling mistake in exercise 2, you made another mistake:

"In the initial version of my last post, I correctly described the right answer (“a square with a triangle in it”)"

should be "a square with a hollow diamond in it"; there are no triangles in the exercise.

Thanks for doing this. I would be interested in knowing how you extrapolated from the Claude data points to 145 IQ, because a rough linear fit got me the middle of 2025. Also, it would be interesting to see the extrapolation using old versions of ChatGPT (so that n > 3).

author

I get that because I'm taking into account that production time has increased: Claude-2 took only 4 months to train, but Claude-3 took 8 months.

So either the pattern is +4 months for each higher model, in which case you get two more versions, to Claude-5 and ~140 IQ in 12 + 16 = 28 months (summer 2026) ...

Or the pattern is doubling of time for each higher model, in which case you get there after 16 + 32 = 48 months (spring 2028.)
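
As a minimal sketch of those two timelines (the function and parameter names are just illustrative), assuming the Claude-3 cycle was 8 months and ~140 IQ arrives two versions later, with Claude-5:

```python
# Minimal sketch of the two extrapolation scenarios described above.
# Assumption: the Claude-3 cycle took 8 months, and ~140 IQ arrives
# two versions later, with Claude-5.

def months_to_claude_5(last_gap: float = 8.0, mode: str = "additive") -> float:
    """Sum the development gaps for the next two versions (Claude-4, Claude-5)."""
    total, gap = 0.0, last_gap
    for _ in range(2):  # two more versions
        gap = gap + 4 if mode == "additive" else gap * 2
        total += gap
    return total

print(months_to_claude_5(mode="additive"))  # 12 + 16 = 28 months (~summer 2026)
print(months_to_claude_5(mode="doubling"))  # 16 + 32 = 48 months (~spring 2028)
```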
