38 Comments
Mar 5 · Liked by Maxim Lott

Sorry if the question is stupid, but did you use Claude 3 Opus or Sonnet (https://twitter.com/anthropicai/status/1764653830468428150)?


It's important to keep in mind the possibility of training set contamination when doing tests like this. For example, the authors of this study (https://arxiv.org/abs/2402.19450) replaced problems from a math benchmark with similar problems they made "from scratch" and found that the models were able to solve 50-80 percent fewer problems. The authors argue this was because copies of the original problems had been included (probably accidentally) in the training set, and as a result the models simply memorized the answers rather than learning generalized principles that would let them answer other questions. Something similar could be going on with these IQ tests, in which case you wouldn't be testing what you think you're testing.
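A rough way to screen for this kind of contamination (a minimal sketch, not the study's actual method; the corpus, items, and n-gram length here are assumptions) is to check whether long word n-grams from a test item already appear verbatim in the training data:

```python
def ngrams(text: str, n: int = 8) -> set[str]:
    """Return the set of word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

# Hypothetical usage: a score near 1.0 suggests the item (or a near copy)
# was present in the training data.
# print(overlap_score(test_question, training_documents))
```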

Mar 18 · Liked by Maxim Lott

I run the leaked version of Mistral Medium on a dual-GPU computer at home. Would love to test it, because it is the best model that can be run locally on a high-end consumer-grade PC.

Mar 7 · Liked by Maxim Lott

Thanks for sharing such an interesting topic!

Mar 6 · Liked by Maxim Lott

GPT-4 Turbo has significantly improved intelligence over GPT-4. You might want to consider testing that.


If you want, I can make a visual IQ test that would be transparent and fair for AIs. I have made verbal, trivia, and matrix IQ tests: hundreds of items, tested on thousands of people. It would be very easy to make a test that all AIs can see visually, as they seem to have issues understanding answer option rows. The Norway Mensa test is decent but very imprecise: it only has a few items, they are way too hard, and you don't have enough time. It's based on Raven's from 1932 plus a German speed test. The Raven's was not timed initially, but is now way too easy after the Flynn effect. The speed test is great, but imprecise at IQ scores above 120.

The Norway Mensa test works via the timer, yet AI doesn't really need time to solve things, just data. So you maybe can't compare it directly to individual humans on a test made for a specific time format. But at any rate there are a ton of ideas to explore, and untimed tests too. I suggest making a verbal test, trivia test, math test, logic test, and matrix test, and then comparing IQ scores across all of them to see what IQ AIs have in each aspect of testing. If you look, for example, at the most popular test in the world, the WAIS, it measures subfactors like verbal IQ, math skills, and visual perception, not just overall IQ. And AI should basically ace any verbal IQ test with clear answer choices. The issue is that it's actually at around 50 IQ right now if you don't give it verbal instructions. By mixing verbal and non-verbal items, you don't clearly know what part of intelligence you are measuring.


Good post.

I'm not sure investing in AI companies is a good idea, though; it's probably better to hedge your bets and make investments assuming AI doesn't amount to much, since the world where it does is either so good or so destroyed anyway.

Mar 5 · Liked by Maxim Lott

Thanks for all the work. This is the most fascinating Substack piece I've read in a long time. Please continue this work.

For what it's worth, the pattern I see in Exercise #27:

• the scissors rotate CW by 90° (left to right) in each row;

• the scissors rotate CCW by 90° (top to bottom) in each column; and

• obviously, the answer has to be wide-open scissors.

Unfortunately, I didn't see this pattern initially (I picked B).
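The rotation rules above reduce to simple modular arithmetic (a toy sketch only; the puzzle is visual, so the base orientation and degree values here are hypothetical placeholders):

```python
# Toy model of the rotation pattern: orientation in degrees, measured counterclockwise.
# Moving one cell right subtracts 90 degrees (CW); moving one cell down adds 90 degrees (CCW).
BASE = 0  # hypothetical orientation of the top-left scissors

def orientation(row: int, col: int) -> int:
    return (BASE + 90 * row - 90 * col) % 360

for row in range(3):
    print([orientation(row, col) for col in range(3)])
# The missing bottom-right cell would have orientation(2, 2);
# openness (wide open vs. closed) follows the separate rule noted above.
```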

Mar 5 · Liked by Maxim Lott

Any particular reason you didn't test "Le Chat Mistral" (https://mistral.ai/)?

It solves #2 (2/2) and fails on #27 (0/2) (using the instructions you provided).

Mar 5 · Liked by Maxim Lott

This is very valuable, thank you.


A question: since you did each test twice, did you notice any differences in the answers given by each model? That is, are the models answering in a stable way, or do they provide very different answers each time? I think this is something that needs to be tested, beyond the number of correct answers.
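One simple way to quantify that (a rough sketch; query_model is a hypothetical stand-in, not the API call actually used in the post) is to re-ask each question several times and report the agreement rate:

```python
from collections import Counter

def query_model(question: str) -> str:
    """Hypothetical placeholder: replace with the actual model/API call used in the tests."""
    return "A"

def answer_stability(questions: list[str], trials: int = 5) -> float:
    """Average fraction of trials that agree with each question's most common answer."""
    scores = []
    for q in questions:
        answers = [query_model(q) for _ in range(trials)]
        most_common_count = Counter(answers).most_common(1)[0][1]
        scores.append(most_common_count / trials)
    return sum(scores) / len(scores)

# A value near 1.0 means the model gives the same answer to each question on every run.
```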


Hi Maxim, I am wondering where you obtained the raw-score-to-FSIQ norms? They look kinda iffy to me.


Could you explain why your "questions right" column has decimal numbers like 7.5, 10.5, and 18.5?


When describing your initial labeling mistake in exercise 2, you made another mistake. The sentence

"In the initial version of my last post, I correctly described the right answer (“a square with a triangle in it”)"

should instead say “a square with a hollow diamond in it”; there are no triangles in the exercise.


Thanks for doing this. I would be interested in knowing how you extrapolated from the Claude data points to 145 IQ, because a rough linear fit got me to the middle of 2025. Also, it would be interesting to see the extrapolation using old versions of ChatGPT (so that n > 3).
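For reference, a rough linear extrapolation like the one described could look something like this (a sketch only; the release dates and IQ scores below are hypothetical placeholders, not the post's actual data):

```python
import numpy as np

# Hypothetical (year, measured IQ) points standing in for successive model releases.
years = np.array([2022.5, 2023.2, 2024.2])
iqs = np.array([75.0, 85.0, 100.0])

slope, intercept = np.polyfit(years, iqs, 1)   # least-squares line: iq = slope * year + intercept
year_at_145 = (145 - intercept) / slope        # invert the fit to find when IQ 145 would be reached
print(f"IQ gain per year: {slope:.1f}; IQ 145 reached around {year_at_145:.1f}")
```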
