Sorry if the question is stupid, but did you use Claude 3 Opus or Sonnet (https://twitter.com/anthropicai/status/1764653830468428150)?
Not a stupid question: Opus, which is supposed to be the smartest one.
It's important to keep in mind the possibility of training set contamination when doing tests like this. For example, the authors of this study (https://arxiv.org/abs/2402.19450) replaced problems from a math benchmark with similar problems they made "from scratch" and found that the models were able to solve 50-80 percent fewer problems. The authors argue this was because copies of the problems had been included (probably accidentally) in the training set, and as a result the models simply memorized the answers to the questions rather than learning generalized principles that would allow them to answer other questions. Something similar could be going on with these IQ tests, in which case you wouldn't be testing what you think you're testing.
Thanks. Agreed that this is a potential problem, though even if it is the case, the analysis shows two things:
-- More modern models are dramatically better at querying the appropriate section of their "memory" (or they have more relevant memory to query.)
-- The top AIs are still getting the easy questions right, as opposed to the harder ones. I think this is some evidence that the questions are not in their training data, though it's far from proof. But what it does show conclusively is that the AIs don't have a whole answer key readily available in their memory.
I do want to get my hands on an offline-only IQ test, though. Maybe a library somewhere...
The same was true of the MATH benchmark in that study I linked to before. GPT-4 went from solving ~25 percent of static MATH problems to ~12 percent of problems that were generated dynamically. I think you are measuring something real as far as differences between the various LLMs go.
But it strikes me as a huge leap from that to "simple extrapolation of current growth rates suggested that Claude-6 would get all the IQ questions right, and be smarter than just about everyone, in about 4 - 10 years." An LLM with an IQ of 150 (as measured by possibly memorized IQ tests) might be very different from a human being with an IQ of 150, because human beings have much less capacity for rote memorization than an LLM.
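To make the contamination-check idea concrete, here is a toy sketch (not the paper's actual code; the problem template and the `ask_model` callable are placeholders I've invented for illustration). The point is that freshly randomized variants of a problem can't have been memorized, so any accuracy drop relative to the static, published version suggests contamination.

```python
import random

def make_variant():
    """Generate a fresh arithmetic word problem with randomized numbers,
    so this exact text cannot have appeared in any training set."""
    a = random.randint(12, 97)
    b = random.randint(3, 9)
    question = f"A crate holds {a} apples and {b} crates arrive. How many apples arrive in total?"
    return question, a * b

def variant_accuracy(ask_model, n=50):
    """ask_model is a placeholder callable that sends a prompt to a model
    and returns its numeric answer. Compare this accuracy against accuracy
    on the static benchmark version of the same problems."""
    correct = sum(ask_model(q) == answer for q, answer in (make_variant() for _ in range(n)))
    return correct / n
```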
Maybe so. I’ll report back when I get an offline test.
Just want to voice how necessary this is. We need to get past the standard "this is just regurgitating training data" arguments so we can get to the meat of what to do about these new intelligences we're making.
I run the leaked version of Mistral Medium on a dual-GPU computer at home. Would love to test it, because it is the best model that can be run locally on a high-end consumer-grade PC.
Nice! I could send you the text questions I used. Alternatively, if you can wait a couple of weeks, Mistral is on my list to test as well.
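For anyone wanting to run the verbalized questions against a local model like this, a minimal sketch using llama-cpp-python might look as follows. The GGUF file path, prompt wording, and sampling settings are placeholders, not a tested setup:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a local GGUF build of the model; n_gpu_layers=-1
# offloads as many layers as possible to the GPUs.
llm = Llama(model_path="models/mistral-medium.Q5_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096)

def ask(verbalized_question: str) -> str:
    prompt = (
        "You are taking a visual IQ test that has been described in words.\n"
        f"{verbalized_question}\n"
        "Answer with a single option letter and a one-line justification."
    )
    out = llm(prompt, max_tokens=200, temperature=0.0)
    return out["choices"][0]["text"].strip()
```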
Thanks for sharing such an interesting topic!
GPT-4 Turbo has significantly improved intelligence over GPT-4. You might want to consider testing that.
Good to know, thank you.
If you want, I can make a visual IQ test that would be transparent and fair for AIs. I have made verbal, trivia, and matrix IQ tests: hundreds of items, tested on thousands of people. It would be very easy to make a test that all AIs can see visually, since they seem to have issues understanding answer option rows. The Norway Mensa test is decent, but very imprecise: it only has a few items, they are way too hard, and you don't have enough time. It's based on Raven's from 1932 and then a German speed test. The Raven's was not timed initially, but is now way too easy after the Flynn effect. The speed test is great, but imprecise at IQ scores above 120.
Norway Mensa works via the timer, yet AI doesn't really need time to solve things, just data. So you can't compare it directly to individual humans on a test made for a specific time format, maybe. But at any rate there are a ton of ideas to explore, and untimed tests too. I suggest making a verbal test, trivia test, math test, logic test, and matrix test, and then comparing IQ scores across all of them to see what IQ AIs have in every aspect of testing. If you look at the most popular test in the world, the WAIS, it measures subfactors like verbal IQ, math skills, and visual perception, not just the overall IQ. And AI should basically ace any verbal IQ test with clear answer choices. The issue is that it's actually like 50 IQ right now if you don't give it verbal instructions. By mixing verbal and non-verbal items, you don't clearly know what part of intelligence you are measuring.
Jurij, you can email me, maxim.lott@gmail.com
It would be great to discuss creating some new questions for these AIs further.
Best,
Maxim
That would be quite interesting! I would be happy to run it if you make such a matrix test.
I think the most important thing for AI visual tests is that the shapes are very VISIBLE. For example, in my last post, when I gave AIs the visual test, they were relatively likely to get question 13 on Norway Mensa right: while its logic pattern isn't ultra-simple, all the shapes and colors are unambiguous (big triangles, circles, squares).
Seems very sensible. If we care about artificial general intelligence and we can't be sure the g-loading of tests is similar to what it is for humans (e.g. verbal is going to be lower - https://emilkirkegaard.dk/en/2023/05/which-test-has-the-highest-g-loading/), then "making a verbal test, trivia test, math test, logic test, and matrix test" could pin down the best test for g.
Verbal tests measure crystallized intelligence: things you have learned over many years. That's why processing power and speed are not measured or needed here. Many old people have very high crystallized intelligence, but their fluid intelligence has declined with the years. So you are basically measuring how much they have learned over their life, and you don't use any timer for it. This is the only kind of test AIs can pass. But just like an old person, an AI may, for example, not be able to learn much new stuff and may get confused by technology or too many variables. And anything speed-related is not even relevant for it.
That's why I think AIs' fluid intelligence is in the mentally retarded range. They're savants, just like Kim Peek, whom the movie Rain Man is based on. But obviously no one would have put Kim Peek in charge of a giant tech company or anything close to that. Hence the reason we measure different factors of intelligence. Often in jobs you get a simple IQ test, but it measures everything in one. They would never be interested in measuring your verbal IQ only, unless you legitimately have to work alone on writing. And even then it would imply reading the test yourself: opening it, reading the instructions, filling out all the fields carefully. AIs can't do that. It's like quizzing an old man about words without seeing whether he can actually even take a test. Of course, in this case the AIs did solve verbal puzzles. But you can't know what intelligence they used for it, since they have so much data saved that you can't know what data they are using at any one time.
Probably that's true, but AIs have so much "crystallized intelligence" that most of the time they can mimic our "fluid intelligence".
Good post.
I'm not sure investing in AI companies is a good idea though, it's probably better to hedge your bets and make investments assuming AI doesn't amount to much, since the world where it does is so good/destroyed anyway.
Plausible! I guess it matters just how “post-scarcity” one thinks a world with super-powerful friendly AI would be.
Thanks for all the work. This is the most fascinating Substack piece I've read in a long time. Please continue this work.
For what it's worth, the pattern I see in Exercise #27:
• the scissors rotate CW by 90° (left to right) in each row;
• the scissors rotate CCW by 90° (top to bottom) in each column; and
• obviously, the answer has to be wide open scissors.
Unfortunately, I didn't see this pattern initially (I picked B).
Thank you!
Any particular reason you didn't test "Le Chat Mistral" (https://mistral.ai/)?
It solves #2 (2/2) and fails on #27 (0/2) (using the instructions you provided).
From my site TrackingAI.org, I've noticed it misunderstands a relatively high proportion of questions (compared to other AIs). Obviously that's not a reason *not* to rank it, but it's why I deprioritized it here for the sake of time. But in the future, I'll make it more comprehensive.
Maybe it'd do better in French?
Happy to share the full verbalized set of questions with you, or anyone, if you'd like to replicate it.
That'd be great!
This is very valuable, thank you.
Thank you!
When describing your initial labeling mistake in exercise 2, you made another mistake: "In the initial version of my last post, I correctly described the right answer (“a square with a triangle in it”)" should say "a square with a hollow diamond in it"; there are no triangles in the exercise.
I don't get how the training data encoded the IQ questions for the open models. I'm assuming they wouldn't have gone through the conversion to a verbal exercise that you did, would they? How do they get knowledge of the images? Is there also an AI that can be used to describe images verbally? In short, what did they learn from to solve these types of puzzles?
It could be in the training data via the YouTube transcript. Or, through some forum discussing the problem in words, which I am unaware of.
Can you send me the text questions you used on pastebin? I am going to try this with GPT-4o.
A question. Since you did each test twice, did you notice any differences in the answers given by each model? That is: are models answering in a stable way, or do they provide very different answers each time? I think this is something that needs to be tested, beyond the number of correct answers.
Check out the blue bar charts further down in the post. You can see which questions it got right twice vs. once. There is at least some instability, with very different answers at times, but not too much (at least in Claude 3; Claude 2 had more).
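A minimal sketch of the tally behind those charts, assuming the two runs' answers and the answer key are stored as parallel lists (the variable and function names here are illustrative, not from the post):

```python
def tally_stability(run1, run2, key):
    """Count how many questions a model got right twice, once, or never,
    plus how often the two runs gave the same answer at all."""
    right_twice = right_once = never_right = same_answer = 0
    for a1, a2, correct in zip(run1, run2, key):
        hits = (a1 == correct) + (a2 == correct)
        if hits == 2:
            right_twice += 1
        elif hits == 1:
            right_once += 1
        else:
            never_right += 1
        same_answer += (a1 == a2)
    return {"right_twice": right_twice, "right_once": right_once,
            "never_right": never_right, "same_answer_both_runs": same_answer}
```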
Thank you. I always struggle with the randomness in some of these models
Hi Maxim, I am wondering where you obtained the raw-score-to-FSIQ norms? They look kinda iffy to me.
Hey -- for scores over the threshold, I just plugged the AI answers into Mensa Norway's quiz and used the score it gave.
13 questions right gets you 85, for example, and 14 gets you 88. For every question below that 13-question threshold, I subtracted 3 points as an estimate.
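In code form, that estimate would look roughly like this. This is a sketch of the described method, assuming Mensa Norway's own conversion is used at 13 or more correct and the 3-points-per-question extrapolation below that; the function name is my own:

```python
def estimate_iq(questions_right, official_score=None):
    """IQ estimate from a raw Mensa Norway score.
    At 13+ correct, use the score the official quiz reports
    (13 -> 85, 14 -> 88, ...); below that, extrapolate downward
    from 85 at 3 points per missed question."""
    if questions_right >= 13 and official_score is not None:
        return official_score
    return 85 - 3 * (13 - questions_right)

print(estimate_iq(10))  # 76, an estimate below the quiz's reporting floor
```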
Could you explain how your "questions right" column has decimal numbers like 7.5, 10.5, and 18.5?
Because it's the average across two test administrations (e.g., 7 right on one run and 8 on the other averages to 7.5).