Sorry if the question is stupid, but did you use Claude 3 Opus or Sonnet (https://twitter.com/anthropicai/status/1764653830468428150)?
Not a stupid question: Opus, which is supposed to be the smartest one.
It's important to keep in mind the possibility of training set contamination when doing tests like this. For example, the authors of this study (https://arxiv.org/abs/2402.19450) replaced problems from a math benchmark with similar problems they made "from scratch" and found that the models were able to solve 50-80 percent fewer problems. The authors argue this was because copies of the problems had been included (probably accidentally) in the training set, and as a result the models simply memorized the answers to the questions rather than learning generalized principles that would allow them to answer other questions. Something similar could be going on with these IQ tests, in which case you wouldn't be testing what you think you're testing.
Thanks. Agreed that this is a potential problem, though even if it is the case, the analysis shows two things:
-- More modern models are dramatically better at querying the appropriate section of their "memory" (or they have more relevant memory to query.)
-- The top AIs are still getting the easy questions right, as opposed to the harder ones. I think this is some evidence that the questions are not in their training data, though it's far from proof. But what it does show conclusively is that the AIs don't have a whole answer key readily available in their memory.
I do want to get my hands on an offline-only IQ test, though. Maybe a library somewhere...
The same was true of the MATH benchmark in that study I linked to before. GPT-4 went from solving ~25 percent of static MATH problems to ~12 percent of problems that were generated dynamically. I think you are measuring something real as far as differences between the various LLMs go.
But it strikes me as a huge leap from that to "simple extrapolation of current growth rates suggested that Claude-6 would get all the IQ questions right, and be smarter than just about everyone, in about 4 - 10 years." An LLM with an IQ of 150 (as measured by possibly memorized IQ tests) might be very different from a human being with an IQ of 150, because human beings have much less capacity for rote memorization than an LLM.
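To make the contamination-check idea concrete, here is a toy sketch (not the paper's actual code; the problem template and the `ask_model` callable are placeholders I've invented for illustration). The point is that freshly randomized variants of a problem can't have been memorized, so any accuracy drop relative to the static, published version suggests contamination.

```python
import random

def make_variant():
    """Generate a fresh arithmetic word problem with randomized numbers,
    so this exact text cannot have appeared in any training set."""
    a = random.randint(12, 97)
    b = random.randint(3, 9)
    question = f"A crate holds {a} apples and {b} crates arrive. How many apples arrive in total?"
    return question, a * b

def variant_accuracy(ask_model, n=50):
    """ask_model is a placeholder callable that sends a prompt to a model
    and returns its numeric answer. Compare this accuracy against accuracy
    on the static benchmark version of the same problems."""
    correct = sum(ask_model(q) == answer for q, answer in (make_variant() for _ in range(n)))
    return correct / n
```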
Maybe so. I’ll report back when I get an offline test.
Just want to voice how necessary this is. We need to get past the standard "this is just regurgitating training data" arguments so we can get to the meat of what to do about these new intelligences we're making.
I run the leaked version of Mistral Medium on a dual-GPU computer at home. Would love to test it, because it is the best model that can be run locally on a high-end consumer-grade PC.
Nice! I could send you the text questions I used. Alternatively, if you can wait a couple of weeks, Mistral is on my list to test as well.
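For anyone wanting to run the verbalized questions against a local model like this, a minimal sketch using llama-cpp-python might look as follows. The GGUF file path, prompt wording, and sampling settings are placeholders, not a tested setup:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path to a local GGUF build of the model; n_gpu_layers=-1
# offloads as many layers as possible to the GPUs.
llm = Llama(model_path="models/mistral-medium.Q5_K_M.gguf",
            n_gpu_layers=-1, n_ctx=4096)

def ask(verbalized_question: str) -> str:
    prompt = (
        "You are taking a visual IQ test that has been described in words.\n"
        f"{verbalized_question}\n"
        "Answer with a single option letter and a one-line justification."
    )
    out = llm(prompt, max_tokens=200, temperature=0.0)
    return out["choices"][0]["text"].strip()
```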
Thanks for sharing such an interesting topic!
GPT-4 Turbo has significantly improved intelligence over GPT-4. You might want to consider testing that.
Good to know, thank you.
If you want, I can make a visual IQ test that would be transparent and fair for AIs. I have made verbal, trivia, and matrix IQ tests: hundreds of items, tested on thousands of people. It would be very easy to make a test that all AIs can see visually, since they seem to have issues understanding answer option rows. The Norway Mensa test is decent, but very imprecise: it only has a few items, they are way too hard, and you don't have enough time. It's based on Raven's from 1932 and then a German speed test. The Raven's was not timed initially, but is now way too easy after the Flynn effect. The speed test is great, but imprecise at IQ scores above 120.
Norway Mensa works via the timer, yet AI doesn't really need time to solve things, just data. So you can't compare it directly to individual humans on a test made for a specific time format, maybe. But at any rate there are a ton of ideas to explore, and untimed tests too. I suggest making a verbal test, trivia test, math test, logic test, and matrix test, and then comparing IQ scores across all of them to see what IQ AIs have in every aspect of testing. If you look at the most popular test in the world, the WAIS, it measures subfactors like verbal IQ, math skills, and visual perception, not just the overall IQ. And AI should basically ace any verbal IQ test with clear answer choices. The issue is that it's actually like 50 IQ right now if you don't give it verbal instructions. By mixing verbal and non-verbal items, you don't clearly know what part of intelligence you are measuring.
Jurij, you can email me, maxim.lott@gmail.com
It would be great to discuss creating some new questions for these AIs further.
Best,
Maxim
That would be quite interesting! I would be happy to run it if you make such a matrix test.
I think the most important thing for AI visual tests is that the shapes are very VISIBLE. For example, in my last post, when I gave AIs the visual test, they were relatively likely to get question 13 on Norway Mensa right: while its logic pattern isn't ultra-simple, all the shapes and colors are unambiguous (big triangles, circles, squares).
Seems very sensible. If we care about artificial general intelligence and we can't be sure the g-loading of tests is similar to what it is for humans (e.g. verbal is going to be lower - https://emilkirkegaard.dk/en/2023/05/which-test-has-the-highest-g-loading/), then "making a verbal test, trivia test, math test, logic test, and matrix test" could pin down the best test for g.
Verbal tests measure crystallized intelligence: things you have learned over many years. That's why processing power and speed are not measured or needed here. Many old people have very high crystallized intelligence, but their fluid intelligence has declined with the years. So you are basically measuring how much they have learned over their life, and you don't use any timer for it. This is the only kind of test AIs can pass. But just like an old person, an AI may, for example, not be able to learn much new stuff and may get confused by technology or too many variables. And anything speed-related is not even relevant for it.
That's why I think AIs' fluid intelligence is in the mentally retarded range. They're savants, just like Kim Peek, whom the movie Rain Man is based on. But obviously no one would have put Kim Peek in charge of a giant tech company or anything close to that. Hence the reason we measure different factors of intelligence. Often in jobs you get a simple IQ test, but it measures everything in one. They would never be interested in measuring your verbal IQ only, unless you legitimately have to work alone on writing. And even then it would imply reading the test yourself: opening it, reading the instructions, filling out all the fields carefully. AIs can't do that. It's like quizzing an old man about words without seeing whether he can actually even take a test. Of course, in this case the AIs did solve verbal puzzles. But you can't know what intelligence they used for it, since they have so much data saved that you can't know what data they are using at any one time.
Probably that's true, but AIs have so much "crystallized intelligence" that most of the time they can mimic our "fluid intelligence".
Good post.
I'm not sure investing in AI companies is a good idea though, it's probably better to hedge your bets and make investments assuming AI doesn't amount to much, since the world where it does is so good/destroyed anyway.
Plausible! I guess it matters just how “post-scarcity” one thinks a world with super-powerful friendly AI would be.
Thanks for all the work. This is the most fascinating Substack piece I've read in a long time. Please continue this work.
For what it's worth, the pattern I see in Exercise #27:
• the scissors rotate CW by 90° (left to right) in each row;
• the scissors rotate CCW by 90° (top to bottom) in each column; and
• obviously, the answer has to be wide open scissors.
Unfortunately, I didn't see this pattern initially (I picked B).
Thank you!
Any particular reason you didn't test "Le Chat Mistral" (https://mistral.ai/)?
It solves #2 (2/2) and fails on #27 (0/2) (using the instructions you provided).
From my site TrackingAI.org, I've noticed it misunderstands a relatively high proportion of questions (compared to other AIs). Obviously that's not a reason *not* to rank it, but it's why I deprioritized it here for the sake of time. But in the future, I'll make it more comprehensive.
Maybe it'd do better in French?
Happy to share the full verbalized set of questions with you, or anyone, if you'd like to replicate it.
That'd be great!
This is very valuable, thank you.
Thank you!
When describing your initial labeling mistake in exercise 2, you made another mistake: "In the initial version of my last post, I correctly described the right answer (“a square with a triangle in it”)" should say "a square with a hollow diamond in it"; there are no triangles in the exercise.
I don't get how the training data encoded the IQ questions for the open models. I'm assuming they wouldn't have gone through the conversion to a verbal exercise that you did, would they? How do they get knowledge of the images? Is there also an AI that can be used to describe images verbally? In short, what did they learn from to solve these types of puzzles?
It could be in the training data via the YouTube transcript. Or, through some forum discussing the problem in words, which I am unaware of.
Can you send me the text questions you used on pastebin? I am going to try this with GPT-4o.
A question. Since you did each test twice, did you notice any differences in the answers given by each model? That is: are models answering in a stable way, or do they provide very different answers each time? I think this is something that needs to be tested, beyond the number of correct answers.
Check out the blue bar charts further down in the post. You can see which questions it got right twice vs. once. There is at least some instability, with very different answers at times, but not too much (at least in Claude 3; Claude 2 had more).
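A minimal sketch of the tally behind those charts, assuming the two runs' answers and the answer key are stored as parallel lists (the variable and function names here are illustrative, not from the post):

```python
def tally_stability(run1, run2, key):
    """Count how many questions a model got right twice, once, or never,
    plus how often the two runs gave the same answer at all."""
    right_twice = right_once = never_right = same_answer = 0
    for a1, a2, correct in zip(run1, run2, key):
        hits = (a1 == correct) + (a2 == correct)
        if hits == 2:
            right_twice += 1
        elif hits == 1:
            right_once += 1
        else:
            never_right += 1
        same_answer += (a1 == a2)
    return {"right_twice": right_twice, "right_once": right_once,
            "never_right": never_right, "same_answer_both_runs": same_answer}
```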
Thank you. I always struggle with the randomness in some of these models
Hi Maxim, I am wondering where you obtained the raw-score-to-FSIQ norms? They look kinda iffy to me.
Hey -- for scores over the threshold, I just plugged the AI answers into Mensa Norway's quiz and used the score it gave.
13 questions right gets you 85, for example, and 14 gets you 88. For every question below that 13-question threshold, I subtracted 3 points as an estimate.
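In code form, that estimate would look roughly like this. This is a sketch of the described method, assuming Mensa Norway's own conversion is used at 13 or more correct and the 3-points-per-question extrapolation below that; the function name is my own:

```python
def estimate_iq(questions_right, official_score=None):
    """IQ estimate from a raw Mensa Norway score.
    At 13+ correct, use the score the official quiz reports
    (13 -> 85, 14 -> 88, ...); below that, extrapolate downward
    from 85 at 3 points per missed question."""
    if questions_right >= 13 and official_score is not None:
        return official_score
    return 85 - 3 * (13 - questions_right)

print(estimate_iq(10))  # 76, an estimate below the quiz's reporting floor
```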
Could you explain how your "questions right" column has decimal numbers like 7.5, 10.5, and 18.5?
Because it's the average across two test administrations (e.g., 7 right on one run and 8 on the other averages to 7.5).