19 Comments

I tried the reverse approach of getting AIs to generate matrix items. It completely failed. It is, however, good at number series, a kind of reasoning task.

https://www.emilkirkegaard.com/p/how-to-make-psychology-items-questions

Hi,

Great post! Would you be fine with me creating a prediction market on Manifold on whether GPT-5 will score at least 100 on your first similar test of GPT-5 (conditional on you running such a test)?

author

Yes! Thank you.

The market can be found here:

https://manifold.markets/Guuber3/will-gpt5-score-at-least-100-in-an

If there are any suggestions or clarifications, please do tell me and I can edit the market!

Feb 27 · edited Feb 27 · Liked by Maxim Lott

I think it's likely wishful thinking to point out the things AI is still bad at, when the things it's good at are so good that we can't even guess how it thinks. When Garry Kasparov lost to a computer, he said he felt he was playing chess with an alien. When I use Midjourney, I am constantly amazed at how creative it is in blending different styles of art, architecture, furniture, etc., to create structures that are both aesthetic and functional. Of course it makes weird mistakes sometimes, or imagines impossible geometry, but so did Escher. Another thing I find odd is that generating 3D models from 2D AI images is still in a primitive state, but the AI must be generating some sort of 3D mesh if it's doing perspective shading and lensing effects properly--which it does quite well. Even if we could figure out how to get AIs to explain their thought process in a way we could understand, they still think millions of times faster than us, so what's the point of trying to monitor their evolution?

I also don't think "strangling it in its crib" is even possible; it certainly wasn't with nuclear. The only effect of trying to restrict the proliferation of nuclear power and weapons is that those who restrict hamper their own security, no one else's. The US certainly tried to prevent the leakage of nuclear bomb technology, but it only took a few years for the USSR to replicate it, and eventually sell it to partners who could not make their own. Of course there were many scientists who insisted that we never should have developed them in the first place, but very few of them were thinking that way in the midst of WWII; many were terrified the Germans would succeed first. The nature of arms races is that if you voluntarily opt out, you might be destroyed, or put yourself at a severe disadvantage.

The choice to turn back the clock on nuclear power, however, is far more detrimental to progress than dismantling nuclear arms inventories--at least if people wish to move away from dependence on oil and gas. Will AI prove to be more like nuclear power or nuclear weapons? Both, it seems, in a way that we cannot conveniently separate. At some point we will lose our ability to control technology like this, and I'd wager sooner rather than later. The current ineptitude with IQ tests will very likely be solved later this year. Unlike with humans, innovations in machine intelligence propagate at high speed and raise the floor for all competitors. It seems impossible for humans to remain in control of this for long, and even if we could, your piece on Google's current woke bias makes me more uncomfortable than what sentient machines might choose to do of their own volition. Orwell's dark vision resembled the Chinese government today, with people at the top controlling a surveillance apparatus. We can't be sure that an autonomous AI would be worse than that, any more than we can expect it to be better.

Feb 27 · Liked by Maxim Lott

“A subject for future research: how well does ChatGPT do if the problems are spelled out verbally for it?”

That's not a comparable test, though, because verbally you are obliged to draw its attention to things it might not otherwise have noticed.

author
Feb 27 · edited Feb 27

I am already working on "translating" the answers. I think it is possible to give a relatively objective reading of the shapes, e.g., "square with a diamond in it", "circle with a square in it", etc.

But yes, it's possible some hints will seep through, and once it's done, I agree it shouldn't be taken to be exactly the same test.

Feb 28 · Liked by Maxim Lott

I just don't think those descriptions match what a truly unintelligent person functionally sees. Ask a stupid person what they were thinking when they failed one of these tests, and sometimes it didn't even occur to them that the shapes were relevant. “There's just something in the middle innit.” They didn't really notice that there were shapes there.

I think that, like most of us here, you are too intelligent to be able even to imagine what it would be like to be that stupid. You can't stop yourself from noticing the patterns you see.

author

You're probably right about what unintelligent people see.

But what if ChatGPT isn't unintelligent -- what if it's more like a very intelligent semi-blind person? So I'm working on translating the test; basically, my goal is that a smart blind person should be able to re-draw the puzzle using only the instructions.

I still don't know the results, so this may or may not be a moot point.
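To make that concrete, here's a hypothetical sketch of the kind of encoding I have in mind, as Python data (the shapes and answer options below are invented for illustration, not items from the actual test):

# Hypothetical text encoding of a 3x3 matrix puzzle, written so that a
# blind solver could re-draw every cell from the descriptions alone.
# All descriptions are invented examples, not items from the real test.
puzzle = {
    "grid": [
        ["square with a diamond in it", "square with a circle in it", "empty square"],
        ["circle with a diamond in it", "circle with a circle in it", "empty circle"],
        ["triangle with a diamond in it", "triangle with a circle in it", "?"],
    ],
    "options": {
        "A": "empty triangle",
        "B": "triangle with a diamond in it",
        "C": "triangle with a circle in it",
        "D": "empty square",
    },
}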

Feb 27 · Liked by Maxim Lott

For the second test: “The answer is B.” Really?

author

D'oh. Like ChatGPT, my verbal description was correct, but I mixed up the answer letters. Fixed.

(And fortunately, the answer was marked correctly on the answer key used to grade the AIs.)

I used the prompt below for those images with ChatGPT-4o, and with it the AI got the answers correct. Without this guided prompt, the AI always selected the wrong answer, even though it reasoned correctly:

I am going to give you some IQ test questions to see if you have reasoning ability. Use any internal technique that might help you, including self-reflection, chain of thought, or acting like you are multiple agents, to check your own work and challenge yourself to arrive at the right answer.

To ensure you avoid selecting the incorrect answer (letter) due to a disconnect between reasoning and execution--akin to knowing the right answer but selecting the wrong choice in a multiple-choice scenario--I want you first to internally describe each of the answer squares, A to F, in terms of the position of the black square.

Then reason through the exercise question, and then look up the correct answer by matching your reasoning to your descriptions of each answer possibility.

This will serve as a reminder that after arriving at a conclusion, it’s important to double-check that the answer selection accurately reflects the reasoning process.

Be mindful of the importance of staying attentive in both reasoning and execution phases to avoid simple but impactful errors.
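(If you want to reproduce this outside the chat UI, here is a minimal sketch using the OpenAI Python SDK; the model name, image URL, and environment setup are my assumptions, and the guided prompt is the text quoted above.)

# Minimal sketch: send the guided prompt plus a puzzle image to the API.
# Assumes OPENAI_API_KEY is set in the environment; the image URL is a
# placeholder.
from openai import OpenAI

GUIDED_PROMPT = "..."  # paste the full guided prompt quoted above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": GUIDED_PROMPT},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/puzzle.png"}},
        ],
    }],
)
print(response.choices[0].message.content)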

> I asked ChatGPT-4 to add up a string of numbers, for example, 34 + 5.2 + 9 + 0.2 + 7.1 + 3 + 11 + 18.889 + 15.532 + 1.1 + 3. It got it wrong.

How exactly did you prompt for this? I expected this would always be solved accurately because ChatGPT-4, which you specify you were using, would simply dump it into the Code Interpreter and get an exact answer. And when I do that myself, asking

> what is 34 + 5.2 + 9 + 0.2 + 7.1 + 3 + 11 + 18.889 + 15.532 + 1.1 + 3?

it correctly replies:

> The sum is 108.021.

Because it used the Python REPL as follows:

> # Calculating the sum of the provided numbers

> total_sum = 34 + 5.2 + 9 + 0.2 + 7.1 + 3 + 11 + 18.889 + 15.532 + 1.1 + 3

> total_sum

(I get '108.02099999999999' due to different rounding in ghci, but same thing.)
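(The trailing ...999 is ordinary binary floating point: values like 5.2 and 0.2 have no exact binary representation, so a naive float sum carries a tiny error. A minimal Python sketch:)

# Binary floats can't represent 5.2, 0.2, 18.889, etc. exactly, so the
# naive left-to-right sum carries a tiny representation error.
nums = [34, 5.2, 9, 0.2, 7.1, 3, 11, 18.889, 15.532, 1.1, 3]
print(sum(nums))            # 108.02099999999999 on a typical build
print(round(sum(nums), 3))  # 108.021

# Exact decimal arithmetic avoids the artifact entirely.
from decimal import Decimal
print(sum(Decimal(str(n)) for n in nums))  # 108.021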

author
Mar 6 · edited Mar 6

Just tried it again. I simply asked "Calculate 34 + 5.2 + 9 + 0.2 + 7.1 + 3 + 11 + 18.889 + 15.532 + 1.1 + 3"

For me, GPT-4 repeated the question and responded: 107.021

Off by 1.

For me, it didn't explicitly use Code Interpreter, Python, or anything like that.

Separately, I have a new post which partly takes back the conclusions of this post: https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq

Good tests.

You should know that Gemini's multimodal capabilities are currently not enabled. All of your image uploads are being serviced by Google Lens.

I tried testing Gemini's multimodal ability, and was stunned by how bad it was. Sky-high rates of refusals, hallucinations, etc. I soon began to suspect that my queries were being silently routed to Lens, and now we basically have confirmation (from Jack Krawczyk) that this is exactly the case.

https://www.reddit.com/r/Bard/comments/1amcmmn/multimodal_upgrades_are_coming_to_gemini_advanced/

To be honest, I think this is a scumbag move from Google. They advertised Gemini Ultra as having SOTA multimodal abilities...but you can't use them. And they fail to mention anywhere that you can't use them. If I signed up for Gemini Advanced ($20/mo) because Lens wasn't cutting it and I wanted Gemini's multimodality, I'd be pretty pissed off.

author

Good to know! Thank you.

So, even for a human child you would need to explain what's happening here. In particular, it would be important to explain to the AI which part of the diagram is the answer, that there is only one possible answer, and that the answer should fit as many patterns as possible.

author

Maybe for a child, but honestly, reading the AI answers, it doesn't seem to have a problem understanding what is asked of it. It just has trouble fulfilling the mission.

Separately, FYI, I have a new post which partly takes back the conclusions of this post: https://www.maximumtruth.org/p/ais-ranked-by-iq-ai-passes-100-iq

Comment deleted · Feb 27
author

Yes, I agree. It's hard to say how orders of magnitude will be reflected in the IQ range.
