Interesting post! I found DeepSeek to be pretty unimpressive - unempirical of course but my impression was that it seemed more like Chat 3.5 at best. It failed at some pretty basic stuff.
This is cool; those benchmark results from the private version are consistent with my subjective impression. I thought R1 lacked vision capability, though, so I'm a bit confused about how it's answering questions like #24. Do you intend to have R1 take your political compass test as well?
Thanks! The questions are described in words. And yes, it will be taking that quiz next.
Happy to hear that's next! I've been very curious about how reasoning models will fall on the political spectrum. I noticed that o1 is quite extreme in your testing but in the middle of the pack in David Rozado's testing, so I don't know what to think yet.
Chat has about a 30% fail rate on the questions I ask, like ‘Was there ever a Grandma Burger at A&W?’ It answers, and then I ask ‘Are you sure?’ and it apologizes.
It has happened with recipes, novels and multiple other things.
Thanks a lot for the article. We also wrote an article about DeepSeek, China's AI ecosystem, and ChatGPT as they relate to NVDA stock, here:
🚨 AI just got 45x cheaper—DeepSeek built a GPT-4-level model for $5.6M, and if this scales, Nvidia’s AI monopoly might not last. 🚨
https://ghginvest.substack.com/p/ai-just-got-45x-cheaperand-it-might
"However, the Chinese government is one of the most authoritarian on Earth, and it’s plausible it will pressure China’s entrepreneurs to keep further AI developments more closely-held."
Whilst your government isn't, and it hasn't plausibly pressured its associated monopolists (or "entrepreneurs") into anything at all, nor is it plausible that it is going to do so in the coming times.
Here's to non-ideological reporting :).
It is plausible that our government would pressure companies to do the same. But the level of authoritarianism beyond that (democracy vs. not, and some other human rights things) is unfortunately very different, for now.
Well, the racket about DeepSeek had a beneficial effect for GPT users: one can hardly see it as a random event that GPT made o3-mini, with its Reason feature, available today.
And in fact, there is a wide *intelligence* gap between o3-mini and both 4o-mini and 4o.
Awesome! Check out my essay on pricing models based on IQ
https://blog.aaronamelgar.me/p/iqh-a-new-way-of-pricing-ai-systems