Discussion about this post

User's avatar
Milli's avatar

Sorry if the question is stupid, but did you use Claude 3 Opus or Sonnet (https://twitter.com/anthropicai/status/1764653830468428150)?

Expand full comment
Timothy B. Lee's avatar

It's important to keep in mind the possibility of training set contamination when doing tests like this. For example, the authors of this study (https://arxiv.org/abs/2402.19450) replaced problems from a math benchmark with similar problems they made "from scratch" and found that the models were able to solve 50-80 percent fewer problems. The authors argue this was because because copies of the problems had been included (probably accidentally) in the training set and as a result the models simply memorized the answers to the question rather than learning generalized principles that would allow them to answer other questions. Something similar could be going on with these IQ tests, in which you wouldn't be testing what you think you're testing.

Expand full comment
40 more comments...

No posts