Discussion about this post

Greg

The God Machine is on its way

Nick Q.

Really appreciate this kind of longitudinal benchmarking—especially when the same test is repeated over time. That consistency is rare and valuable.

That said, it’s increasingly important to ask what we’re actually measuring as models start outperforming humans on IQ tests.

We’re still rewarding fluency under constraint—token-level pattern extension—not cognitive flexibility or causal reasoning. Rising scores may reflect emergent capabilities, but they also reveal growing prompt sensitivity and structural scaffolding around task framing.

I’ve been exploring a modular prompt framework (Lexome) to help separate prompt design from true model capability. Benchmarks like these are great, but we also need tests that hold structure constant to see what’s really improving.

Would love to hear from others thinking about uncertainty-aware evaluation or cognitive framing.
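To make "holding structure constant" concrete, here is a minimal sketch in Python of what such a harness could look like. Everything in it is a placeholder: the query_model function, the prompt template, the toy items, and the bootstrap interval are illustrative assumptions, not Lexome and not the methodology used in this post. The idea is simply to fix the task framing, repeat the run, and report an interval instead of a single score.

```python
import random
import statistics

# Hypothetical stand-in for a real model call; replace with your API client.
def query_model(model_name: str, prompt: str) -> float:
    # Placeholder: returns a simulated per-item score in [0, 1].
    return random.random()

# One fixed prompt template so every model sees identical task framing.
PROMPT_TEMPLATE = "Answer the following item. Respond with the letter only.\n\n{item}"

# Toy items standing in for an IQ-style test; real items would go here.
ITEMS = ["Item 1 ...", "Item 2 ...", "Item 3 ..."]

def evaluate(model_name: str, n_runs: int = 20):
    """Repeat the full test n_runs times and report the mean score with a
    simple bootstrap 95% confidence interval (uncertainty-aware evaluation)."""
    run_scores = []
    for _ in range(n_runs):
        scores = [query_model(model_name, PROMPT_TEMPLATE.format(item=it)) for it in ITEMS]
        run_scores.append(statistics.mean(scores))

    # Bootstrap the mean so we get an interval rather than a single point score.
    boot_means = sorted(
        statistics.mean(random.choices(run_scores, k=len(run_scores)))
        for _ in range(1000)
    )
    return statistics.mean(run_scores), boot_means[25], boot_means[974]

if __name__ == "__main__":
    for model in ["model-a", "model-b"]:
        mean, lo, hi = evaluate(model)
        print(f"{model}: mean={mean:.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```

With framing pinned like this, any score movement between models (or between test dates) has to come from the model rather than from shifts in prompt structure, and the interval makes run-to-run noise visible instead of hiding it behind a single number.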

11 more comments...
