Their discussion contains an interesting aside:

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

So while these tasks attract the most interest as a benchmark for LLMs and other large general models, it isn't yet obvious that those models outperform human-designed, domain-specific approaches.

I wonder to what extent the large improvement comes from OpenAI deliberately targeting this class of problem in training. That result would still be significant (since there's no way to overfit to the private tasks), but it would be different from an "accidental" emergent improvement.
