It's tricky to judge the difficulty of these sorts of things. Eg, breadth of pos...

It's tricky to judge the difficulty of these sorts of things. Eg, breadth of possibilities isn't an automatic sign of difficulty. I imagine the space of programming problems permits as much variety as ARC-AGI, but since we're more familiar with problems presented as natural language descriptions of programming tasks, and since we know there's tons of relevant text on the web, we see the abstract pictographic ARC-AGI tasks as more novel, challenging, etc. But, to an LLM, any task we can conceive of will be (roughly) as familiar as the amount of relevant training data it's seen. It's legitimately hard to internalize this.

For a space of tasks which are well-suited to programmatic generation, as ARC-AGI is by design, if we can do a decent job of reverse engineering the underlying problem generating grammar, then we can make an LLM as familiar with the task as we're willing to spend on compute.

To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsuprising (in light of past results) and not that strong of a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buys you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever's ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.