The idea behind this particular benchmark, at least, is that it can't be gamed. What are some ways to game ARC-AGI, meaning to pass it without developing the required internal model and insights?
In principle you can't optimize specifically for ARC-AGI, train against it, or overfit to it, because only a few of the puzzles are publicly disclosed.
Whether it lives up to that goal, I don't know, but their approach sounded good when I first heard about it.
Well, with billions in funding you could task a hundred or so very well paid researchers to do their best at reverse engineering the general thought process which went into ARC-AGI, and then generate fresh training data and labeled CoTs until the numbers go up.
Right, but the ARC-AGI people would counter by saying they're welcome to do just that. In doing so -- again in their view -- the researchers would create a model that could be considered capable of AGI.
I spent a couple of hours looking at the publicly available puzzles, and was really impressed by how much room for creativity the format provides. Supposedly the puzzles are "easy for humans," but some of them were not... at least not for me.
(It did occur to me that a better test of AGI might be the ability to generate new, innovative ARC-AGI puzzles.)
It's tricky to judge the difficulty of these sorts of things. Eg, breadth of possibilities isn't an automatic sign of difficulty. I imagine the space of programming problems permits as much variety as ARC-AGI, but since we're more familiar with problems presented as natural language descriptions of programming tasks, and since we know there's tons of relevant text on the web, we see the abstract pictographic ARC-AGI tasks as more novel, challenging, etc. But, to an LLM, any task we can conceive of is roughly as familiar as the amount of relevant training data it's seen for that task. It's legitimately hard to internalize this.
For a space of tasks that's well suited to programmatic generation, as ARC-AGI is by design, a decent job of reverse engineering the underlying problem-generating grammar lets us make an LLM as familiar with the task as our compute budget allows.
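To make the point concrete, here's a toy sketch of what "programmatic generation from a grammar" means. This is not the actual ARC-AGI generator; the rule set, function names, and grid encoding are all illustrative. A tiny hand-written "grammar" of transformations is sampled, applied to random grids, and emitted as input/output pairs of the kind you could train on at whatever scale compute allows.

```python
import random

# Toy grid transformations standing in for a reverse-engineered
# problem-generating grammar. Grids are lists of lists of small ints
# (colors), loosely in the spirit of ARC-AGI's colored grids.

def mirror_lr(grid):
    """Flip each row left-to-right."""
    return [list(reversed(row)) for row in grid]

def transpose(grid):
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]

def recolor(grid, mapping):
    """Replace colors according to a mapping; unknown colors pass through."""
    return [[mapping.get(c, c) for c in row] for row in grid]

# The "grammar": a small catalog of hidden rules a task can be built from.
RULES = {
    "mirror_lr": mirror_lr,
    "transpose": transpose,
    "swap_colors_0_1": lambda g: recolor(g, {0: 1, 1: 0}),
}

def generate_task(n_examples=3, seed=None):
    """Pick one hidden rule and emit (input, output) grid pairs
    demonstrating it -- the solver's job is to infer the rule."""
    rng = random.Random(seed)
    rule_name = rng.choice(sorted(RULES))
    rule = RULES[rule_name]
    pairs = []
    for _ in range(n_examples):
        rows, cols = rng.randint(2, 5), rng.randint(2, 5)
        grid = [[rng.randrange(4) for _ in range(cols)] for _ in range(rows)]
        pairs.append((grid, rule(grid)))
    return rule_name, pairs
```

The real generator would of course have a far richer rule space (composition of primitives, object detection, symmetry constraints, etc.), but the structure is the same: once the grammar is approximated, training data is as cheap as compute.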
To be clear, I'm not saying solving these sorts of tasks is unimpressive. I'm saying that I find it unsurprising (in light of past results) and not that strong a signal about further progress towards the singularity, or FOOM, or whatever. For any of these closed-ish domain tasks, I feel a bit like they're solving Go for the umpteenth time. We now know that if you collect enough relevant training data and train a big enough model with enough GPUs, the training loss will go down and you'll probably get solid performance on the test set. Trillions of reasonably diverse training tokens buy you a lot of generalization. Ie, supervised learning works. This is the horse Ilya Sutskever has ridden to many glorious victories and the big driver of OpenAI's success -- a firm belief that other folks were leaving A LOT of performance on the table due to a lack of belief in the power of their own inventions.