- "Are You Human" https://arxiv.org/pdf/2410.09569 is designed to be directly on target, i.e. a cross-cutting set of questions that are easy for humans but challenging for LLMs, instead of one type of visual puzzle. Much better than ARC for the purpose you're looking for.
- SimpleBench https://simple-bench.com/ (similar to above; great landing page w/scores that show human / ai gap)
- PIQA (physical question answering, e.g. "how do I get a yolk out of a water bottle"; a common favorite of local LLM enthusiasts in /r/localllama) https://paperswithcode.com/dataset/piqa
Via AI search, I googled "llm benchmarks challenging for ai easy for humans", "language model benchmarks that humans excel at but ai struggles with", and "tasks that are easy for humans but difficult for natural language ai".
It also mentioned that Moravec's Paradox is a known framing of this concept. I started going down that rabbit hole because the resources were fascinating, but had to hold back and submit this reply first. :)
Thanks for the pointers! I hadn't seen Are You Human. Looks like it's only two months old. Of course it is much easier to design a test specifically to thwart LLMs now that we have them. It seems to me that it is designed to exploit details of LLM structure like tokenizers (e.g. character counting tasks) rather than to provide any sort of general reasoning benchmark. As such it seems relatively straightforward to improve performance in ways that wouldn't necessarily represent progress in general reasoning. And today's LLMs are not nearly as far from human performance on the benchmark as they were on ARC for many years after it was released.
SimpleBench looks more interesting. Also less than two months old. It doesn't look as challenging for LLMs as ARC, since o1-preview and Sonnet 3.5 already got half of the human baseline score; they did much worse on ARC. But I like the direction!
PIQA is cool but not hard enough for LLMs.
I'm not sure Berkeley Function-Calling represents tasks that are "easy" for average humans. Maybe programmers could perform well on it. But I like ARC in part because the tasks do seem like they should be quite straightforward even for non-expert humans.
Moravec's paradox isn't a benchmark per se. I tend to believe that there is no real paradox and all we need is larger datasets to see the same scaling laws that we have for LLMs. I see good evidence in this direction: https://www.physicalintelligence.company/blog/pi0
> "I'm not sure Berkeley Function-Calling represents tasks that are easy for average humans. Maybe programmers could perform well on it."
Functions in this context are not programming function calls. Here, "function calling" is a now-deprecated LLM API name for "parse the input into this JSON template." No programming experience needed. It's entity extraction by another name, except easier: you're told up front exactly which entities to identify. :)
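To make that concrete, here's a minimal sketch (function name and fields are hypothetical, not any real API's schema) of what a "function call" amounts to in this setting: you hand the model a JSON template, the model returns JSON filling in the slots, and "calling the function" is just parsing that JSON.

```python
import json

# Hypothetical "function" definition, in the style LLM APIs use:
# just a named JSON template the model is asked to fill in.
get_weather = {
    "name": "get_weather",
    "parameters": {
        "city": "string",
        "unit": "celsius or fahrenheit",
    },
}

# What the model sends back is plain JSON matching that template.
# No code executes anywhere; the slots to extract were named up front.
model_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'

call = json.loads(model_output)
assert call["name"] == get_weather["name"]
print(call["arguments"])
```

A human grader needs zero programming background to check this: did the model pick the right template, and did it put the right values in the labeled slots?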
> "Moravec's paradox isn't a benchmark per se."
Yup! It's a paradox :)
> "Of course it is much easier to design a test specifically to thwart LLMs now that we have them"
Yes.
Though, I'm concerned a simple yes might be insufficient for illumination here.
It is a tautology (it's easier to design a test that $X fails when you have access to $X), and it's unlikely you meant to just share a tautology.
A potential unstated-but-maybe-intended-communication is "it was hard to come up with ARC before LLMs existed" --- LLMs existed in 2019 :)
If they didn't, a hacky way to come up with a test that's hard for the top AIs at the time, BERT-era, would be to use one type of visual puzzle.
If, for conversation's sake, we ignore that it is exactly one type of visual puzzle, and that it wasn't designed to be easy for humans, then we can engage with: "it's the only one that's easy for humans, but hard for LLMs" --- this was demonstrated as untrue as well.
I don't think I have much to contribute past that. Once we're at "It is a singular example of a benchmark that's easy for humans but nigh-impossible for LLMs, at least in 2019, and this required singular insight", there's just too much that's not even wrong, in the Pauli sense, and it's in a different universe from the original claims:
- "Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far."
- "A lot of people have criticized ARC as not being relevant or indicative of true reasoning...The fact that [o-series models] show progress on ARC proves that what it measures really is relevant and important for reasoning."
- "...nobody could quantify exactly the ways the models were deficient..."
- "What we need right now are "easy" benchmarks that these models nevertheless fail."