OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model
semi-private evals (100 tasks):
75.7% @ $2,012 total for 100 tasks (~$20/task), with just 6 samples, 33M tokens processed, and ~1.3 min/task
The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.
If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.
On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting. (high-efficiency mode: $6,677)
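The extrapolation is just linear scaling of the published high-efficiency costs by the 172x compute factor. A minimal sketch (dollar figures are from the thread; cost being proportional to compute is an assumption):

```python
# All dollar figures are from the thread; proportional cost scaling is an assumption.
COMPUTE_FACTOR = 172  # low-efficiency (1024 samples) vs high-efficiency (6 samples)

semi_private_high = 2_012   # $ for 100 semi-private tasks, high-efficiency
public_high = 6_677         # $ for the public eval, high-efficiency

semi_private_low = semi_private_high * COMPUTE_FACTOR
public_low = public_high * COMPUTE_FACTOR

print(semi_private_low)  # 346064  -> ~$346k
print(public_low)        # 1148444 -> ~$1.15M
```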
OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
It sounds like they essentially brute-forced the solutions?
Ask the LLM for an answer, then ask the LLM to verify that answer. Ask again, verify again. Add a bit of randomness. Ask, verify. Repeat 5B times (this is what the paper says).
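The loop described is essentially best-of-N sampling with self-verification. A hypothetical sketch; `ask_llm` and `verify_llm` are placeholder callables, since o3's actual mechanism is not public:

```python
def sample_and_verify(ask_llm, verify_llm, task, n_samples=1024, temperature=0.7):
    """Best-of-N sampling with LLM self-verification (hypothetical sketch).

    ask_llm and verify_llm are placeholder callables, not a real API.
    """
    candidates = []
    for _ in range(n_samples):
        # "Add a bit of randomness": sampling temperature varies the answers
        answer = ask_llm(task, temperature=temperature)
        # Ask the LLM to score/verify its own candidate answer
        score = verify_llm(task, answer)
        candidates.append((score, answer))
    # Keep the highest-scoring candidate
    return max(candidates, key=lambda c: c[0])[1]
```

With n_samples=1024 per task this corresponds to the "low-efficiency" setting described above; 6 samples to the high-efficiency one.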
Evolution itself is the ultimate brute-force algorithm—it’s just applied over millennia. Trial and error, coupled with selection and refinement, is the only way to generate novelty when there’s no clear blueprint.
By my estimates, for this single benchmark, this is comparable cost to training a ~70B model from scratch today. Literally from 0 to a GPT-3 scale model for the compute they ran on 100 ARC tasks.
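One way to sanity-check the "70B from scratch" comparison is the standard training-compute approximation C ≈ 6·N·D, with the Chinchilla-style heuristic of ~20 tokens per parameter (my assumptions, not figures from the thread):

```python
# C ~= 6 * N * D: standard rough estimate of training FLOPs.
# N and D are my assumptions for a "Chinchilla-optimal" 70B model.
N = 70e9        # parameters
D = 20 * N      # ~20 training tokens per parameter (Chinchilla heuristic)

train_flops = 6 * N * D
print(f"~{train_flops:.2e} FLOPs")  # ~5.88e+23
```

That lands squarely inside the 10^22-10^24 FLOP band estimated below for the o3 runs.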
I double-checked with some FLOP estimates (a P100 for 12 hours is the Kaggle compute limit; they claim ~100-1000x that for o3-low, and 172x more for o3-high), so roughly on the order of 10^22-10^23 FLOPs.
Another way: at the H100 market price of ~$2/GPU-hour, $350k buys ~175k GPU-hours, or ~10^24 FLOPs in total.
So the margin is huge, but 10^22-10^24 FLOPs is the band I think we can estimate.
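A sketch of the second estimate, under my assumptions of ~$2 per H100-hour and ~1e15 effective FLOP/s per GPU (neither figure is from the thread):

```python
# Assumptions (mine): ~$2 per H100-hour, ~1e15 effective FLOP/s per H100.
total_cost = 350_000        # $, approximate total spend
price_per_hour = 2          # $/H100-hour
flops_per_sec = 1e15        # ~1 PFLOP/s per GPU, optimistic utilization

gpu_hours = total_cost / price_per_hour           # 175,000 GPU-hours
total_flops = gpu_hours * 3600 * flops_per_sec
print(f"{gpu_hours:,.0f} GPU-hours, ~{total_flops:.1e} FLOPs")  # ~6.3e+23
```

So the dollar-based route gives ~6e23, i.e. approaching the 10^24 end of the band.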
These are the scale of numbers that show up in the Chinchilla paper, haha. Truly GPT-3-scale models.
Yes, that's correct, and there's a bit of "pixel math" as well, so take these numbers with a pinch of salt. Preliminary model-size estimates from the temporarily public HF repository put the full model at 8 TB, or roughly 80 H100s.
I didn't hear that, but it could be. It doesn't really matter, though, because there's so much more to consider in the cost: R&D, plus all the supporting functions of a model like censorship, data capture, and so on.