And o3 will not run on the private set unless it is a truly free and open source model (presumably this is also the case for ARC-AGI-2). This is the distinction between private and semi-private. For the private set, you provide all the knowledge/weights/logic needed to operate without any external communication. Private benchmark results are the only true evaluation of performance on any benchmark, and they are reserved for a final evaluation. It is the only way to prevent shenanigans.
There’s a fully private test set too, as I understand it, that o3 hasn’t run on yet.