Hacker News

I don't care about benchmarks. o1 ranks higher than Claude on "benchmarks" but performs worse in particular real-life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than by hypothetical benchmarks.


In most non-competitive coding benchmarks (Aider, LiveBench, SWE-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different), or at least it did; the new checkpoint from two days ago finally pushed o1 over Sonnet on LiveBench.


As I said, o3 demonstrated Fields Medal level research capacity on the FrontierMath tests. But I'm sure your use cases are much more difficult than that, obviously.


There are many comments on the internet pointing out that only a subset of the FrontierMath benchmark is "Fields Medal level research", and that o3 likely scored on the easier subset.

Also, all of that is shady in the sense that these are just numbers from OAI, not independently reproducible, on a benchmark sponsored by OAI. If we assume OAI could be a bad actor, they had plenty of opportunities to cheat here.


“Objective benchmarks are useless, let’s argue about which one works better for me personally.”


Yes. Passing my benchmarks and their benchmarks means AGI. Passing only their benchmarks means over-fitting.


OK, so what happens if we get different results on our own personal benchmarks/use cases?

(See why objective benchmarks exist?)


Yes. "Objective" benchmarks can be gamed; real-life tasks cannot.



