Hacker News

This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).

So much for a plateau lol.



> So much for a plateau lol.

It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT-3.5 were somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.


The pundits' response to the (alleged) plateau was proportional to the certainty with which the CEOs of frontier labs discussed pre-training scaling. The o3 result comes from scaling test-time compute, which represents a meaningful change in how you would build out compute for scaling (single supercluster --> presence in regions close to users). Thus it is important to discuss.


You could make an equivalently dismissive comment about the hypesters.


Yeah but anyone with half a brain knows to ignore them. Vapid cynicism is a lot more seductive to the average nerd.


>Frontier Math (25% on high compute, previous 2%)

This is so insane that I can't help but be skeptical. I know the FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...

Happy to be wrong on this.


Nope, makes sense to me. Seems unreasonable to conclude the dataset is not compromised now.


The question is whether that 25% jump is also a result of the compromised first test.


Viewed through a skeptical lens of incentives:

OpenAI and Epoch AI are both startups with every incentive to peddle this narrative, while no one else can independently verify it.


You're comparing apples and oranges. The plateau the frontier models have hit concerns the limited further gains to be had from dataset (+ corresponding model/compute) scaling.

These new reasoning models are taking things in a new direction, basically by adding search (inference-time compute) on top of the basic LLM. So the capabilities of the models are still improving, but the new variable is how deep a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10-ply search or a 20-ply one? What kind of real-world business problems will benefit from this?
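The depth/compute tradeoff in the chess analogy is easy to make concrete. A toy back-of-the-envelope sketch (the branching factor of 30 is a rough chess-like number, not a measured one):

```python
def nodes_visited(branching_factor: int, depth: int) -> int:
    """Nodes a naive full-width search visits: b**d.
    Each extra ply multiplies the work by the branching factor,
    so doubling the search depth squares the total cost."""
    return branching_factor ** depth

print(nodes_visited(30, 2))   # 900
print(nodes_visited(30, 4))   # 810000: 2x the depth, 900x the work
```

Real engines prune aggressively, of course, but the exponential shape of the cost curve is the point: depth is a dial you pay for.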


"New" reasoning models are plain LLMs with clever reinforcement learning. o1 is itself reinforcement learning on top of GPT-4o.

They found a way to make test-time compute a lot more effective, and that is an advance, but the idea is not new and the architecture is not new.

And the vast majority of people convinced LLMs had plateaued did so regardless of test-time compute.


The fact that these reasoning models may compute for extended durations, using exponentially more compute for linear performance gains (says OpenAI), and produce outputs that, while better, are not necessarily any longer (more tokens) than before, all points to a different architecture: some type of iterative calling of the underlying model (essentially a reasoning agent using the underlying model).

A plain LLM does not use variable compute - it has a fixed number of transformer layers and spends a fixed amount of compute on every token generated.
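A minimal sketch of that contrast, with a hypothetical `generate` callable standing in for one fixed-compute call to the base model; the outer loop is what makes the total compute variable:

```python
def iterative_reason(generate, prompt, max_rounds=4, good_enough=None):
    """Hypothetical sketch of an iterative reasoning wrapper: call the
    same fixed-compute base model repeatedly, feeding each draft back
    in, until a check passes or the round budget runs out. Total
    compute is now variable (rounds x per-call cost), even though
    each individual call is fixed."""
    draft = generate(prompt)
    for _ in range(max_rounds - 1):
        if good_enough and good_enough(draft):
            break
        draft = generate(prompt + "\nPrevious attempt:\n" + draft)
    return draft
```

Here `generate` and `good_enough` are stand-ins of my own; nobody outside OpenAI knows whether the reflection happens in an external loop like this or inside the model's own chain of thought.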


Architecture generally refers to the design of the model. In this case, the underlying model is still a transformer-based LLM, and so is its architecture.

What's different is the method for _sampling_ from that model: it seems they have encouraged the underlying LLM to perform a variable-length chain-of-thought "conversation" with itself, as has been done with o1. In addition, they _repeat_ these chains of thought in parallel, using a tree of some sort to search and rank the outputs. This apparently scales performance on benchmarks as you scale both the length of each chain of thought and the number of chains.
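The "repeat in parallel, then rank" step can be as simple as majority voting over the chains' final answers (the self-consistency trick); a toy sketch with made-up answers:

```python
from collections import Counter

def majority_vote(final_answers):
    """Rank parallel chain-of-thought outputs by agreement:
    the answer the most chains converge on wins."""
    return Counter(final_answers).most_common(1)[0][0]

# Hypothetical final answers extracted from five parallel chains:
print(majority_vote(["42", "42", "17", "42", "35"]))  # 42
```

Whatever tree search they actually run is surely fancier than a frequency count, but this is the simplest instance of ranking N independent chains.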


No disagreement, although the sampling + search procedure is obviously adding quite a lot to the capabilities of the system as a whole, so it really should be considered as part of the architecture. It's a bit like AlphaGo or AlphaZero - generating potential moves (cf LLM) is only a component of the overall solution architecture, and the MCTS sampling/search is equally (or more) important.


Ah, I see. Yeah that's a fair assessment and in hindsight is probably the way architecture is being used in the article.


I think throwaway already explained what I was getting at.

That said, I probably did downplay the achievement. It may not be a "new" idea to do something like this, but finding an effective method for reflection that doesn't just lock you into circular thinking, and that is applicable beyond well-defined problem spaces, is genuinely tough and a breakthrough.


I legit see that if there isn't a new breakthrough for even one week, people start shouting "plateau, plateau"... Our rate of progress is extraordinary, and any downplaying of it seems stupid.


At $6,670/task? I hope there's a jump.


It's not $6,670/task. That was the high-efficiency cost for 400 questions.
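Taking that reading at face value (the figure as a total across 400 tasks is the parent's claim, not something I can verify), the per-task arithmetic:

```python
total_cost_usd = 6670   # assumed total for the high-efficiency run, per the comment above
num_tasks = 400         # assumed number of questions in that run
cost_per_task = total_cost_usd / num_tasks
print(cost_per_task)    # 16.675, i.e. roughly $17 per task, not $6,670
```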



