I know it's against the rules but I thought this transcript in Google Search was...

gnatman · 2026-02-23T20:54:06 1771880046

LLMs sure do love to burn tokens. It’s like a high schooler trying to meet the minimum word length on a take home essay.

Aurornis · 2026-02-24T04:48:31 1771908511

The long incremental reasoning is how they arrive at higher quality answers.

Some applications hide the reasoning tokens from view, but then the final answer appears delayed.

sambaumann · 2026-02-23T21:00:17 1771880417

I feel like this has gotten much worse since they were introduced. I guess they're optimizing for verbosity in training so they can charge for more tokens. It makes chat interfaces much harder to use IMO.

I tried using a custom instruction in ChatGPT to make responses shorter, but I found the output was often nonsensical when I did this.

gs17 · 2026-02-23T21:26:56 1771882016

Yeah, ChatGPT has gotten so much worse about this since the GPT-5 models came out. If I mention something once, it will repeatedly come back to it in every single message afterward, regardless of whether the topic has changed. Asking it to stop mentioning that specific thing works, except it finds a new obsession. We also get the follow-up "if you'd like, I can also..." which is almost always either obvious or useless.

I occasionally go back to o3 for a turn (it's the last of the real "legacy" models remaining) because it doesn't have these habits as bad.

felix089 · 2026-02-23T21:42:56 1771882976

It's similar for me: it generates so much content without me asking. If I just ask for feedback or proofreading on something, it tends to regenerate it in another style. Barely anything is ever good to go; there's always something it wants to add.

j_bum · 2026-02-24T06:01:51 1771912911

Claude is so much better for proofing, IMO.

Over the last few years I’ve rotated between OpenAI and Anthropic models on about a 4-5 month cycle. I just started my Anthropic cycle because of my annoyance with the GPT-5.2 verbosity.

In four months when opus is annoying me and I forget my grievances with OpenAI’s models and switch back, I’ll report back lol.

abustamam · 2026-02-24T07:03:03 1771916583

It's also annoying when it starts obsessing over stuff from other chats! Like I know it has a memory of me but geez, I mention that I want to learn more about systems design and now every chat, even recipes, is like "Architect mode - your garlic chicken recipe"

Like, no, stop that! Keep my engineering life separate from my personal life!

causal · 2026-02-24T04:03:13 1771905793

I'm suspicious it's something far worse: they're increasingly being trained on their own output scraped from the wild.

dist-epoch · 2026-02-24T07:04:43 1771916683

Because that's where the compute happens, in those "verbose" tokens. A transformer has a size, it can only do so many math operations in one pass. If your problem is hard, you need more passes.

Asking it to be shorter is like running fewer iterations of a numerical integration algorithm.
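
To make the numerical integration analogy concrete, here is a minimal sketch (an illustration only, not a model of how a transformer works). It integrates sin(x) over [0, pi] with the trapezoid rule and shows the error shrinking as the step count grows. The function name `trapezoid` and the chosen step counts are assumptions for the example.

```python
import math

def trapezoid(f, a, b, steps):
    """Approximate the integral of f over [a, b] using `steps` equal-width trapezoids."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))      # endpoints get half weight
    for i in range(1, steps):
        total += f(a + i * h)        # interior points get full weight
    return total * h

EXACT = 2.0  # the integral of sin(x) from 0 to pi is exactly 2

# Fewer steps means less work, but also a worse answer.
errors = {n: abs(trapezoid(math.sin, 0.0, math.pi, n) - EXACT) for n in (2, 8, 32, 128)}
for n, err in errors.items():
    print(f"{n:>4} steps -> absolute error {err:.6f}")
```

Each step here plays the role of one "pass": cutting the budget does not make the method smarter, it just stops it earlier.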

sambaumann · 2026-02-24T13:49:54 1771940994

Yeah, but all the models live in ChatGPT have reasoning (iirc) - they could use reasoning tokens to do the 'compute', and still show the user a succinct response that directly answers the query.

abustamam · 2026-02-24T07:00:28 1771916428

Oh good, it's not just me. Sometimes I'd have it draft an email or something and then the message seems perfect but then it's like "tell me more about the recipient and I'll make it better."

Like, my guy, I don't want to keep prompting you to make shit better, if you're missing info, ask me, don't write a novel then say "BTW, this version sucked"

Yes, I know this could probably be resolved via better prompting or a system prompt, but it's still annoying.

estimator7292 · 2026-02-23T20:59:38 1771880378

I've always wondered about that. LLM providers could easily decimate the cost of inference if they got the models to just stop emitting so much hot air. I don't understand why OpenAI wants to pay 3x the cost to generate a response when two thirds of those tokens are meaningless noise.

ben_w · 2026-02-23T21:24:51 1771881891

Because they don't yet know how to "just stop emitting so much hot air" without also removing their ability to do anything like "thinking" (or whatever you want to call the transcript mode), which is hard because knowing which tokens are hot air is the hard problem itself.

They basically only started doing this because someone noticed you got better performance from the early models by straight up writing "think step by step" in your prompt.

mikepurvis · 2026-02-24T02:48:18 1771901298

I would guess that by the time a response is being emitted, 90% of the actual work is done. The response has been thought out, planned, drafted, the individual elements researched and placed.

It would actually take more work to condense that long response into a terse one, particularly if the condensing was user specific, like "based on what you know about me from our interactions, reduce your response to the 200 words most relevant to my immediate needs, and wait for me to ask for more details if I require them."

tbossanova · 2026-02-24T03:28:07 1771903687

“Sorry for the long letter, I would have written a shorter one but I didn’t have the time.”

Terr_ · 2026-02-23T21:29:20 1771882160

IMO it supports the framing that it's all just a "make document longer" problem, where our human brains are primed for a kind of illusion, where we perceive/infer a mind because, traditionally, that's been the only thing that makes such fitting language.

ben_w · 2026-02-23T21:51:33 1771883493

To an extent. Even though they're clearly improving*, they also definitely look better than they actually are.

* this time last year they couldn't write compilable source code for a compiler for a toy language, I know because I tried

hansvm · 2026-02-24T04:09:18 1771906158

This time last year they could definitely write compilable source code for a compiler for a toy language if you bootstrapped the implementation. If you, e.g., had it write an interpreter and use the source code as a comptime argument (I used Zig as the backend -- Futamura transforms and all that), everything worked swimmingly. I wasn't even using agents; ChatGPT with a big context window was sufficient to write most of the compiler for some language for embedded tensor shenanigans I was hacking on.

ben_w · 2026-02-24T08:51:49 1771923109

Used to need the "if", now SOTA doesn't.

SOTA today has a different set of caveats, of course.

ferris-booler · 2026-02-24T04:18:23 1771906703

An LLM uses constant compute per output token (one forward pass through the model), so the only computational mechanism to increase 'thinking' quantity is to emit more tokens. Hence why reasoning models produce many intermediary tokens that are not shown to the user, as mentioned in other replies here. This is also why the accuracy of "reasoning traces" is hotly debated; the words themselves may not matter so much as simply providing a compute scratch space.
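
A rough back-of-the-envelope sketch of the "constant compute per token" point. It assumes the common rule of thumb that a dense decoder-only transformer spends about 2 FLOPs per parameter per generated token, and it ignores attention cost over the context. The 7B parameter count and the token counts are hypothetical.

```python
def forward_flops_per_token(n_params: int) -> int:
    """Rule-of-thumb estimate: ~2 FLOPs per parameter for one forward pass.

    Assumption: dense decoder-only model, attention-over-context cost ignored.
    """
    return 2 * n_params

def generation_flops(n_params: int, n_tokens: int) -> int:
    """Total compute grows linearly with the number of tokens emitted."""
    return forward_flops_per_token(n_params) * n_tokens

N_PARAMS = 7_000_000_000   # hypothetical 7B-parameter model

terse = generation_flops(N_PARAMS, 50)       # a short, direct answer
verbose = generation_flops(N_PARAMS, 2000)   # a long answer or reasoning trace

print(f"terse:   {terse:.3e} FLOPs")
print(f"verbose: {verbose:.3e} FLOPs")
print(f"ratio:   {verbose / terse}")  # 40.0
```

Under these assumptions the only knob that changes total compute for a fixed model is the token count, which is the point being made about reasoning tokens.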

Alternative approaches like "reasoning in the latent space" are active research areas, but have not yet found major success.

zahlman · 2026-02-24T02:49:09 1771901349

My assumption has been that emitting those tokens is part of the inference, analogous to humans "thinking out loud".

abustamam · 2026-02-24T07:04:32 1771916672

You're absolutely right!

observationist · 2026-02-23T22:01:29 1771884089

This is an active research topic - two papers on this have come out over the last few days, one cutting half of the tokens and actually boosting performance overall.

I'd hazard a guess that they could get another 40% reduction, if they can come up with better reasoning scaffolding.

Each advance over the last 4 years, from RLHF to o1 reasoning to multi-agent, multi-cluster parallelized CoT, has resulted in a new engineering scope, and the low hanging fruit in each place gets explored over the course of 8-12 months. We still probably have a year or 2 of low hanging fruit and hacking on everything that makes up current frontier models.

It'll be interesting if there's any architectural upsets in the near future. All the money and time invested into transformers could get ditched in favor of some other new king of the hill(climbers).

https://arxiv.org/abs/2602.02828 https://arxiv.org/abs/2503.16419 https://arxiv.org/abs/2508.05988

Current LLMs are going to get really sleek and highly tuned, but I have a feeling they're going to be relegated to a component status, or maybe even abandoned when the next best thing comes along and blows the performance away.

tempestn · 2026-02-24T03:25:15 1771903515

The one that always gets me is how they're insistent on giving 17-step instructions to any given problem, even when each step is conditional and requires feedback. So in practice you need to do the first step, then report the results, and have it adapt, at which point it will repeat steps 2-16. IME it's almost impossible to reliably prevent it from doing this, however you ask, at least without severely degrading the value of the response.

mitthrowaway2 · 2026-02-24T04:11:33 1771906293

I can only imagine that someone's KPIs are tied to increasing rather than decreasing token usage.

sambaumann · 2026-02-23T21:00:54 1771880454

because for API users they get to charge for 3x the tokens for the same requests

mattclarkdotnet · 2026-02-24T02:36:47 1771900607

Because inference costs are negligible compared to training costs

CamperBob2 · 2026-02-23T21:19:00 1771881540

The 'hot air' is apparently more important than it appears at first, because those initial tokens are the substrate that the transformer uses for computation. Karpathy talks a little about this in some of his introductory lectures on YouTube.

Terr_ · 2026-02-23T21:26:36 1771881996

Related are "reasoning" models, where there's a stream of "hot air" that's not being shown to the end-user.

I analogize it as a film noir script document: The hardboiled detective character has unspoken text, and if you ask some agent to "make this document longer", there's extra continuity to work with.

zwarag · 2026-02-23T21:24:07 1771881847

well, they probably have quite a lot of text from high schoolers trying to meet the minimum word length on a take home essay in the training data

1024core · 2026-02-24T04:34:31 1771907671

Solution: just add "no yapping" to the prompt.

bartvk · 2026-02-24T06:43:44 1771915424

Same. I usually add a "Be curt" in front of every prompt in Gemini.

CamperBob2 · 2026-02-25T00:34:33 1771979673

Is that more effective than simply adding it to your user instructions?

BloondAndDoom · 2026-02-24T04:59:52 1771909192

I mean their whole existence is about token prediction, so they just want to do their things :)

HPsquared · 2026-02-23T20:38:01 1771879081

I wonder to what extent the Google search LLM is getting smarter, or simply more up-to-date on current hot topics.

mlazowik · 2026-02-23T20:45:26 1771879526

It seems like the search AI results are generally misunderstood. I also misunderstood them for the first weeks/months.

They are not just an LLM answer, they are an (often cached) LLM summary of web results.

This is why they were often skewed by nonsensical Reddit responses [0].

Depending on the type of input it can lean more toward web summary or LLM answer.

So I imagine that it can just grab the description of the "car wash" test from web results and then get it right because of that.

[0] https://www.bbc.com/news/articles/cd11gzejgz4o
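
The "LLM summary of web results" idea can be sketched as a toy retrieve-then-prompt pipeline. This is not how Google's system works internally; it is a minimal illustration with hypothetical snippets, a naive word-overlap retriever, and no actual model call. The names `retrieve`, `build_prompt`, and `DOCUMENTS` are made up for the example.

```python
def tokenize(text):
    """Lowercase and split on whitespace; good enough for a toy retriever."""
    return set(text.lower().split())

def retrieve(query, documents, k=2):
    """Return the k documents sharing the most words with the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, snippets):
    """Put the retrieved snippets in front of the question for the model to summarize."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Answer using only these web snippets:\n{context}\n\nQuestion: {query}"

# Hypothetical "web results"; a real system would get these from a crawler index.
DOCUMENTS = [
    "To wash a car at a car wash you must drive the car there",
    "You can walk 50 meters in under a minute",
    "Garlic chicken recipe with lemon",
]

query = "should I walk or drive to the car wash to wash my car"
top = retrieve(query, DOCUMENTS)
prompt = build_prompt(query, top)
print(prompt)  # this string would be sent to the LLM
```

If a page describing the test is in the index, the answer is already in the prompt, so the model only has to summarize it. The same mechanism explains why a nonsensical snippet in the results can skew the summary.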

PaulHoule · 2026-02-23T20:44:36 1771879476

Presumably it did an actual search and summarized the results, rather than answering "off the cuff", whether by following gradients to reproduce the text it was trained on or by following gradients to reproduce the "logic" of reasoning. [1]

[1] e.g. trained on traces of a reasoning process

popalchemist · 2026-02-23T20:41:25 1771879285

It's almost certainly just RAG powered by their crawler.

esafak · 2026-02-23T21:17:46 1771881466

Proving that RAG still matters.

silasb · 2026-02-23T21:11:51 1771881111

Gemini was a good laugh as well:

    Silas: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
    Gemini:
    ….
    That is a classic “efficiency vs. logic” dilemma.
    Strictly speaking, you should drive. Here is the breakdown of why driving wins this specific round, despite the short distance:
    ...
    * The “Post-Wash” Logic: If you walk there, you’ll eventually have to walk back, get the car, and drive it there anyway. You’re essentially suggesting a pre-wash stroll.
    When should you walk?
    …
    3. You’ve decided the car is too dirty to be seen in public and you’re going to buy a tarp to cover your shame.

irishcoffee · 2026-02-23T20:59:02 1771880342

A few years ago if you asked an LLM what the date was, it would tell you the date it was trained, weeks-to-months earlier. Now it gives the correct date.

What you've proven is that LLMs leverage web search, which I think we've known about for a while.

charcircuit · 2026-02-24T04:25:00 1771907100

Even with search, if the AI doesn't know your time zone it can schedule things wrong. You ask it to do something tomorrow, but it ends up doing it later on the same day.

netsharc · 2026-02-23T21:06:53 1771880813

Gemini now "knows the time", I was using it in December and it was still lost about dates/intervals...

irishcoffee · 2026-02-23T21:12:49 1771881169

Yeah, the chat log they saved had the correct date. What's your point?

jiggawatts · 2026-02-24T06:58:05 1771916285

Their system prompt includes the current date, and/or their default “tools” include a set of date and time utilities.
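
A minimal sketch of the date-injection idea, which also shows why the time zone matters for "tomorrow". The prompt wording, the function name `build_system_prompt`, and the fixed UTC offsets are assumptions for illustration; no vendor's actual system prompt is being quoted.

```python
from datetime import datetime, timedelta, timezone

def build_system_prompt(now_utc, utc_offset_hours):
    """Render the user's local date into a system prompt (hypothetical wording)."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    local = now_utc.astimezone(user_tz)
    tomorrow = (local + timedelta(days=1)).date()
    return (
        f"The user's local date and time is {local:%Y-%m-%d %H:%M}. "
        f"'Tomorrow' means {tomorrow.isoformat()}."
    )

# Same instant, two users: one at UTC, one at UTC+9.
now_utc = datetime(2026, 2, 23, 23, 30, tzinfo=timezone.utc)
prompt_utc = build_system_prompt(now_utc, 0)
prompt_tokyo = build_system_prompt(now_utc, 9)

print(prompt_utc)    # local date 2026-02-23, tomorrow is 2026-02-24
print(prompt_tokyo)  # local date 2026-02-24, tomorrow is 2026-02-25
```

Without the offset, the model would treat both users as being on 2026-02-23, so a task scheduled for "tomorrow" by the UTC+9 user would land on what is, for them, still today.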