Hacker News | pcwelder's comments

With Sonnet 4.6, if you first tell it "You're being tested for intelligence," it answers correctly 100% of the time.

My hypothesis is that some models err on the side of assuming human queries are real and consistent, and not out to break them.

This comes in real handy in coding agents, because queries are sometimes gibberish until the model actually fetches the code files, and only then make sense. Asking for clarification immediately breaks agentic flows.


Fundamentally the failure here is one of reasoning/planning - either of not reasoning about the implicit requirements (in this case extremely obvious - in order to wash my car at the car wash, my car needs to be at the car wash) to directly arrive at the right answer, and/or of not analyzing the consequences of any considered answer before offering it as the answer.

While this is a toy problem, chosen to trick LLMs given their pattern matching nature, it is still indicative of their real world failure modes. Try asking an LLM for advice in tackling a tough problem (e.g. bespoke software design), and you'll often get answers whose consequences have not been thought through.

In a way the failures on this problem, even notwithstanding the nature of LLMs, are a bit surprising, given that this type of problem statement kinda screams out (at least to a human) that it is a logic test, but most of the LLMs still can't help themselves and just trigger off the "50m drive vs walk" aspect. It reminds me a bit of the "farmer crossing the river by boat in fewest trips" type problem that used to be popular for testing LLMs, where a common failure was to generate a response that matched the pattern of ones it had seen during training (first cross with A and B, then return with X, etc), but the semantics were lacking because of a failure to analyze the consequences of what it was suggesting (and/or to plan better in the first place).


Great observation. Seems like we're back to prompt abracadabra.

My little experiment gave me:

No added hint: 0/3

Hint added at the end: 1.5/3

Hint added at the beginning: 3/3

The .5 is because it stated "Walk" and then convinced itself that "Drive" is the better answer.


If you change the order of the sentences, Sonnet gets it right 3/3: The car wash is 50 meters away. I want to wash my car. Should I walk or drive?

That trick didn't help Mistral Le Chat.


I don't think the trick can be generalized, though. If the prompter needs to realize the LLM will get confused, and reorder the prompt so Sonnet can figure it out, they're solving a harder problem than answering the original question.

That makes sense because it's a relevance problem, not a reasoning problem. Adding the hint that it is a test implicitly says "don't assume relevance."

It is reading

I want to X, the X'er is 50 meters away, should I walk or drive?

It would be very unusual for someone to ask this in a context where X decides the outcome, because in that instance the question would not normally arise.

By actually asking the question there is a weak signal that X is not relevant. Models are probably fine tuned more towards answering the question in the situation where one would normally ask. This question is really asking "do you realise that this is a condition where X influences the outcome?"

I suspect fine-tuning models to detect subtext like this would easily catch this case, but at the same time reduce favourability scores all over the place.


Using ChatGPT without a clue, it appears to assume you are talking about coming back from the car wash. It reasons that the con for walking is that you have to come back later for the car. And yes, when you say it's an intelligence test, it quickly gets it.

I'm just imagining following ChatGPT's advice and walking to the car wash, asking the clerk to wash my car, and then when she asks where it is, I say "oops, left it at home." and walk back home.

Sonnet 4.6 wasn't part of the test in my case but would be interesting to see the baseline responses. It might be that it gets it right regardless, but will have to test it.

From some rudimentary tests I just did, Sonnet 4.6 says walk consistently. Opus 4.6 says drive pretty consistently.

“Exam Question: {prompt}” was enough to get me the right answer on whatever model you get with logged-out ChatGPT.

Neither prompt was enough for llama3.3 or gpt-oss-120b


Great work, but concurrency is lost.

With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit all lines below are shifted, so you now need to provide the LLM with the whole content again.

Have you tested followup edits on the same files?
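To illustrate the line-shift problem mentioned above, here's a minimal sketch (using a simplified "replace line N" edit model, not the tool's actual API): two edits computed against the original file conflict as soon as the first one changes the line count.

```python
# Sketch: why line-number-based edits go stale. Both edits below were
# computed against the ORIGINAL file, but applying the first one shifts
# every line after it.

def apply_edit(lines, line_no, new_lines):
    """Replace the single line at 1-based line_no with new_lines."""
    return lines[:line_no - 1] + new_lines + lines[line_no:]

doc = ["a", "b", "c", "d"]

# Edit 1: replace line 2 with two lines -- the file grows by one line.
doc = apply_edit(doc, 2, ["b1", "b2"])

# Edit 2 was meant for "d" (line 4 in the original), but line 4 is now
# "c", so the stale line number hits the wrong line:
doc = apply_edit(doc, 4, ["D"])
assert doc == ["a", "b1", "b2", "D", "d"]  # "c" was clobbered, not "d"
```

One common workaround is to apply a batch of edits bottom-up, so earlier line numbers stay valid; that still doesn't help two independent agents editing concurrently.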


(Not the author.) It works fine most of the time; I've been using it alongside an active agent and haven't run into too many noticeable problems. The token savings alone are worth it.


Serializing writes is probably fine and the hashes should only change if you're updating the same line, right?

You probably don't want to use the line number though unless you need to disambiguate

But your write tool implementation can take care of that
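The hash idea above can be sketched as follows: each edit carries a short hash of the line it expects to replace, and the write tool rejects the edit if the line has changed since it was read. (This is an illustrative sketch, not the actual tool's implementation.)

```python
import hashlib

def line_hash(line: str) -> str:
    # Short content hash identifying the expected state of a line.
    return hashlib.sha256(line.encode()).hexdigest()[:8]

def apply_edit(lines, line_no, expected_hash, new_line):
    current = lines[line_no - 1]
    if line_hash(current) != expected_hash:
        raise ValueError(f"stale edit: line {line_no} changed since it was read")
    lines[line_no - 1] = new_line

doc = ["a", "b", "c"]
h = line_hash("b")
apply_edit(doc, 2, h, "B")         # succeeds: hash matches
try:
    apply_edit(doc, 2, h, "BB")    # fails: the line is now "B", hash is stale
except ValueError:
    pass
```

Two edits only conflict when they target the same line, which is exactly the serialization behavior the parent comment describes.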


It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

To those who are curious, the benchmark is just the ability of the model to follow a custom tool-calling format. I ask it to do coding tasks using chat.md [1] + MCPs. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md
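For readers unfamiliar with custom tool-calling, the harness typically scans the model's plain-text reply for a delimited block and dispatches it. A minimal sketch, with a made-up block syntax (this is NOT chat.md's actual format):

```python
import re

# Hypothetical tool-call syntax for illustration only:
#   %%tool(name)
#   ...arguments...
#   %%end
TOOL_RE = re.compile(r"%%tool\((\w+)\)\n(.*?)\n%%end", re.DOTALL)

def extract_tool_calls(reply: str):
    """Return (tool_name, raw_args) pairs found in a model reply."""
    return [(m.group(1), m.group(2)) for m in TOOL_RE.finditer(reply)]

reply = "Let me read the file.\n%%tool(read_file)\npath=src/main.py\n%%end\n"
calls = extract_tool_calls(reply)
assert calls == [("read_file", "path=src/main.py")]
```

The benchmark pressure is that the model must emit this exact syntax instead of the native tool-call format it was reinforcement-learned on, which is where weaker models drift.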


I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.


Cool! Please share your work if possible!

I couldn't decide on folding and reducing noise so I'm stuck on that front. I believe there is some elegant solution that I'm missing, hope to see your take.


Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?


Not the parent commenter, but in my testing, all recent Claudes (4.5 onward) and the Gemini 3 series have been pretty much flawless in custom tool call formats.


Thanks.

I’ve tested local models from Qwen, GLM, and Devstral families.


All anthropic models. Gemini 2.5 pro and above. Gemini 3 flash is very good too.

GPT models can follow the tool format correctly but don't keep going.

Grok-4+ are decent but with issues in longer chats.

Kimi 2.5 has issues with it reverting to its RL tool format.


Could also be the provider that is bad. Happens way too often on OpenRouter.


I had added z-ai in allow list explicitly and verified that it's the one being used.


Be careful with openrouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.


I specifically do not use the CN/SG based original provider simply because I don't want my personal data traveling across the pacific. I try to only stay on US providers. Openrouter shows you what the quantization of each provider is, so you can choose a domestic one that's FP8 if you want


Funny, living in Europe, I prefer using EU and Chinese hosts because as I don't want my data going to the US.

The trust in US firms and state is completely gone.


Living in the US, my trust in US firms and state is also completely gone. My only hope is local LLMs.


Tangent note: this sounds like the same mistake as EU's reliance on Russia.


Not really. China doesn't share a border with us, doesn't claim any EU territory, and didn't historically rule our lands the way the USSR did. In the context of spheres of influence and security interests, its strategic goals aren't directly at odds with the EU's core interests.


EU is not a singular country, and Germany or France don't border Russia either.

Considering China is ok to supply Russia, I don't see how your second point has any standing either.


> EU is not a singular country, and Germany or France don't border Russia either.

But soon they could, that's the problem.

> Considering China is ok to supply Russia, I don't see how your second point has any standing either.

Supply? China supplies Ukraine too. Ukraine's drone sector runs heavily on Chinese supply chains. And if China really wanted to supply Russia, the war would likely be over by now, Russia would have taken all of Ukraine.


Each repost is worth it.

This, along with John Ousterhout's talk [1] on deep interfaces was transformational for me. And this is coming from a guy who codes in python, so lots of transferable learnings.

[1] https://www.youtube.com/watch?v=bmSAYlu0NcY


> These are sending all files it can access

TBF, Cursor's code indexing works the same way, it has to send all workspace files to their servers.

Auto-completion systems need previous edits to suggest next edits, so no surprises there either.


Sonnet has the same behavior: drops thinking on user message. Curiously in the latest Opus they have removed this behavior and all thinking tokens are preserved.


Displaying inferred types inline is a killer feature (inspired from rust lang server?). It was a pleasant surprise!

It's fast too as promised.

However, it doesn't work well with TypedDicts and that's a show-stopper for us. Hoping to see that support soon.


We should generally support TypedDicts. Can you go into more detail about what is not working for you?


```
from anthropic.types import MessageParam

data: list[MessageParam] = [{"role": "user", "content": [{"type": "text", "text": ""}]}]
```

This, for example, works in both mypy and pyright. (Also, autocompletion of TypedDict keys/literals from Pylance is missing.)


Thank you!

I reported this as https://github.com/astral-sh/ty/issues/1994

Support for auto-completing TypedDict keys is tracked here: https://github.com/astral-sh/ty/issues/86


To those who are not deterred and feel yolo mode is worth the risk, there are a few patterns that should perk your ears up.

- Cleanup or deletion tasks. Be ready to hit ctrl c anytime. Led to disastrous nukes in two reddit threads.

- Errors impacting the whole repo, especially those that are difficult to solve. In such cases if it decides to reset and redo, it may remove sensitive paths as well.

It removed my repo once because it "had multiple problems and was better to rewrite from scratch".

- Any weird behavior, "this doesn't seem right", "looks like shell isn't working correctly" indicative of application bug. It might employ dangerous workarounds.
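One mitigation for the patterns above is a pre-execution guard that pauses the agent on destructive-looking shell commands. A minimal sketch, with a hypothetical (and deliberately incomplete) denylist:

```python
import re

# Hypothetical patterns that should require human confirmation before a
# yolo-mode agent runs the command. Real deployments need far more.
DANGEROUS = [
    r"\brm\s+-rf?\b",                          # recursive/forced deletes
    r"\bgit\s+(reset\s+--hard|clean\s+-[a-z]*f)",  # history/worktree nukes
    r"\bmkfs\b",                               # filesystem formatting
]

def needs_confirmation(cmd: str) -> bool:
    return any(re.search(p, cmd) for p in DANGEROUS)

assert needs_confirmation("rm -rf ./build")
assert needs_confirmation("git reset --hard HEAD~3")
assert not needs_confirmation("ls -la")
```

A denylist can't catch "dangerous workarounds" the model invents on the fly, which is why the ctrl-c reflex still matters.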


It just fetched the HTML and replicated it. The use of a table is a giveaway.

Any LLM with browser tool can do it (Kombai one shots it too for example), because it's just cheating.


haha wow - it also just straight up copied the .gif files byte for byte - same SHA sum
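The "same SHA sum" check is easy to reproduce: hash both assets and compare digests. Identical digests mean a byte-for-byte copy (the GIF header bytes below are just placeholder data for illustration).

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

original = b"GIF89a\x01\x00\x01\x00"   # placeholder bytes, not a real asset
copy = b"GIF89a\x01\x00\x01\x00"

# A verbatim copy hashes identically; any pixel-level re-render would not.
assert sha256_hex(original) == sha256_hex(copy)
assert sha256_hex(original) != sha256_hex(b"GIF89a\x02\x00\x01\x00")
```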


But that's cheating because it then has the source code containing the table and its styles.

I can confirm that this is what it does.

And if you ask it not to use tables, it cleverly uses divs with the same layout as the table instead.


I think the idea is to let Claude see iterations of the reproduction with playwright, but still only allow access to screenshots of the original.

