This is probably the greatest one-time AI "Benchmark" ever made. The foundation model companies have been gaming traditional benchmarks for years, so no one can really map those numbers onto real-world experience. The car wash test, on the other hand, tells me what kind of intelligence I can expect.
Yes. Opus could do a lot better, but fails a lot because it doesn't respect the given formatting instructions/output format.
I could modify the tests to emphasize the requirements, but then what's the point of a test? In real life, we expect the AI to do what we ask, especially for agentic use cases or in n8n, where a slightly wrong output makes the entire workflow fail.
Interesting. This has to do with the "instruction following" aspect, right? I saw that GPT models score a lot higher than Claude on those benchmarks.
I haven't done my own tests, but I did notice a lot of models score very low there. You'll give them specific instructions and they'll ignore them and just pattern match to whatever format they saw most commonly during training.
Yup, for example I tell Claude to return ONLY the answer as "LEFT" or "RIGHT".
And it outputs:
**RIGHT**
With markdown bold formatting... This is probably fine in a chat app, but in a workflow it breaks things as soon as you have a check like if(response === 'RIGHT')...
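A minimal sketch of how that check could be made more tolerant, assuming the reply arrives as a plain string. `normalizeChoice` is a hypothetical helper, not part of n8n or any SDK; it strips markdown emphasis, quotes, and trailing punctuation, then accepts only the two allowed labels:

```javascript
// Hypothetical helper: reduce a model reply to "LEFT" or "RIGHT", or null.
// Assumes the reply is a string that may be wrapped in markdown emphasis.
const ALLOWED = new Set(["LEFT", "RIGHT"]);

function normalizeChoice(reply) {
  const cleaned = String(reply)
    .trim()
    // Strip wrapping markers (*, _, backticks, quotes) and trailing . or !
    .replace(/^[*_`"'\s]+|[*_`"'\s.!]+$/g, "")
    .toUpperCase();
  // Anything other than the two labels is rejected instead of guessed at.
  return ALLOWED.has(cleaned) ? cleaned : null;
}

console.log(normalizeChoice("**RIGHT**")); // RIGHT
console.log(normalizeChoice("I think it is right")); // null
```

The strict comparison would then run on the normalized value, and a `null` could be retried or flagged rather than silently routed down the wrong branch. This only papers over the formatting problem; it doesn't change the fact that the model ignored the instruction.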
For me it's interesting because no normal person I know would ever inject "because it's better for the environment" into anything so small-scale. So not only does it show they suck, it shows how easy it is to inject side-ideology into simple exchanges.
You don’t know enough people, then. There are a lot of environmentally conscious people who would absolutely first think “because it is close we should walk” and then follow up with the logical conclusion that you can’t walk to wash your car. Many people communicate by sharing their thinking process, and I can think of many who would share their ideology as it pertains to a question like this. A pragmatic environmentalist (hopefully that is all of them) would know that their ideology isn’t consequential here but could certainly mention it. After all, you may need to drive your car to the car wash to wash it, but do you need to wash it? Are the chemicals used by the car wash harmful? Are there better ways to keep a car maintained?
Referring to "the normal people you know" is purely anecdotal evidence and can't be used to infer anything at all about "side-ideology". Perhaps you only know people that don't care about the environment?
The majority of people I know care about the environment, but my point is that they would never inject a phrase like that into a quick exchange about going to wash the car 50m away. In wanting to be a pure heart you missed the actual point.
Yea, of course they wouldn't inject that when going to a car wash.
If the question was: "I want to go to a cafe 50m away. Should I walk or drive?" I would hope that all of my friends would answer quite a bit more pointedly than the LLMs: "Walk you lazy son of a ..., why are you even asking?".
Considering that, I'd say that most LLMs are being quite nice.