Looks like you're mixing up two things when testing: the correct answer and format following. If you want both, why not use https://platform.claude.com/docs/en/build-with-claude/struct... ? If you don't care about the structure, why penalise the correct answers? In realistic usage people don't say "I really care about the format a lot... but not enough to guarantee it".
Because the format can't also be strictly defined via structured output, and you have to write it in plain words. Imagine you also have a field within your JSON, which also needs a specific format. It's AI, you don't want to write a 2000lines JSON schema to define what you need and how to parse it, that's the point of using AI instead of writing your own data extraction script.
Also, simply because a human would respect it properly. And it's quite clear what the request was.
Thanks for the suggestion to separate format following from correct answer, good idea, I'll think about it.
Still, some good AIs do it properly, and as expectedly, why would I change the tests specifically for Claude, which is basically the only one with this problem.
I use structured format in many of live AI systems, maybe my point was not clear.
For some tasks it's impossible to define a JSON schema. Let's say you want the message to end with "Thank you", in any language. Should I add in my schema 200 possible endings? What about all their variations and declinations in various languages?
Sometimes you have to define in natural language how you want the output to look like.