Fair I cleaned up the wording with ChatGPT with my review prompt. The substance matters more than the style. If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.
> If a model flips 3/10 times on a trivial constraint, that’s a reliability issue, not a reasoning ceiling.
I have reviewed your previous comments, and you have consistently written: that's instead of that’s. So what I read is still some LLM output, even though I think there is some kind of human behind the LLM.