Hacker News

> In general, I’d say that GPT-4 was strongest on true/false questions and (ironically!) conceptual questions—the ones where many students struggled the most. It was (again ironically!) weakest on calculation questions, where it would often know what kind of calculation to do but then botch the execution.

It'd be great if chain-of-thought / show-your-work type prompts became the default for anything involving complex, multi-step calculations or logic.

GPT-4 would have almost certainly gotten a higher score on the calculation questions if those methods were used.
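A "show your work" default could be as simple as wrapping every calculation question in a chain-of-thought system prompt before sending it to the model. A minimal sketch (the helper name and prompt wording are my own, not any official API):

```python
def with_chain_of_thought(question: str) -> list[dict]:
    """Wrap a question in chain-of-thought instructions.

    Returns messages in the common {"role", "content"} chat format;
    the instruction text is a hypothetical example, not a known-best prompt.
    """
    system = (
        "Solve the problem step by step. Write out every intermediate "
        "calculation explicitly before stating the final answer."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]


messages = with_chain_of_thought("A car travels 300 km in 4 hours. Average speed?")
```

The resulting list can then be passed to whatever chat-completion endpoint you use.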



Eh, even when asked specifically to show its work, GPT-4 still frequently makes calculation errors. It's just one of the limitations of current LLMs, and it can easily be solved by integration with Wolfram, or even just a basic calculator.
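The "basic calculator" idea is straightforward: instead of letting the model do arithmetic in text, have it emit an expression and evaluate that deterministically. A sketch of such a tool, using Python's `ast` module to avoid `eval` on untrusted model output (the function name and supported-operator set are my own choices):

```python
import ast
import operator

# Binary operators the tiny calculator "tool" will accept.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def calculate(expression: str) -> float:
    """Safely evaluate a plain arithmetic expression like "2 + 3 * 4".

    Walks the parsed AST and only permits numeric literals, the
    operators in _OPS, and unary minus, so arbitrary code can't run.
    """
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


print(calculate("2 + 3 * 4"))  # prints 14
```

In a tool-use setup, the LLM would decide *what* to compute and hand the expression string to a function like this, so the execution step is no longer the model's job.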



