FWIW, I see a critical difference between OP and my reward hacking examples: OP is an example of how reward-shaping can lead to premature convergence to a local optimum, which is indeed one of the biggest risks of doing reward-shaping - it can slow down reaching the global optimum rather than speeding it up, compared to the 'true' reward function of just getting a reward for eating a sheep and leaving speed implicit - but the global optimum nevertheless remained what the researchers intended. After (much more) further training, the wolf agent learned not to suicide and began hunting sheep efficiently. So, amusing, and a waste of compute, and a cautionary example of how not to do reward-shaping if you must do it, but not a big problem as these things go.
Reward hacking is dangerous because the global optimum turns out to be different from what you wanted, and the smarter and faster and better your agent, the worse it becomes, because it gets better and better at reaching the wrong policy. It can't be fixed by minor tweaks like training longer, because that just makes it even more dangerous! That's why reward hacking is a big issue in AI safety: it is a fundamental flaw in the agent, which is easy to introduce unawares, and which will not manifest itself with dumb or slow agents - but the more powerful the agent, the more likely the flaw is to surface, and the more dangerous the consequences become.
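To make the distinction concrete, here's a toy sketch (my own illustration, not any of the actual experiments discussed): three hypothetical wolf policies scored on sheep eaten and steps survived. A shaping term (small time penalty) changes the reward landscape but preserves the intended optimum; a misspecified reward (penalizing time only) moves the optimum itself, which is the reward-hacking failure mode.

```python
# Toy policies, each summarized by sheep eaten and steps taken.
# (Illustrative numbers, not from any real experiment.)
policies = {
    "suicide_fast":   {"sheep": 0, "steps": 1},
    "hunt_slowly":    {"sheep": 3, "steps": 100},
    "hunt_efficient": {"sheep": 5, "steps": 60},
}

def true_reward(p):
    return p["sheep"]                      # what we actually want

def shaped_reward(p):
    return p["sheep"] - 0.01 * p["steps"]  # mild time penalty as shaping

def hacked_reward(p):
    return -p["steps"]                     # misspecified: time penalty only

def best(reward_fn):
    return max(policies, key=lambda name: reward_fn(policies[name]))

print(best(true_reward))    # hunt_efficient
print(best(shaped_reward))  # hunt_efficient - shaping keeps the intended optimum
print(best(hacked_reward))  # suicide_fast  - the optimum itself is now wrong
```

A better optimizer only helps in the shaped case; in the hacked case it converges harder on the wrong policy, which is the asymmetry the above is pointing at.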
I think in some of your examples the global optimum might also have been the correct behaviour; it's just that the program failed to find it. For example, the robot learning to use a hammer: it's hard to believe that throwing the hammer was just as good as using it properly.