You are right, there is appreciable overhead in pipelining, and the benefit is not quite as powerful as I claimed. I am guilty of an age-old crime: simplifying a complex subject for the layman and skipping real details in the process.

But you can be certain that AMD and Intel do not design 20+ stage pipelines for some measly 10% performance uplift. The overhead of the pipeline infrastructure is small next to the performance gain. Consider that Haswell sustains an IPC of around 2 with a ~20 stage pipeline. An unpipelined design would have to finish each instruction completely before starting the next, so Haswell is indeed far outstripping the performance of "Haswell minus pipelining".
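To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It is a toy model, not a description of any real chip: the function name `pipeline_speedup`, the 5% per-stage latch overhead, and the assumption that logic splits evenly across stages are all mine, and it ignores hazards, branch mispredictions and cache misses entirely.

```python
def pipeline_speedup(stages, logic_delay=1.0, latch_overhead=0.05, ipc=1.0):
    """Throughput relative to an unpipelined design that finishes one
    instruction per `logic_delay`. All parameters are illustrative guesses."""
    # Each stage does 1/stages of the logic, plus a fixed latch/clocking cost.
    cycle_time = logic_delay / stages + latch_overhead
    # `ipc` instructions retire per cycle once the pipeline is full.
    return ipc * logic_delay / cycle_time


if __name__ == "__main__":
    for stages in (1, 5, 20, 31):
        print(stages, round(pipeline_speedup(stages), 2))
    # Same 20-stage pipeline, but retiring about 2 instructions per cycle.
    print("20 stages, IPC 2:", round(pipeline_speedup(20, ipc=2.0), 2))
```

Even with the overhead charged to every stage, the 20-stage case in this model comes out far ahead of the unpipelined one. The same model also shows diminishing returns: as the stage count grows, the latch overhead starts to dominate the cycle time, so the speedup can never exceed `logic_delay / latch_overhead`.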

As for the super-deep pipeline in the P4, the consensus I hear is that Intel expected frequency to keep scaling, so the P4 was a forward-looking architecture designed to scale to 10 GHz and beyond.