You are right: there is appreciable overhead in pipelining, and the benefit is not quite as large as I claimed. I am guilty of an age-old crime: simplifying a complex subject for the layman and skipping real details in the process.
But you can be certain that AMD and Intel do not design 20+ stage pipelines for some measly 10% performance uplift. The overhead of the pipeline infrastructure is nowhere near the size of the performance gain. Consider that Haswell sustains an IPC of around 2. Without pipelining, each instruction would have to pass through all of that logic before the next one could start, so with a ~20 stage pipeline they are indeed far outstripping the performance of "Haswell minus pipelining".
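To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It is a textbook-style model, not a measurement of Haswell: the function name `pipeline_speedup`, the per-stage latch overhead, and the `efficiency` factor (the fraction of cycles not lost to stalls and flushes) are all my own illustrative assumptions.

```python
def pipeline_speedup(logic_delay, stages, latch_overhead, efficiency=1.0):
    """Throughput of a pipelined design relative to the same logic unpipelined.

    logic_delay    -- total combinational delay of one instruction (arbitrary units)
    stages         -- number of pipeline stages the logic is split into
    latch_overhead -- delay added by each stage's registers (assumed constant)
    efficiency     -- fraction of cycles doing useful work (assumed, 0 < e <= 1)
    """
    # Unpipelined: one instruction finishes every (full logic delay + one latch).
    unpipelined_time = logic_delay + latch_overhead
    # Pipelined: the clock period is one stage's share of the logic plus its latch.
    cycle_time = logic_delay / stages + latch_overhead
    # Stalls and flushes stretch the average time per completed instruction.
    pipelined_time = cycle_time / efficiency
    return unpipelined_time / pipelined_time


# Illustrative numbers only: 20 units of logic, latches costing 25% of a stage.
for eff in (1.0, 0.5):
    s = pipeline_speedup(logic_delay=20.0, stages=20, latch_overhead=0.25, efficiency=eff)
    print(f"efficiency={eff:.1f}  speedup={s:.1f}x")
```

Even with the latch overhead charged on every stage and half the cycles thrown away, this toy model stays several times faster than the unpipelined case, which is the gap between "overhead exists" and "the gain is only 10%".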
As for the super-deep pipeline in the P4, the consensus I hear is that Intel expected clock frequency to keep scaling, and as such the P4 was a forward-looking architecture designed to scale to 10 GHz and beyond.