On modern hardware, it's also not clear that a special case that's faster from a purely CPU-oriented perspective is faster overall. Code that handles special cases makes the binary larger, and larger code can run slower because of the memory hierarchy: more instructions compete for the same instruction cache, so there are more misses.
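To make that concrete, here's a minimal sketch of what a special case tends to look like in code. The function names (`sum_strided`, `sum_strided_special`) are made up for illustration and don't come from any real codebase. Both functions return the same result; the second just carries an extra branch and a second loop body, which is exactly the kind of added code size being described.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Generic version (hypothetical example): one loop handles every stride.
std::int64_t sum_strided(const std::vector<std::int32_t>& data, std::size_t stride) {
    assert(stride > 0);  // a stride of zero would never advance
    std::int64_t total = 0;
    for (std::size_t i = 0; i < data.size(); i += stride) {
        total += data[i];
    }
    return total;
}

// Special-cased version (hypothetical example): adds a branch and a second
// loop body for the contiguous case. Same result, more code in the binary.
std::int64_t sum_strided_special(const std::vector<std::int32_t>& data, std::size_t stride) {
    assert(stride > 0);
    if (stride == 1) {
        std::int64_t total = 0;
        for (std::int32_t value : data) {
            total += value;
        }
        return total;
    }
    return sum_strided(data, stride);
}
```

Whether the extra path pays off is something you'd have to measure in the real program, not in a microbenchmark where everything fits in cache anyway. Multiply this pattern across a hot code path and the special cases may cost more in cache misses than they save in cycles.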
There's a CppCast interview with one of the people who worked on the Intel C++ Compiler, where he talks about what they do to get it producing such fast binaries. From what he said, it sounds like the vast majority of the unique optimizations they put into the compiler were about minimizing cache misses, not CPU cycles.