Well, GPUs don't have branch prediction or out-of-order execution, so you need some other way to keep the execution units (mainly the floating-point units) busy.
A warp is really nothing more than a way to keep an SMX (and the computational units it controls) supplied with work on as many clock cycles as possible. You need some way of hiding FPU pipeline and memory latency.
> All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code)
It's not that different from x86 hyperthreading, just with more hardware threads. The pipelined execution units are fed each clock cycle by the core, so multiple FP operations are in flight in parallel; otherwise CPUs wouldn't hit their performance numbers either.
Sure, an SMX can also switch between warps in a manner similar to hyperthreading on x86, but that doesn't mean it executes only a single warp at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at ~750 MHz and has a peak single-precision performance of 4.29 TFLOPS. If each SMX could only execute one warp at a time, it could reach at most 15 (number of SMXs) × 32 (warp width) × 750 MHz (clock) × 2 (two flops per FMA) = 720 GFLOPS.
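The back-of-envelope math above can be sketched in a few lines (numbers taken from this comment, not from a spec sheet):

```python
# One-warp-per-SMX peak flops estimate for a Tesla K40 (GK110),
# using the figures quoted in the comment above.
num_smx = 15          # SMX units on the chip
warp_width = 32       # threads per warp
clock_hz = 750e6      # ~750 MHz
flops_per_fma = 2     # one fused multiply-add counts as 2 flops

one_warp_peak = num_smx * warp_width * clock_hz * flops_per_fma
print(one_warp_peak / 1e9)  # 720.0 GFLOPS, far below the 4290 GFLOPS peak
```

The gap between 720 GFLOPS and the advertised 4.29 TFLOPS is the point: each SMX must be executing several warps concurrently.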
The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. Each SMX has 64 DP cores, and its warp schedulers can issue four warps per cycle, so it can have two warps executing double-precision instructions at the same time. But that number is not very interesting; the memory bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, compute its arithmetic intensity, and you have an upper bound on your performance, assuming you are memory bound of course.
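That bandwidth-based bound is just the roofline model. A minimal sketch, using the K40 figures quoted above (the example arithmetic intensity of 0.125 flops/byte is a hypothetical kernel, not a measured one):

```python
# Roofline-style upper bound: achievable flops/s is capped by
# min(peak compute, memory bandwidth * arithmetic intensity).
peak_dp_flops = 1.4e12   # ~1.4 TFLOPS double precision (K40)
bandwidth = 288e9        # 288 GB/s (GK110)

def perf_bound(arithmetic_intensity):
    """Upper bound on flops/s for a kernel with the given
    flops-per-byte ratio."""
    return min(peak_dp_flops, bandwidth * arithmetic_intensity)

# e.g. a kernel doing one flop per 8-byte double loaded: 1/8 flops/byte
print(perf_bound(0.125) / 1e9)  # 36.0 GFLOPS -> heavily memory bound
```

With intensity that low, the compute peak is irrelevant; only past ~4.9 flops/byte (1.4e12 / 288e9) does the K40 become compute bound in double precision.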