Well, GPUs don't have branch prediction or out-of-order execution, so you need some other way to keep the execution units (mainly the floating-point units) busy.
A warp is really nothing more than a way to keep an SMX (and the computational units it controls) supplied with work on as many clock cycles as possible. You need some way of hiding FPU pipeline and memory latency.
> All warps are running in parallel (otherwise you won't get the performance numbers) and each has its own control path (actually each has its own code)
It's not that different from x86 hyperthreading, just with more hardware threads. The pipelined execution units are fed each clock cycle by the core, so multiple FP operations are in flight in parallel; otherwise CPUs wouldn't hit their performance numbers either.
Sure, an SMX can also switch between warps in a manner similar to hyperthreading on x86, but that doesn't mean it executes only a single warp at a time. Consider the Tesla K40, a GK110 with 15 SMXs. It runs at ~750 MHz and has a peak single-precision performance of 4.29 TFLOPS. If each SMX could only execute one warp at a time, it could reach at most 15 (number of SMXs) × 32 (warp width) × 750 MHz (clock) × 2 (two flops per FMA) = 720 GFLOPS.
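The back-of-envelope math above can be sketched in a few lines (numbers taken from this comment, not from a spec sheet):

```python
# One-warp-per-SMX peak flops estimate for a Tesla K40 (GK110),
# using the figures quoted in the comment above.
num_smx = 15          # SMX units on the chip
warp_width = 32       # threads per warp
clock_hz = 750e6      # ~750 MHz
flops_per_fma = 2     # one fused multiply-add counts as 2 flops

one_warp_peak = num_smx * warp_width * clock_hz * flops_per_fma
print(one_warp_peak / 1e9)  # 720.0 GFLOPS, far below the 4290 GFLOPS peak
```

The gap between 720 GFLOPS and the advertised 4.29 TFLOPS is the point: each SMX must be executing several warps concurrently.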
The Tesla K40 has a peak double-precision performance of ~1.4 TFLOPS. Each SMX has 64 DP cores, and its warp schedulers can issue four warps per cycle, so it can have two warps executing double-precision instructions at the same time. But that number is not very interesting; the memory bandwidth, on the other hand, is. A GK110 has 288 GB/s: take your code, compute its arithmetic intensity, and you have an upper bound on your performance, assuming you are memory bound of course.
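That bandwidth-based bound is just the roofline model. A minimal sketch, using the K40 figures quoted above (the example arithmetic intensity of 0.125 flops/byte is a hypothetical kernel, not a measured one):

```python
# Roofline-style upper bound: achievable flops/s is capped by
# min(peak compute, memory bandwidth * arithmetic intensity).
peak_dp_flops = 1.4e12   # ~1.4 TFLOPS double precision (K40)
bandwidth = 288e9        # 288 GB/s (GK110)

def perf_bound(arithmetic_intensity):
    """Upper bound on flops/s for a kernel with the given
    flops-per-byte ratio."""
    return min(peak_dp_flops, bandwidth * arithmetic_intensity)

# e.g. a kernel doing one flop per 8-byte double loaded: 1/8 flops/byte
print(perf_bound(0.125) / 1e9)  # 36.0 GFLOPS -> heavily memory bound
```

With intensity that low, the compute peak is irrelevant; only past ~4.9 flops/byte (1.4e12 / 288e9) does the K40 become compute bound in double precision.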