> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMOV, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such as a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the conditional, and blend them with a 2D dot product (against the compare result and its complement). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler would otherwise produce real branch instructions.
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but isn’t the GPU itself just executing a mostly linear set of instructions? Or is my understanding completely off? I guess I don’t really understand what it’s doing, especially for loops.
The primary concern is usually over the masking you’re talking about, the issue being simply that you’re proportionally cutting down the number of threads doing useful work. Using Nvidia terminology, if only one thread in a warp is active during a branch, the GPU throughput is 32x slower than it could be with a full warp.
Not all GPU branches are compiled in a straight line without jumps, so branching on a GPU does sometimes share the same instruction cache churn that the CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And GPUs waiting on memory loads, whether it’s to fill the icache or anything else, are up to 32x more costly than CPU stalls, since all threads in the warp stall.
Loops are just normal branches, if they’re not unrolled. The biggest question with a loop is whether all threads repeat the same number of times, because if not, the threads that exit the loop early have to wait until the last thread is done. You can imagine what that does to perf if there’s a small number of long-tail threads.