> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMOV, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such as a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the conditional, and blend them with a 2D dot product (against the compare result and its complement). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler would otherwise produce real branch instructions.
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but isn’t the GPU itself just executing a mostly linear set of instructions? Or is my understanding completely off? I guess I don’t really understand what it’s doing, especially for loops.
The primary concern is usually over the masking you’re talking about, the issue being simply that you’re proportionally cutting down the number of threads doing useful work. Using Nvidia terminology, if only one thread in a warp is active during a branch, the GPU throughput is 32x slower than it could be with a full warp.
Not all GPU branches are compiled in a straight line without jumps, so branching on a GPU does sometimes share the same instruction cache churn that the CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And GPUs waiting on memory loads, whether it’s to fill the icache or anything else, are up to 32x more costly than CPU stalls, since all threads in the warp stall.
Loops are just normal branches, if they’re not unrolled. The biggest question with a loop is whether all threads repeat the same number of times, because if not, the threads that exit the loop early have to wait until the last thread is done. You can imagine what that does to perf if there’s a small number of long-tail threads.