To the contrary, this is my biggest complaint about CUDA. It looks like a nice abstraction from a programmer's perspective, but it isn't a good one, because it doesn't match the reality of the hardware. The truth is that there is a heterogeneous arena of memory kinds, access patterns, and sizes, with drastically different performance trade-offs for different tasks. (This isn't even about diverse hardware; every modern GPU has this complexity.) CUDA oversimplifies this, which gives solutions opaque performance cliffs, and you end up having to understand a lot of the underlying complexity anyway and then awkwardly back-propagate it up into CUDA's ill-fitting APIs to get decent performance. It's a false sense of simplicity that ultimately creates more work and complexity.
Contrast that with something like WebGPU, where the notions of GPU buffers, textures, pipelines, and command queues map well onto what actually happens in hardware, and it's much simpler to get predictable performance.
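To make that concrete, here's a rough sketch of what the explicitness looks like in WebGPU's JavaScript API (the sizes and the particular usage-flag combinations are just illustrative): you declare up front where a buffer lives and how it will be used, and CPU readback goes through a separate mappable staging buffer instead of a shared pointer that pretends to be valid on both sides.

```javascript
// GPUBufferUsage flag values from the WebGPU spec, inlined as a fallback
// so the descriptors can be inspected outside a browser too.
const USAGE = globalThis.GPUBufferUsage ?? {
  MAP_READ: 0x0001, COPY_SRC: 0x0004, COPY_DST: 0x0008, STORAGE: 0x0080,
};

// Device-local buffer that a compute shader reads and writes.
const storageDesc = {
  size: 1024,
  usage: USAGE.STORAGE | USAGE.COPY_SRC,
};

// Separate, mappable staging buffer for CPU readback. There is no single
// pointer that is valid on both the CPU and the GPU; you copy explicitly.
const readbackDesc = {
  size: 1024,
  usage: USAGE.MAP_READ | USAGE.COPY_DST,
};

// In a browser it would continue roughly like this:
//   const adapter = await navigator.gpu.requestAdapter();
//   const device = await adapter.requestDevice();
//   const storage = device.createBuffer(storageDesc);
//   const readback = device.createBuffer(readbackDesc);
//   ...dispatch compute work, then in a command encoder:
//   encoder.copyBufferToBuffer(storage, 0, readback, 0, 1024);
//   ...submit, then: await readback.mapAsync(GPUMapMode.READ);
```

The point is that the copy from device-local to mappable memory is something you write, so its cost is visible in your code rather than hiding behind a pointer dereference.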
Now I totally agree there needs to be more work done to provide simpler abstractions on top of WebGPU/Vulkan/Metal/DirectX for certain common patterns of work. But pretending you have a pointer to a blob of GPU memory isn't the way.
This talk gives a great overview of the GPU compute landscape: https://www.youtube.com/watch?v=DZRn_jNZjbw