Are there high-performance matrix multiplication libraries, e.g. for fluid-dynamics simulations, that are implemented using persistent data structures?
Are there any in which multiple threads mutate overlapping sections of memory? How exactly do you think a GPU matrix multiply violates my initial assertion?