Hacker News

> It does not seem that there is any significant correlation between memory size and allocation cost. Nor should there be - it's just bookkeeping, after all.

I guess this explains your lack of concern about memory pooling. I am going to have to do some research to see whether OpenCL does pooling natively, but I can assure you that this is not the behavior on CUDA devices. I can't really get your example to compile right now, but I'll look into it later.

Also, you are right that it is strange that allocations are so slow on CUDA.

But if it turns out that OpenCL is doing pooling behind the scenes, this is going to be an issue for you when you decide to do a CUDA backend, because the performance profile will start to be dominated by allocations.
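By pooling I mean something like the following sketch: a size-bucketed cache of freed blocks sitting in front of the expensive backend allocator. The `Pool` class is made up for illustration, and malloc/free again stand in for cudaMalloc/cudaFree so it runs anywhere.

```cpp
// Minimal sketch of a size-bucketed memory pool. Freed blocks are
// cached per exact size and reused, so the expensive backend
// allocator is only hit on a cache miss.
#include <cstdlib>
#include <map>
#include <vector>

class Pool {
    // Freed blocks, kept per exact request size.
    std::map<size_t, std::vector<void *>> free_lists_;

public:
    void *alloc(size_t size) {
        auto &list = free_lists_[size];
        if (!list.empty()) {       // reuse a cached block: cheap
            void *p = list.back();
            list.pop_back();
            return p;
        }
        return malloc(size);       // cache miss: hit the backend
    }

    void release(size_t size, void *p) {
        free_lists_[size].push_back(p);  // cache instead of freeing
    }

    ~Pool() {                      // return everything to the backend
        for (auto &kv : free_lists_)
            for (void *p : kv.second)
                free(p);
    }
};
```

Exact-size matching is crude, but it works well for iterative GPU code, which tends to allocate the same handful of buffer sizes over and over.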

> This would have the same effect as embedding assembly into C code, i.e. it basically makes everything nearby off-limits for optimisation.

This is true, but that would also be the case with an FFI. Being able to embed C code would just shorten that path a bit by not requiring the user to compile a code fragment to a .dll before invoking it.

> In fact, Futhark is a bit of a counter-reaction to the run-time-code-generating embedded languages that were popular in the FP community some years ago.

Hmmm... I had never heard about this. But then again, relatively speaking, I haven't been programming that long.

> Have you considered looking at something like Obsidian[2]? It is a much lower level functional language that directly exposes GPU concepts. While Obsidian is Haskell-specific, there is a successor called FCL[3] that is standalone. Unlike Futhark, these languages do not depend on aggressive and expensive compiler transformations, and therefore may be a better fit for embedding.

I can't find any documentation for Obsidian, but there is a paper on FCL in its GitHub repo, so I'll look at that.
