To set an HP on Linux, Folly just does a relaxed load of the src pointer, a release store of the HP, a compiler-only barrier, and an acquire re-load of src. (This prevents the compiler from reordering the 2nd load before the store, right? But to my understanding it does not prevent a hypothetical CPU reordering of the 2nd load before the store, which seems potentially problematic!)
Then on the GC/reclaim side of things, after protected object pointers are stored, it does a more expensive barrier[0] before acquire-loading the HPs.
I'll admit, I am not confident I understand why this works. I mean, even on x86, loads can be reordered before earlier program-order stores. So it seems like the 2nd check on the protection side could be ineffective. (The non-Linux portable version just uses an atomic_thread_fence SeqCst on both sides, which seems more obviously correct.) And if they don't need the 2nd load on Linux, I'm unclear on why they do it.
Ah, yes, it uses an asymmetric barrier trick. Basically it promotes the compiler barrier to a full barrier. It makes sense for throughput if one side is executed significantly more often than the other, like in this case. The cost is latency spikes.
I don't yet understand how the other side promotes a compiler barrier to a full barrier, but I'll take your word for it and try to read more about it later. :-)
"promoting" is just a short-hand, what happens is a bit more complicated.
First of all, remember that the corresponding membar on the collector side would only synchronize with the last membar executed on the mutator thread. So if the collector executes significantly less often than the mutator, all the executed mutator membars except the last one are wasted overhead. So ideally we want to elide all mutator membars except those that are actually needed.
What actually happens is that the collector thread remotely executes some code (either directly via a signal or indirectly via mprotect or sys_membarrier) on the mutator thread that executes the #StoreLoad on its behalf. Sending the required inter-processor interrupt is very expensive, but this is ideally offset by only doing it when truly required.
You can model[1] this as a signal handler executing on the mutator thread that issues an actual atomic_thread_fence to synchronize with the collector, while the mutator itself only needs an atomic_signal_fence (i.e. a compiler barrier) to synchronize with the signal handler.
[1] even if this is not necessarily what happens when using mprotect or sys_membarrier.
Thanks for explaining the details. In my application though (millions of TPS executing in an MVCC system) I just can't wait possibly tens of ms for membarrier(2) to return: way too much garbage could accumulate in the meantime. From my POV this isn't much better than EBR in terms of low/deterministic latency (it is better in terms of fault-tolerance, if you have out-of-proc clients, but I can reliably detect crashed clients anyway via a Unix domain stream socket and clean up their garbage for them).
Yeah, kinda, although this is pretty much the entire discussion on how the asymmetric fence works:
> The slow path can execute its write(s) before making a membarrier syscall. Once the syscall returns, any fast path write that has yet to be visible (hasn’t retired yet), along with every subsequent instruction in program order, started in a state where the slow path’s writes were visible.
(I've actually seen this blog post before, but did not remember this part in detail.)
[0]: https://github.com/facebook/folly/blob/main/folly/synchroniz...
(This uses either mprotect to force a TLB shootdown on the CPUs running the process, or the newer Linux membarrier syscall if available.)