To set an HP on Linux, Folly just does a relaxed load of the src pointer, a release store of the HP, a compiler-only barrier, and an acquire re-load of src. (This prevents the compiler from reordering the 2nd load before the store, right? But to my understanding it does not prevent a hypothetical CPU reordering of the 2nd load before the store, which seems potentially problematic!)
Then on the GC/reclaim side of things, after protected object pointers are stored, it does a more expensive barrier[0] before acquire-loading the HPs.
I'll admit, I am not confident I understand why this works. I mean, even on x86, loads can be reordered before earlier program-order stores. So it seems like the 2nd check on the protection side could be ineffective. (The non-Linux portable version just uses an atomic_thread_fence SeqCst on both sides, which seems more obviously correct.) And if they don't need the 2nd load on Linux, I'm unclear on why they do it.
Ah, yes, it uses an asymmetric barrier trick. Basically it promotes the compiler barrier to a full barrier. It makes sense for throughput if one side is executed significantly more often than the other, like in this case. The cost is latency spikes.
I don't yet understand how the other side promotes a compiler barrier to a full barrier, but I'll take your word for it and try to read more about it later. :-)
"promoting" is just a short-hand, what happens is a bit more complicated.
First of all, remember that the corresponding membar on the collector side would only synchronize with the last membar executed on the mutator thread. So if the collector executes significantly less often than the mutator, all the executed mutator membars except the last one are wasted overhead. So ideally we want to elide all mutator membars except those that are actually needed.
What actually happens is that the collector thread remotely executes some code (either directly via a signal or indirectly via mprotect or sys_membarrier) on the mutator thread that executes the #StoreLoad on its behalf. Sending the required inter-processor interrupt is very expensive, but this is ideally offset by only doing it when truly required.
You can model[1] this as a signal handler executing on the mutator thread that issues an actual atomic_thread_fence to synchronize with the collector, while the mutator itself only needs an atomic_signal_fence (i.e. a compiler barrier) to synchronize with the signal handler.
[1] even if this is not necessarily what happens when using mprotect or sys_membarrier.
Thanks for explaining the details. In my application though (millions of TPS executing in an MVCC system) I just can't wait possibly tens of ms for membarrier(2) to return: way too much garbage could accumulate in the meantime. From my POV this isn't much better than EBR in terms of low/deterministic latency (it is better in terms of fault-tolerance, if you have out-of-proc clients, but I can reliably detect crashed clients anyway via a Unix domain stream socket and clean up their garbage for them).
Yeah, kinda, although this is pretty much the entire discussion on how the asymmetric fence works:
> The slow path can execute its write(s) before making a membarrier syscall. Once the syscall returns, any fast path write that has yet to be visible (hasn’t retired yet), along with every subsequent instruction in program order, started in a state where the slow path’s writes were visible.
(I've actually seen this blog post before, but did not remember this part in detail.)
[0]: https://github.com/facebook/folly/blob/main/folly/synchroniz...
(This uses either mprotect to force a TLB shootdown on the CPUs running the process, or the newer Linux membarrier syscall if available.)