Bending pause times to your will with Generational ZGC

KingOfCoders · on March 16, 2024

Java has been pushing out new and better GC constantly the last decades.

I wish Go would invest in GC the same way Java does.

zigzag312 · on March 16, 2024

Same for C#. It's a little weird that it's used in games, but doesn't have a low latency GC.

neonsunset · on March 16, 2024

C# needs GC less desperately so than Java because of structs, stack allocated buffers, heavy use of all manners of object pooling and easy access to malloc and free (which Unity uses in the form of NativeArray).

You can achieve drastically lower allocation rate in C# (or no allocations at all if you really need to).

indolering · on March 16, 2024

There are similar strategies employed by high frequency traders using Java. Not as ergonomic as in C# (which I believe had this as a forethought) but interesting all the same.

pjmlp · on March 16, 2024

.NET has been designed since the beginning to handle C++, C# already had some of the features but not all of them, in some cases it was needed to emit MSIL, or use Managed C++ on .NET 1.0, or C++/CLI since .NET 2.0.

The improvements on C# 7 and later are exactly to avoid emiting MSIL manually, and also cut the dependency on C++/CLI, as it is Windows only.

rwmj · on March 16, 2024

Don't forget third parties like Azul. At one point they even had hardware assisted Java GC.

jeffbee · on March 16, 2024

You can try to explain to people that enabling Transparent Hugepages is going to win them 5-20% CPU throughput and they just won't believe you.

ladberg · on March 16, 2024

Some articles explaining huge pages for anyone curious (and hopefully people will trust that a high frequency trading firm has enough money on the line to get this right):

https://www.hudsonrivertrading.com/hrtbeat/low-latency-optim...

crotchfire · on March 16, 2024

Software people desperately want to believe that MMUs come at no cost whatsoever.

A few of them accept that they have a cost, but insist that it is totally predictable and deterministic.

You can't talk them out of these beliefs. I have tried. They just get very angry at you.

kaba0 · on March 16, 2024

Could you please expand a bit on this topic, or give me some pointers? I’m not intimately familiar with what happens at the OS layer.

mike_hearn · on March 16, 2024

The memory management unit (MMU) is the part of the CPU that creates virtual memory, i.e. a per-process mapping of addresses to physical addresses in the RAM chips. Every time a program accesses memory the addresses must be translated to physical addresses.

The work the MMU does is not free. The mappings between virtual and physical are stored in kernel memory in page tables, which are complex data structures pointed to by system registers. When the CPU executes an instruction that accesses memory (including the stack), it would - without caching - need to do a "page table walk". That means the program is briefly suspended whilst the CPU navigates what is basically a kind of large TreeMap to work out the memory mapping. Because that navigation means reading the page table memory quite a few times, it is slow.

Therefore only reason this scheme is feasible at all is because the CPU caches translations into an on-chip data structure called the translation lookaside buffer. It's kind of a LRU of mappings, with a very high hit rate.

You may have noticed in the past that context switches are slow. An RPC from one process to another is a lot slower than a function call within a process. One of the reasons for that is because the TLB cache has to be flushed when you change to a new address space, otherwise one process would end up reading RAM owned by another. Rebuilding the TLB takes time, so immediately after a context switch everything runs slower for a while until the TLB repopulates.

How large and deep are the page tables? That depends mostly on your page size. Traditionally, x86 machines have used 4kb pages. You can imagine that in a process with tens of gigabytes of data all scattered around the address space that requires a lot of pages and a very big page table as a result. With the move to ARM Apple moved their platforms to 16kb pages, which is part of where they get their great performance and power usage from. It's a tradeoff: bigger pages means more memory may get wasted on padding mappings that are smaller than that value.

Modern Intel/AMD systems support varying the page size, even within a process. This is sometimes called "huge pages" because page sizes can get up to the megabyte range. Obviously if your pages are a 2MB instead of 4KB then the page tables will get way smaller and simpler, meaning page table walks are way faster. Servers have lots of RAM and not very fragmented, so this can be a good tradeoff indeed. The initial version of the huge pages feature was explicit. The kernel can't know if an app will be heavily fragmenting its address space, so apps had to request huge page memory. But changing the software base is really slow and difficult, especially as many devs don't have systems big enough to really test that on. So Linux introduced transparent huge pages (THP) in which the kernel runs a background thread looking for allocations that are right next to each other and which the pages can be coalesced into a huge page. Obviously this is way simpler for the programmer.

Ideally you wouldn't need the MMU at all. You'd just never enable it and run all your software in kernel mode. This can be a big performance improvement because then there's no mapping at all (or a trivial identity mapping). It's never done in practice because:

• Software expects to be able to mmap files and have them be lazily loaded without needing to predict access patterns ahead of time (maybe not so critical these days)

• You can't enforce permissions on kernel code unless you use a hypervisor (which introduces MMU and page tables again)

• It's risky to use native code as the MMU isn't protecting you, if anything goes wrong you can end up with a kernel panic and maybe corrupted data on disk.

But with Java you could at least theoretically do it, as mmap isn't used for reading code there. You'd boot straight into a JVM running in kernel mode. THP might make the performance win too small to bother with though, I never looked in recent years.

The sibling comments observe that all this is a mystery to most Java programmers. Page table walks don't show up in normal high level app profilers because they're executed by the hardware and because the costs are pervasive (every memory access) and because the cache hit rate is so high. So it just ends up being part of the unpredictable low level performance noise of the machine.

crotchfire · on March 16, 2024

Yes, all this. However:

> You may have noticed in the past that context switches are slow

I'm not sure that the situation has improved. In many ways it is getting worse.

TLBs are growing much, much, much more slowly than L2/L3 cache sizes.

This is mainly because on-chip wire delay isn't dropping as quickly as transistor density is rising -- coupling capacitance in densly-wired areas has become a much bigger problem over the last decade. The TLB has to run at the same clock rate as the register file, but it is much larger (both in terms of bits and in terms of geometry).

> But with Java you could at least theoretically do it

Or with a memory-safe, non-garbage-collected language. Hopefully someday we'll have more than one of these. These languages are the final nail in the coffin of microkernels.

mynameisnoone · on March 16, 2024

Everyone knows 4 KiB memory pages just magically appear in the page table and OS kernel data structures with exactly the same computational and storage complexity as 1 GiB ones.

/s in case it wasn't transparent.

It's stupid egotists who get emotional and defend their "alternative facts".

onetimeuse92304 · on March 16, 2024

That is unfortunately my reality working with other Java developers. I mean, it would be if it actually wasn't even worse than that.

They are constantly pushing to create more "microservices" which is just just a misguided trend to spread your functionality over as many machines as possible and have your user wait until a graph of a hundred network calls finishes to get any simplest thing done.

Then they spend huge amount of time trying to resolve performance issues and call me, a "performance expert", to try to help find ways to improve the situation. Then dismiss my calls to maybe modularize their applications using regular programming language mechanisms like interfaces and packages.

Anyway... it's been quite long since I met a Java dev who even knows what MMU and TLB is. It's a rare candidate who knows what is a virtual memory in the first place.

znpy · on March 16, 2024

It does also bring occasional latency spikes.

You can solve that by manually allocating hugepages.

dataflow · on March 16, 2024

Doesn't this depend on your usage pattern?

aseipp · on March 16, 2024

Eh, in the past (maybe 5 years or more) THP was a lot fiddlier IME, and occasionally would hit bizarre cliffs in latency and overall performance. A lot of that has gone away. But because of that (among other things) you'd occasionally see advice like "Turn it off for <xyz particular workload>" so it sort of got carried forward to where people just generally disabled it.

mynameisnoone · on March 16, 2024

Wondering if anyone here runs Azul's C4 in production and how it compares to ZGC.

WJW · on March 15, 2024

I find it incredible how much design room there is in GC research and how much is as yet unexplored. In my opinion, GC (and relatedly, event loop scheduling) are often way too integrated in a language runtime. I would love seeing this become more configurable, and am glad to see developments in this direction like the fiber schedulers in modern Ruby and custom allocators in Zig.

I once inquired about making the Haskell IO manager configurable at runtime but the GHC maintainer team was extremely apprehensive about it because they feared that they would be on the hook for any bugs introduced by third-party IO managers. As it is, Java gets all the GC research love while most other languages make do with very basic GC algorithms. It doesn't have to be like this.

rbranson · on March 16, 2024

The reason is that runtime design forms the constraints of the GC design. Java allows relocating objects, Go does not. Go allows interior pointers, Java does not. These decisions have deep impact on GC design. Something this performance critical needs to be tailored to the runtime.

felixge · on March 16, 2024

AFAIK Go’s spec allows relocating objects. The door towards a moving GC is kept open.

_old_dude_ · on March 16, 2024

There are so many packages that only work with GODEBUG=cgocheck=0 that practically that door is closed

felixge · on March 18, 2024

I wouldn't be so sure. ByteDance (TikTok) implemented a moving GC for Go: https://youtu.be/DpQgJ06ZjGc?si=FviWDGa7HhBRBrO1

arccy · on March 16, 2024

not really that many...

pjmlp · on March 16, 2024

Which is a reason why randomly talking about programming language capabilities with automatic memory management is nonsense, without actually looking into the details.

Unfortunately that is exactly what the manual memory management crowd does.

mynameisnoone · on March 16, 2024

Erlang enters the chat to brag about efficient HiPE GC by tying objects to processes.

Pony enters the chat to brag about ORCA.

Rust enters the chat to brag about precise memory management without GC per se.

Those bastards!

pjmlp · on March 16, 2024

Your Erlang and Pony examples use automatic memory management and only reinforce my remarks.

Additionally, linear and affine type systems isn't the same as manual memory management.

memco · on March 16, 2024

In that same vein of thought Nim recently launched version 2 of the language which makes the ORC GC the default with options to use a few others. I don’t know if there’s new ones on the horizon but there’s already plumbing in case that’s an interest for you to explore further.

exabrial · on March 16, 2024

What sort of heap sizes does g1 vs zgc really start to make a difference?