
It does seem like just doing it in hardware may be a linear gain but isn't a fundamentally better algorithm. There's a proof that you do need to pause your program eventually, if you want to be sure you get all the garbage.


Hardware is fundamentally parallel, CPUs are fundamentally serial; it is possible for a hardware solution to have a super-linear speedup in time.

As a simple example, what is the time complexity of zeroing out n bytes of memory? With a CPU, this takes O(n) time. However, with proper hardware support, this can be done in O(1) time.

For a simple garbage collecting example (no idea how their chip does it), consider a simple mark and sweep algorithm. Assume the chip has an internal object graph. At the beginning of GC, only root nodes are tagged. At each step, the neighbors of every tagged node are tagged. With a CPU, this step takes at least Ω(n) time, where n is the number of tagged nodes. However, if this is done entirely by hardware, then (depending on how the hardware is designed) each node can independently look at its neighbors and complete a single step in just O(1) time.

Moving stuff to hardware gives you a set of primitives that is asymptotically different than the primitives you have on a general purpose CPU.
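The marking step described above can be sketched as a frontier expansion. This is a hypothetical illustration, not how any actual GC chip works: on a CPU each round costs time proportional to the frontier, whereas the claim is that dedicated hardware could tag every neighbor of every tagged node simultaneously, making each round O(1).

```python
def mark(graph, roots):
    """graph: dict mapping node -> list of neighbor nodes.
    Returns (set of live nodes, number of rounds taken)."""
    tagged = set(roots)
    frontier = set(roots)
    rounds = 0
    while frontier:
        # One "step": tag every untagged neighbor of a tagged node.
        # Hardware could, in principle, evaluate this whole set at once.
        frontier = {n for node in frontier for n in graph[node]} - tagged
        tagged |= frontier
        rounds += 1
    return tagged, rounds

# 'd' points at 'a' but is not reachable from the root, so it is garbage.
g = {'a': ['b'], 'b': ['c'], 'c': [], 'd': ['a']}
live, rounds = mark(g, ['a'])
```

The number of rounds is bounded by the longest path from a root, so even the parallel version is O(depth), not O(1) overall.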


But hardware that has to interface with the main memory can't be fundamentally parallel, because of memory bandwidth limitations. If you want to make this part of the memory, then my objection does not apply. But if it's an external chip to the memory, you still are fundamentally serial.


How much does it need to interface with main memory? If the external chip has its own memory then it can maintain its own view of the object graph. The software can update it when references are made/deleted and query it when an allocation needs to be made.

There is still the overhead of communicating when references are made/deleted, but you need that anyway, as something that only looks at memory doesn't know what is a pointer, or the size of objects.

The communication overhead is then linear in the number of updates to the object graph, not in the size of the object graph.

You could even go so far as to not put the chip on the memory bus at all.
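A rough sketch of the interface being described, with invented names: the running program only notifies the coprocessor when a reference changes (a write barrier), so traffic scales with the number of pointer updates rather than heap size. The reachability query stands in for whatever the chip would compute internally.

```python
class GCCoprocessor:
    """Hypothetical stand-in for an external chip's mirrored object graph."""

    def __init__(self):
        self.edges = set()  # (src, dst) reference pairs, mirrored on-chip

    def ref_added(self, src, dst):
        # Called by a write barrier when a pointer is stored.
        self.edges.add((src, dst))

    def ref_removed(self, src, dst):
        # Called by a write barrier when a pointer is overwritten.
        self.edges.discard((src, dst))

    def reachable(self, roots):
        # The chip would compute this internally and in parallel;
        # it is sketched serially here.
        live, frontier = set(roots), set(roots)
        while frontier:
            frontier = {d for (s, d) in self.edges if s in frontier} - live
            live |= frontier
        return live
```

Note that this only works because the program tells the chip what is a pointer; the chip never has to scan raw memory.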


I'm presuming some kind of a heap. That heap has something like a list of free blocks. Who keeps that list - the main memory, or this other chip? If the main memory does, then when you garbage collect blocks you have to change the main memory. If the chip keeps the list, then yes, the chip can do it all internally. But...

What about multitasking? If there are N programs running, and each has their own heap, the chip has to be able to keep track of each of them. Or it has to be able to keep track of one monster heap that uses almost all of the available memory, even if it's occupied by a bunch of small allocations. All this without using any (main) memory itself. This chip would have to have a large amount of onboard memory itself to pull that off. The very worst scenario would be to be a long way into a run of a program that thrashed the heap, and then the GC chip ran out of slots. You can't (easily) go back and re-do the heap to have the heap control in the main memory, and you can't continue with the chip managing the heap.
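To make the free-list point concrete, here is a minimal sketch (invented for illustration): freeing a garbage block means mutating this list, so whichever side stores it, main memory or the chip's onboard memory, must be written to on every collection.

```python
class FreeList:
    """Minimal free-block list: (address, length) pairs, no coalescing."""

    def __init__(self, start, size):
        self.blocks = [(start, size)]

    def alloc(self, size):
        # First-fit search through the free blocks.
        for i, (addr, length) in enumerate(self.blocks):
            if length >= size:
                if length == size:
                    self.blocks.pop(i)
                else:
                    self.blocks[i] = (addr + size, length - size)
                return addr
        return None  # out of memory

    def free(self, addr, size):
        # Returning a block is a write to wherever this list lives.
        self.blocks.append((addr, size))
```

If the chip holds one such structure per heap, its fixed onboard memory caps how many heaps (and how many free blocks) it can track, which is exactly the multitasking problem above.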


You would need an infinite amount of hardware, though, as O() notation only applies as n tends towards infinity.


You already need an infinite amount of hardware. It is just that RAM is no longer the only resource that we assume we have enough of.


Can you find this proof? I haven't had any luck. And it's a pretty surprising result to me, assuming "all the garbage" means garbage that was created before the collection cycle started; it's a trivial and useless proof that garbage created 1 nanosecond ago still exists.



