I had a mentor who worked at Intel labs when this was happening.
This died because someone invented a GC algorithm that consistently outperformed it, leading Intel to drop its hardware GC plans.
That seems to be one of the main risks for any specialized circuit, if I understand you correctly. You always have to guess: "Will this really be relevant long enough to justify the money to bake it into hardware?" If you guess wrong, you have wasted part of your silicon budget on something no one will use.
When I thought about this (using my programmer brain, which knows nothing about hardware), I came up with the idea of a 'store pointer' instruction that takes two addresses and an offset, and stores the first address into the field at that offset in the object pointed to by the second address. In addition, if the two addresses refer to different memory regions, it records the pair of addresses in some kind of buffer on the processor. When that buffer fills up, the processor traps to some preconfigured location.
That could be used as a basis for a write barrier.
The devil would be in the detail of how the regions were defined.
And maybe the trapping would mean this wasn't even all that fast.
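Roughly what I have in mind, modelled in C since I can't write hardware. Everything here is made up for illustration: the names `store_pointer` and `trap_handler`, the four-entry buffer, and especially the assumption that regions are fixed-size, aligned 4 KiB chunks, so that "different region" is just a comparison of the high address bits. A real design might define regions quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: regions are aligned, fixed-size chunks of the address space. */
#define REGION_SHIFT 12   /* hypothetical 4 KiB regions */
#define BUF_ENTRIES  4    /* tiny on purpose, so the trap is easy to trigger */

struct pair { void *value; void *object; };

static struct pair buf[BUF_ENTRIES];  /* stands in for the on-chip buffer */
static size_t buf_len;
static size_t traps;                  /* counts how often the "trap" fired */

/* Stands in for the preconfigured trap target. A collector would copy the
   recorded pairs into its remembered set here; this sketch only drains. */
static void trap_handler(void) {
    traps++;
    buf_len = 0;
}

static int same_region(const void *a, const void *b) {
    return ((uintptr_t)a >> REGION_SHIFT) == ((uintptr_t)b >> REGION_SHIFT);
}

/* Software model of the imagined instruction: store `value` into the
   pointer field `offset` bytes into `object`, and log the pair if the
   two addresses fall in different regions. */
static void store_pointer(void *value, void *object, size_t offset) {
    memcpy((char *)object + offset, &value, sizeof value);
    if (!same_region(value, object)) {
        assert(buf_len < BUF_ENTRIES);
        buf[buf_len].value = value;
        buf[buf_len].object = object;
        buf_len++;
        if (buf_len == BUF_ENTRIES)
            trap_handler();   /* buffer full: hand control to the runtime */
    }
}
```

In this model, same-region stores cost one extra compare, and the runtime only gets involved once per `BUF_ENTRIES` cross-region stores. Whether that is cheap enough depends entirely on how expensive the trap is.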