GC FAQ -- harder stuff

Harder stuff

Thread support

Makes life interesting, since a garbage collection may now occur "at any time". In addition, few thread packages make it easy to stop other threads and retrieve their state, an operation that's required by most garbage collectors.

Atomic operations may require additional code to ensure that they are really atomic (two examples are reference count update and object copying), and some use of memory barriers may be needed to ensure that writes are seen in the same order by all observers. For example, reference counts cannot inconsistently fall to zero (if any observer sees a non-zero reference count, then the memory ought not be reclaimed).

Typical solutions here include working at a machine-dependent level, and mostly removing the threading issues by caching information in a single-threaded context. For example, a heap can be carved up into per-thread pieces, and each thread allocates inpedendently and unsynchronized, except when it needs a new piece. Reference counting can remove much of its dependence on fine-grained synchronization by caching updates in per-thread buffers, and processing the buffers in batch form.

Threads can also be tamed with "safepoints". Before a GC can be started, all threads must be parked at safepoints, where their pointers are guaranteed to be in stable, well-known places. Safepoints have some of their own problems; if, for instance, a low-priority thread must reach a safepoint before GC can begin, GC for the entire process may be delayed.

Parallel collection

Parallel collection makes use of multiple processors to reduce the elapsed time needed by GC. This is especially important on multiprocessors, because if garbage collection does not scale as well as the garbage generator, it becomes the bottleneck. In a stopped-mutator collector, parallelism is not too hairy; it is easy to sweep in parallel, and parallel marking is not too difficult, though it does require some form of load balancing. Compaction can be parallelized by dividing the heap into as many pieces as processors, or the forwarding pointers can be installed using compare-and-swap (on architectures that support this).

Concurrent and incremental collection

"Concurrent" collection means concurrent with the mutator; that is, without pauses. A close approximation of concurrent collection is incremental collection, which is more tractable because it does not have to deal with mutation during the garbage collection algorithm.

One way to add concurrency is to perform non-interfering GC tasks concurrent with mutation. These can include processing card marks and some versions of snapshot (e.g., mark-and-sweep on a copy-on-write mapping) mark-and-sweep collection. See, for example:

"Concurrent Remembered Set Refinement in Generational Garbage Collection" by David Detlefs, Ross Knippel, William D. Clinger, Matthias Jacob, in the Proceedings of 2002 USENIX Java VM Research and Technology Symposium.
Available at http://research.sun.com/jtech/pubs/.

Other concurrent collection algorithms require some amount of mutator cooperation, usually to obtain the equivalent of a snapshot or a write barrier. Another advantage of incremental collection over concurrent collection is that it can adjust the rate of GC if it falls behind; a concurrent collector's rate is affected by changes in OS thread scheduling.

Distributed objects

A reference (e.g. ports, sockets, URLs, etc.) is the distributed equivalent of a pointer. Few existing reference mechanisms support GC. If you consider objects that are widely shared, in a large distributed system, manual collection is a nightmare, and automatic garbage collection is essential.

Lots of people think it's a very hard problem. In fact, complete GC is unfeasible, but simple solutions can collect most garbage.

The problem is that the property "object X is garbage" is a global property, i.e. conceptually you need to examine the whole system (in a consistent state) just to know the status of X. Examining the whole system is expensive, and actually unfeasible in the presence of failures; getting a consistent state is hard. However it is always safe not to collect; so any conservative approximation is OK. Furthermore, garbage objects will not revert, so algorithms that work "eventually" are ok if there is enough slack in the system.

What is different about distributed systems?

You will find general algorithms in the distributed systems literature that appear to solve the above problems, called distributed snapshots, distributed transactions, etc. They will work but they are very costly and they don't scale. Specialized solutions to the specific problem of GC can be much simpler and easier to implement.

Distributed reference counting and reference listing

The easiest GC algorithm to distribute is reference counting. A simple-minded approach is to attach a counter to each object; duplicating or killing a reference sends an increment or decrement message to the target object. This is very expensive (one message per pointer operation) but more importantly doesn't work. Since there is no guaranteed delivery order for messages, an increment might be overrun by a decrement, and the target object unsafely reclaimed.

A general fix is to provide "Causal delivery" but that's overkill. A simpler fix is Weighted Reference Count (or some variation thereof). When an object X is created, it is allocated an initial weight. The first reference to X carries a weight equal to the object's. Duplicating any reference to X divides the weight between the two duplicates. Killing a reference sends a decrement message to the object's weight. At all times, the sum of the reference weights is equal to the object's weight.

In "reference listing" the target object keeps a "scion set", i.e. a list of backpointers to the processes that hold pointers to it. This is made possible by the following observation: if process P owns object X, process Q may acquire a reference to X only if P sends a message to Q containing the reference. At message sending time, record the name Q in the reference list of X. When Q kills its (last) reference to X, it sends a decrement message which removes Q from X's scion list. When the scion list is empty, X can be reclaimed.

Reference listing is less sensitive to failures than counting. If process Q crashes, P can unilaterally remove its scion from all its reference lists (but beware: it is impossible to reliably detect crashes!). If a decrement message is lost, P can query Q. Reference listing is heavily used in Partitioned GC.

Reference counting and listing scale well. They cannot collect cycles of garbage. Another drawback in practice is that they do not provide "root identity", i.e. if there are multiple roots (e.g. process roots and persistent roots), there is no way to know from which a specific object is reachable.

Distributed Mark and Sweep

For concreteness, I will concentrate on Mark and Sweep; but all tracing methods (whether mark and sweep, copying, or other) have a synchronization point and hence will not scale well. On the other hand, tracing does collect cycles and supports root identity. It is also fault tolerant (in a perverse way): if anything goes wrong, you can abort a trace and start again from scratch.

The simple-minded approach, where each process marks and sweeps in parallel, sending "mark" messages down remote references, doesn't work. Suppose P has a reference to object X in process Q, and Q has no local reference to X. Then Q shouldn't start sweeping until it is certain that it will receive no more mark messages from P. Of course, P must do the same with respect to Q. Thus, every process must wait for all others between the mark and the sweep phases (this is called a "termination protocol").

The Hughes algorithm overcomes this problem somewhat by using timestamps to mark reachable objects (rather than a single mark bit). The idea is that the timestamp of a reachable object will always increase; after an object becomes unreachable its timestamp ceases to increase; objects whose timestamp is below some global minimum are garbage. The catch is that computing the global minimum is a termination protocol. The good thing is that it can be done in the background.

Ladin and Liskov take an alternative approach. The idea here is that each process sends (an approximation of) its local graph to some central service. The central service combines the partial graphs into a global graph, which it marks and sweeps. There are a few problems with this approach. The first is that the central service is a bottleneck and a single point of failure (Ladin and Liskov address this problem by replication). The second is that the partial graphs are mutually inconsistent (Ladin and Liskov address this by timestamping and being conservative). The third is that getting the partial graph information is harder than it seems (Ladin and Liskov got it wrong the first time around).

Partitioned or Hybrid GC

Partitioned GC combines the advantages of Distributed Reference Counting or Listing and of Distributed Mark and Sweep. The global memory is partitioned; GC is hybrid: tracing within a subset, counting (or listing) across subsets.

The idea is that when some subset of the global memory (e.g. a process) sends a pointer to another subset, that pointer is recorded in a "scion set" (akin to a remembered set). You keep a count of such inter-subset references. Thus, each subset can independently and locally perform tracing of any kind. Inter-subset references are reference-counted.

The difference between a remembered set and a scion set is that in the former case the memory subsets are ordered, so no cycles of garbage can form. In the latter case there is no natural ordering and inter-subset garbage cycles cannot be collected.

Variations on this theme have to do with doing the whole job more efficiently and tolerating faults ("SSP Chains": Shapiro, Dickman and Plainfossè), collecting cycles of garbage that span subsets (Hughes, or "Garbage Collecting the World": Lang Queinnec and Piquer), avoiding expensive memory synchronization and I/O operations ("Larchant": Ferreira and Shapiro), using the counting information to choose the best subset to trace (Cook Wolf and Zorn, or "Sprite LFS": Rosenblum and Ousterhout), or avoiding problems with transaction aborts (Amsaleg Gruber and Franklin).

DSM, cached distributed stores

See also: GC of persistent stores.

Databases, persistent stores, and file systems

What is different: Reference counting and listing, mark and sweep: see distributed systems

Partitioned GC

Tolerating transaction rollback and inconsistent data Clustering

Uncooperative environments

Garbage collectors designed to work with C and C++ cannot rely on compiler support for garbage collection, because there is none. Such garbage collectors cannot know exactly where pointers are stored on the stack, or in which variables, nor can they necessarily know the exact layout of allocated objects. The usual way to solve this problem is with a conservative garbage collector. Relocation (compaction) of objects referenced by these uncertain pointers is not possible.

In cases where the C is machine-generated (in compilation of some other language, such as Sather, Eiffel, Scheme, or Modula-3), code to record active pointers can be generated.

There is a theoretical problem posed by sufficiently good optimizing compilers. In principle, there is no reason why the generated code should reference objects by a pointer to the start of the object, or even to its interior. In practice this is not a problem.

Hardware support

Recent work

Kelvin Nilsen's approach to hardware-assisted real-time garbage collection is to insert special hardware between the cache and the memory subsystem. The role of this hardware is to:
  1. Enforce read barriers by snooping on memory fetch operations and inserting pipeline stalls and/or requiring fetch retries whenever special handling is required.
  2. Maintain a log of fetch and store operations for purposes of updating cross-generational link tables for generational garbage collectors.
  3. Maintain a larger-than-normal write buffer and provide programmable support for implementation of custom write barriers.
  4. Provide a very small programmable coprocessor to assist with garbage collection efforts by performing such routine tasks as scanning regions of memory for pointers and copying blocks of memory from one region to another. This coprocessor resides next to memory so its activities do not generally interfere with the contents of the main CPU's cache.
The garbage collection hardware can be configured to support a variety of different garbage collection techniques including two-space copying, incremental mark and sweep, generational, and various hybrid techniques. For more information, refer to http:www.newmonics.com/WebRoot/technologies/gc.html

OS support

Compiler support

GC FAQ table of contents
Techniques and algorithms
Language interfaces
Difficult topics
Silly stuff