State, liveness and pointer reachability

The values that a program can manipulate directly are those held in processor registers, those on the program stack (including local variables and temporaries), and those held in global variables. Such locations holding references to heap data form the roots of the computation. Automatic heap-memory management demands that certain rules be followed by the programmer, dynamically allocated data should only be accessible to the user program through the roots, or by following chains of pointers from these roots. In particular, the program should not access random locations in its address space, for example by picking an arbitrary offset from the base of the heap. This restriction is not unique to garbage collection. It is also enforced by strongly-typed languages such as Pascal. Safe use of C’s explicit malloc/free allocation mechanisms also demands that the user program does not access unallocated regions of memory.

An individually allocated piece of data in the heap will be called, interchangeably, a node, cell or object[1]. The rules above imply that the storage mechanism’s view of the liveness of the graph of objects in the heap is defined by pointer reachability. An object in the heap is live if its address is held in a root, or there is a pointer to it held in another live heap node. More formally, define $$→$$ as the ‘points-to’ relation: for any node or root $$M$$ and any heap node $$N$$, $$M→N$$ if and only if $$M$$ holds a reference to $$N$$. The set of live nodes in the heap is the transitive referential closure of the set of roots under this relation, i.e. the least set[2] live where

$$ live={ N ∈Nodes | (∃r ∈Roots.r →N)∨(∃M ∈live.M →N)}

$$

For the moment, we note that this view of live cells in the heap is only a conservative estimate of the actual set of cells that are potentially accessible to the program. It may include cells that analysis of the program text or data flow analysis by an optimizing compiler would reveal to be dead. Typical examples include a local variable after its last use in a procedure, as yet uninitialized slots in a stack frame, or an obsolete pointer left in a register (to avoid the cost of clearing it). We shall return to this question later in this chapter and also when we consider techniques for conservative garbage collection in Chapter 9.

A node’s liveness may be determined either directly or indirectly. Direct methods require that a record be associated with each node in the heap, of all references to that node from other heap nodes or roots. The most common direct method is to store a count of the number of pointers to this cell, its reference count, in the cell itself. Direct algorithms for distributed systems may instead keep lists of the remote processors that contain references to each object. In either case, these records musts be kept up to date as the mutator alters the connectivity of the graph in the heap.

Indirect or tracing collectors typically regenerate the set of live nodes whenever a request by the user program for more memory fails. The collector starts from the roots and, by following pointers, visits all reachable nodes. These nodes are considered to be live and all memory occupied by other nodes is made available for recycling. If sufficient memory has been recovered, the user program’s request is satisfied and it is restarted.


[1] It will be made clear where the latter term is meant in the object-oriented sense.

[2] Mathematical note: such a least set exists by Tarski’s theorem, which states that any equation of the form $$S=fS$$ where $$f$$ is a monotonic operation on sets, has a least fixed point.

results matching ""

    No results matching ""