Chandy & Lamport
Distributed Snapshots: Determining Global States of Distributed
Systems


The goal of the important algorithm presented here (I don't see the
actual determination of whether the "stable property" holds as being
the important point of the paper) is to cause the various nodes
(processes) in a distributed system to collectively capture at least a
"good enough" view of the global state (state of each node plus
in-flight data in the communication channels).  Although they seem to
be focused only on collecting this state snapshot for the purpose of
recognizing stable properties, the algorithm has greater relevance
because (in my opinion) the snapshots they collect, while potentially
reflecting a state that never was, are appropriate for many other
applications because they always reflect a state that could have been
between the time the capture was started and when it finished.

The algorithm described is essentially epidemic.  One node (what
happens if multiple nodes decide concurrently?) spontaneously decides
to capture its state.  After a node captures its state, and before
doing anything else, it sends a marker on each out-link.  Reading a
marker forces the recipient to capture its own state before doing
anything else, and thereby propagate the capture to its own out-links.
Any messages that arrive at a post-capture node from in-links for
which no marker has yet been received are captured and assigned in the
snapshot as being in-flight on that link.  Because the capture takes a
long time to complete, the system can be in a number of states while
it is happening.  Ideally the captured state would be precisely one of
these occuring, valid states.  In reality however, it can differ
depending on the order in which messages and markers are delivered and
read at the various nodes.  Instead they argue that the captured state
is always interchangeable with the valid states for their purposes (of
determining if the computation exhibits a stable property).

The relationship between this paper and the Lamport paper on time &
clocks comes from the notion of what it means for two events to be
"concurrent" or causally unrelated.  When the captured global state
does not reflect a true system snapshot it is always because the
ordering of one or more pairs of events has been reversed in the
captured view of what happened.  For their purposes it suffices to
show that the captured state is plausible in the sense that it is
reachable from the start-of-capture state, and that the end-of-capture
state is reachable from it.  But, as I see it, we can also say that
any event pairs whose order is perturbed between the true computation
and the capture were causally unrelated (in the terminology of the
other paper, neither event can be said to have "happened before" the
other).  Therefore who's to say that any of the actual system states
that occured during the capture interval (i.e. the set of "ideal" capture
outputs) are either more or less valid than the "imaginary" result
given by the algorithm?


Andy