Chandy & Lamport Distributed Snapshots: Determining Global States of Distributed Systems The goal of the important algorithm presented here (I don't see the actual determination of whether the "stable property" holds as being the important point of the paper) is to cause the various nodes (processes) in a distributed system to collectively capture at least a "good enough" view of the global state (state of each node plus in-flight data in the communication channels). Although they seem to be focused only on collecting this state snapshot for the purpose of recognizing stable properties, the algorithm has greater relevance because (in my opinion) the snapshots they collect, while potentially reflecting a state that never was, are appropriate for many other applications because they always reflect a state that could have been between the time the capture was started and when it finished. The algorithm described is essentially epidemic. One node (what happens if multiple nodes decide concurrently?) spontaneously decides to capture its state. After a node captures its state, and before doing anything else, it sends a marker on each out-link. Reading a marker forces the recipient to capture its own state before doing anything else, and thereby propagate the capture to its own out-links. Any messages that arrive at a post-capture node from in-links for which no marker has yet been received are captured and assigned in the snapshot as being in-flight on that link. Because the capture takes a long time to complete, the system can be in a number of states while it is happening. Ideally the captured state would be precisely one of these occuring, valid states. In reality however, it can differ depending on the order in which messages and markers are delivered and read at the various nodes. Instead they argue that the captured state is always interchangeable with the valid states for their purposes (of determining if the computation exhibits a stable property). The relationship between this paper and the Lamport paper on time & clocks comes from the notion of what it means for two events to be "concurrent" or causally unrelated. When the captured global state does not reflect a true system snapshot it is always because the ordering of one or more pairs of events has been reversed in the captured view of what happened. For their purposes it suffices to show that the captured state is plausible in the sense that it is reachable from the start-of-capture state, and that the end-of-capture state is reachable from it. But, as I see it, we can also say that any event pairs whose order is perturbed between the true computation and the capture were causally unrelated (in the terminology of the other paper, neither event can be said to have "happened before" the other). Therefore who's to say that any of the actual system states that occured during the capture interval (i.e. the set of "ideal" capture outputs) are either more or less valid than the "imaginary" result given by the algorithm? Andy