|
|
 |
|
Project 5: Generation Numbers
Goal
The basic operation of SNet requires that a node be able to determine
which of two update messages generated by some other member is the most recent.
Thus, generation numbers must be ordered. In our case, they are strictly increasing.
A practical issue that immediately arises as a side effect is that transient bugs can have
long term consequences:
- If a node, say elmo,
loses its own generation (due, say, to a disk crash), and then restarts its
generation numbering at 0, it may take it a long time to reach its now lost generation
number. During that time all of elmo's updates are ignored by other nodes because
they have generation numbers lower than maximum generation number already seen by those nodes for
elmo updates.
- If any node anywhere in the system mistakenly injects a record with a very
high sequence number for, say, elmo, then none of elmo's legitimate updates
will be accepted by any other node for a potentially very long time.
There are a number of ways to deal with these problems, but the approach taken in SNet
is to base generation numbers on a real-time clock. In particular, SNet uses
the clock implemented as NetBase.theNetBase().now(), which is the local system's
best guess at the number of seconds that have passed since midnight, January 1, 1970 UTC.
So long as nodes don't somehow manage to have their time estimate go backwards (even across
various kinds of failures), this solves the first problem noted above. The second problem
is solved by having all nodes refuse updates with generation numbers that are too far in the
future, relative to their own clocks.
In SNet, a node refuses an update if its
generation number is more than 10 seconds in the future, presuming that such an update
was caused by a bug.
This scheme isn't perfect. For one thing, unless the time estimates on all the nodes are
all reasonbly accurate, it will be impossible to communicate. Because systems typically
engage in what is essentially a global, distributed, clock synchronization protocol, it's not
terribly optimistic to hope for synchronization within 10 seconds these days.
The other problem is that the clock value changes only once per second. That may,
or may not, be fast enough to assign distinct generation numbers to distinct updates, even
on a single node. The fix for this is simple:
the generation number assigned to an update is the maximum of the current clock value and the
old generation number plus one.
|