Section 2.5 -- the business of non-determinism helping seems mildly insane, because they are saying that we might be better off losing the very state we want to save, precisely because we cannot get it back. I suppose the real point is that we can do this if visible events are rare, so that we can run as long as possible saving nothing and get a different execution in the event of a failure.

Exploring Failure Transparency and the Limits of Generic Recovery
Lowell, Chandra, and Chen

This paper focuses on the challenge of transparently recovering from software faults in both centralized and distributed systems. The basic technique is checkpointing (saving all application state) and restarting (recovering by jumping back in time to the last checkpoint and replaying the application's actions from that point). Although the application is thus somewhat involved in the process, the techniques described are fully automated, so one could imagine either inserting them at compilation or having the OS stop program execution (potentially after each operation) to perform the required actions.

For the sake of non-circularity, I'll hit on the weaknesses first and the strengths later. As I see it, there are two major weaknesses, one in the work itself and one in the presentation.

From a presentation point of view, they completely ignore the question of whether transparency is the right goal in the first place. Transparency clearly makes sense for dealing with program bugs (if the programmer were aware of the problem, there wouldn't be a problem) and for truly transient errors, but for many sorts of environmental problems there are ways a program could react that are better than backing out and trying again. The authors seem to view overhead as the only reason to avoid transparency when it might be achievable.

As for the work itself, my only real concern is that the assumption that it is acceptable to duplicate visible events is overly restrictive. If the application is dealing directly with the user, I would expect this to be a serious UI issue, especially if the duplicates can be widely spaced in time. I understand that the output can be tagged so that duplicates can be recognized, and that this typically solves the problem when the reader is another computer application, but I imagine a human user would demand a filter to suppress the duplicates, and that filter would itself become a source of failure (especially since it has to remember pretty much everything that has been passed to the user).

The important thing I learned from this paper is a far better appreciation for the balance between exact recovery and the flexibility to avoid recurring failures. I find compelling the notion that it is sometimes better to lose a great deal of work, with no hope of perfect recovery, than to have a more precise system become hopelessly stuck.

The distinction between fixed and transient non-determinism is an interesting way of looking at the problem. It seems that you could take this one step further, however, by having the system attempt to break fixed non-determinism, particularly after a failed recovery. This would mean pushing back (either to the user, to the OS, or to other processes) the information that a fixed non-deterministic event is breaking the recovery process, so that the party that originated the event could consider altering the computation to avoid the dangerous path; the sketch below shows roughly what I have in mind.
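To keep the pieces straight in my own head, here is a minimal Python sketch of the checkpoint-and-replay loop as I understand it, extended with the break-fixed-non-determinism idea from the previous paragraph. It is entirely my own toy -- RecoveryManager, next_event, fresh_source, and the crash-on-7 rule are all made up for illustration, not anything from the paper -- but it shows how a logged non-deterministic value can bake the crash into every faithful replay, and how discarding the log after a failed recovery pushes the decision back to the originating source so the next execution can take a different path.

```python
import copy
import random


class RecoveryManager:
    """Toy checkpoint/replay bookkeeping -- illustrative only, not the paper's system."""

    def __init__(self):
        self.checkpoint = None   # last saved application state
        self.event_log = []      # non-deterministic values seen since that checkpoint
        self.crash_count = 0     # crashes so far: the original run plus failed recoveries

    def take_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)
        self.event_log = []      # the log only needs to cover events after the checkpoint

    def next_event(self, cursor, source):
        """Return the next non-deterministic value plus the advanced log cursor.

        While logged values remain, recovery replays them so re-execution follows
        the original run; past the end of the log (or after it has been discarded)
        we go back to the originating source and log the fresh value.
        """
        if cursor < len(self.event_log):
            return self.event_log[cursor], cursor + 1
        value = source()
        self.event_log.append(value)
        return value, cursor + 1


def fresh_source():
    # Stand-in for any non-deterministic input: user input, message arrival, a timer, ...
    return random.randint(0, 9)


def run(mgr, state):
    """One execution attempt; raises when the computation wanders onto the bad path."""
    cursor = 0
    while state["step"] < 5:
        value, cursor = mgr.next_event(cursor, fresh_source)
        if value == 7:           # pretend this particular value drives us into a bug
            raise RuntimeError("application crashed")
        state["step"] += 1
        state["total"] += value
    return state


mgr = RecoveryManager()
mgr.take_checkpoint({"step": 0, "total": 0})

while True:
    try:
        print("finished with state:", run(mgr, copy.deepcopy(mgr.checkpoint)))
        break
    except RuntimeError:
        mgr.crash_count += 1
        if mgr.crash_count >= 2:
            # Faithful replay just reproduced the crash, so break the "fixed"
            # non-determinism: discard the log and let the originating sources
            # produce different values on the next attempt.
            mgr.event_log.clear()
```

The knob worth noticing is when to give up on the log: keeping it preserves as much work as possible, while discarding it trades work for the chance of a different, non-crashing execution -- exactly the balance between exact recovery and flexibility discussed above.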
This idea of course ties back to the question of whether transparency is the right goal, since it throws transparency out the window in the hope of exposing additional flexibility that can be used to route around otherwise insurmountable obstacles.

The other contribution is to better pin down why propagation failures are so much harder to deal with than fail-stop failures. What is interesting is that it has little to do with being able to trace back from the crash to the cause of the problem (which is why it is hard for *programmers* to deal with these faults) and much more to do with the increased chance of the failure becoming ingrained in the "correct" backup system state before the problem manifests itself. I wonder if a more complete history would help (if at first you don't succeed, try backing off to an older checkpoint; see the sketch at the end of this review). Note that this again would violate some of the guarantees that the save-work invariant gives us.

Relevance today: all sorts (not surprising for a recent paper). Designing zero-failure systems is hard, so understanding how to build a failure-recovery system is the buggy-program equivalent of attacking hardware faults through replication in distributed systems: make the system do what the individual components (in this case the application itself) cannot.
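Here is the back-off idea from above as a small sketch, again entirely mine and with hypothetical names (recover_with_backoff, replay): keep a short history of checkpoint generations and, each time recovery from the newest one fails, fall back one generation. The corrupted newest checkpoint in the example is meant to stand in for a propagation failure that became ingrained in the "correct" backup state before the crash showed up.

```python
import copy


def recover_with_backoff(checkpoints, replay):
    """Try the newest checkpoint first; after each failed recovery, back off a generation.

    `checkpoints` is ordered oldest -> newest; `replay(state)` re-runs the application
    from a checkpointed state and raises if it crashes again. Backing off further loses
    more work, but improves the odds of starting from a state the fault has not yet
    contaminated.
    """
    for state in reversed(checkpoints):
        try:
            return replay(copy.deepcopy(state))
        except RuntimeError:
            continue              # this generation is (or leads straight back to) a bad state
    raise RuntimeError("no retained checkpoint led to a successful re-execution")


# Hypothetical usage: three generations of saved state, where the newest checkpoint was
# taken after the fault had already corrupted the counter but before the crash appeared.
history = [
    {"counter": 10},              # oldest, still healthy
    {"counter": 20},              # healthy
    {"counter": -1},              # newest, silently corrupted
]

def replay(state):
    if state["counter"] < 0:      # the corrupted state keeps reproducing the crash
        raise RuntimeError("crash reproduced")
    state["counter"] += 5         # pretend we redo the lost work
    return state

print(recover_with_backoff(history, replay))   # -> {'counter': 25}
```

The toy also shows why this is only a partial answer: if the corruption predates even the oldest retained checkpoint, no amount of backing off helps, and all of the intervening work is lost anyway.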