Exploring Failure Transparency & the Limits of Generic Recovery
Lowell, Chandra, Chen

Paper's Contribution: A look at the feasibility of providing failure transparency at the operating system level.

Paper's Critical Ideas: The paper investigates whether failure transparency can be provided to application users by implementing failure detection and recovery at the operating system level. The authors first ask what would be required to guarantee failure transparency and conclude that two invariants are needed. First, an application must periodically save enough of its state to allow recovery, so that "the user does not discern failures"; they call this the "Save-Work" invariant. Second, an application must not save so much of its state that recovery always leads back to the same failure; they call this the "Lose-Work" invariant. The paper also profiles the performance of a handful of interactive applications running on top of two run-time packages the authors developed for "stop failure" recovery, Discount Checking and DC-disk, and presents the results.

Paper's Flaw(s): Of all the papers we've read so far, this is by far the poorest. The paper aims at a kind of utopian operating system that creates an illusion of failure-free operation for application users. In reality this is impossible, yet the authors never say so up front. Instead they make statements like: "To provide this illusion, the operating system must handle all hardware, software and application failures to keep them from affecting what the user sees." There is no way the operating system could ever handle ALL hardware failures; suppose, for example, that the CPU or motherboard were melting inside its case.
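To make the two invariants from the critical-ideas summary concrete, here is a minimal sketch of the Save-Work idea: commit state to a checkpoint before any user-visible output depends on it, so a crash can be recovered without the user noticing. This is a hypothetical illustration, not the paper's Discount Checking implementation; all names (`commit`, `recover`, `handle_input`) are invented for the example.

```python
import pickle

checkpoint = None  # in-memory stand-in for stable storage (illustration only)

def commit(state):
    """Save-Work: persist enough state that recovery is invisible to the user."""
    global checkpoint
    checkpoint = pickle.dumps(state)

def recover():
    return pickle.loads(checkpoint)

state = {"doc": ""}

def handle_input(ch):
    # A non-deterministic event (user keystroke) arrives and updates state.
    state["doc"] += ch
    # Save-Work: commit BEFORE producing output the user can observe,
    # so a crash after the output can still be recovered transparently.
    commit(state)
    print("echo:", state["doc"])  # user-visible output

handle_input("h")
handle_input("i")

# Simulated stop failure: process state is lost; recovery reloads the
# last committed checkpoint, and the user sees no lost work.
state = recover()
assert state["doc"] == "hi"
# Lose-Work is the flip side: if the commit had captured the event that
# *caused* the crash, recovery would replay it and fail again, so commits
# must sometimes deliberately lose recent work.
```

The ordering (commit, then output) is the essential point: output released before the commit is exactly the work a failure would make visible to the user.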
Beyond that, the mechanisms this paper presents for recovering application state from "stop failures" by restoring state from "stable storage" could themselves fail, say if the storage hardware failed partway through committing a transaction. For example, a disk head could crash onto the platter and corrupt bits mid-write. Storage redundancy, such as RAID, can reduce the risk, but total storage failure remains possible, and the authors never confront these issues. Such hardware failures exist even in the face of two-phase commit. The operating system will never be able to hide all failures, or even all "stop failures".

The paper also coins ridiculous oxymoronic terms like "fixed non-deterministic events" and makes foolish comments such as "xpilot is able to sustain a usable 9 frames per second". I can't recall a real-time application that could succeed in a free market running at only 9 fps, and I can't imagine this type of failure transparency being useful for the vast majority of real-time applications.

Additionally, the authors fail to consider that, since not 100% of failures can be recovered from, applications will still need some fault detection of their own. And it is quite prudent for an application that is not certain the system's state is correct to notify its users, so they can reason about the best method of recovery. I can't believe a published article contains so many shortcomings; given that reviewers were acknowledged, I can only surmise that they never got around to actually reading this paper. I would go on, but I think the point has been made.

Relevance to Modern Systems: Any system must somehow deal with unexpected failures, whether the failures are due to hardware or software faults.
The more that lower-level mechanisms, such as the operating system, can detect and recover from failures, the easier it becomes for application developers to create new and more useful systems.