Exploring Failure Transparency and the Limits of Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen


This paper examines the possibility of providing generic failure recovery
services for applications in the face of both fail-stop and propagated failures.
It concludes that generic recovery is not possible.

The paper introduces the Save-work and Lost-work theorems, which define the
requirements for successful recovery from fail-stop and propagated failures
in terms of non-deterministic and commit events.  The Save-work theorem
states that consistent recovery from stop failures requires that each
non-deterministic event that causally precedes a visible (externally
observable) or commit event be committed before that visible or commit
event.  The Lost-work theorem states that application-generic recovery is
not possible if a commit event occurs on a dangerous path, where a dangerous
path is a path that can lead to a crash.

Application-generic recovery is shown not to be possible because the
Save-work and Lost-work conditions cannot both hold in all cases: Save-work
requires committing before visible events, while Lost-work forbids
committing on a dangerous path.  This is shown by an example.
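
The tension between the two conditions can be illustrated with a small
sketch.  This is not the paper's formalism: it assumes a single process,
a trace examined in hindsight, and a dangerous path that runs from a
hypothetical "bug" (fault activation) event to the crash.  The event labels
and function names are invented for illustration.

```python
# Simplified single-process model; all event labels are hypothetical.
ND, COMMIT, VISIBLE, BUG, CRASH = "nd", "commit", "visible", "bug", "crash"


def save_work_holds(trace):
    """True if every non-deterministic event is committed before the
    next visible event (single-process simplification of Save-work)."""
    uncommitted_nd = False
    for event in trace:
        if event == ND:
            uncommitted_nd = True
        elif event == COMMIT:
            uncommitted_nd = False
        elif event == VISIBLE and uncommitted_nd:
            return False
    return True


def lost_work_holds(trace):
    """True if no commit occurs between fault activation and the crash
    (assumed stand-in for 'no commit on a dangerous path')."""
    if CRASH not in trace:
        return True  # in hindsight, this trace had no dangerous path
    on_dangerous_path = False
    for event in trace:
        if event == BUG:
            on_dangerous_path = True
        elif event == CRASH:
            break
        elif event == COMMIT and on_dangerous_path:
            return False
    return True


# The same run, without and with a commit before the visible event.
without_commit = [BUG, ND, VISIBLE, CRASH]
with_commit = [BUG, ND, COMMIT, VISIBLE, CRASH]

for trace in (without_commit, with_commit):
    print(trace, save_work_holds(trace), lost_work_holds(trace))
```

In this toy trace, omitting the commit violates the Save-work check, while
adding it violates the Lost-work check, so neither choice satisfies both.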

To estimate how often real applications violate the combination of the
Save-work and Lost-work conditions, the authors ran several applications on
an instrumented system.  The results of these experiments showed that the
prevalence of conditions where application-generic recovery cannot be
guaranteed is high enough to make application-generic recovery unusable in
general.

A weakness of the paper is its focus on crash events.  It does not cover
failures that cause incorrect results but do not cause a crash.  Incorrect
results are mentioned only briefly in discussing the results of the
experiments.  Incorrect results, in many cases, can be just as bad as
crashes, if not worse.

The assumption of perfect knowledge of each process's crash events when
examining dangerous paths also limits the work to examining violations of
the two correctness conditions in hindsight after a crash.  Because of this
assumption, the paper does not discuss predictive analysis within a running
application, or whether such analysis is even possible.

As mentioned in the paper, application-generic recovery is a very tempting
goal.  The results in this paper emphasize the need for applications to
provide their own failure recovery mechanisms, since recovery cannot be
provided transparently and correctly to them.  With more users expecting
failure recovery, this has big implications for application development.

An interesting tidbit was the estimation of the proportion of Heisenbugs and
Bohr bugs in real applications.  If this estimate is valid across most
software, there's a lot of room for improvement in software reliability.