Exploring Failure Transparency and the Limits of Generic Recovery
David E. Lowell, Subhachandra Chandra, Peter M. Chen

This paper examines whether generic failure recovery services can be provided to applications in the face of both fail-stop and propagated failures. It concludes that generic recovery is not possible.

The paper introduces the Save-work and Lost-work theorems, which define the requirements for successful recovery from fail-stop and propagated failures in terms of non-deterministic and commit events. The Save-work theorem states that consistent recovery from stop failures requires that each non-deterministic event that causally precedes a visible (external to the system) or commit event be committed before that visible or commit event. The Lost-work theorem states that application-generic recovery is not possible if a commit event occurs on a dangerous path, where a dangerous path is a path that can lead to a crash. Application-generic recovery is shown not to be possible because the Save-work and Lost-work conditions cannot both hold in all cases. This is proved by example.

To estimate how often real applications violate the combination of the Save-work and Lost-work conditions, the authors ran several applications on an instrumented system. The experiments showed that conditions under which application-generic recovery cannot be guaranteed are common enough to make application-generic recovery unusable in general.

A weakness of the paper is its focus on crash events. It does not cover failures that produce incorrect results without causing a crash; these are mentioned only briefly in the discussion of the experimental results. Incorrect results can, in many cases, be just as bad as crashes, if not worse.
The assumption of perfect knowledge of each process's crash events when examining dangerous paths also limits the work to examining violations of the two correctness conditions in hindsight, after a crash. Because of this assumption, the paper cannot discuss predictive analysis within a running application, or whether such analysis is even possible.

As the paper notes, application-generic recovery is a very tempting goal. The results emphasize the need for applications to provide their own failure recovery mechanisms, since recovery cannot be provided to them both transparently and correctly. With more users expecting failure recovery, this has big implications for application development.

An interesting tidbit was the estimate of the proportion of Heisenbugs and Bohr bugs in real applications. If this estimate holds across most software, there is a lot of room for improvement in software reliability.
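
To make the Save-work condition summarized earlier more concrete, here is a minimal sketch of how it might be checked on a recorded event trace. It is not taken from the paper. It assumes a single process, so that "causally precedes" reduces to "occurs earlier in the trace", and the event labels and the function name save_work_violations are hypothetical names chosen for illustration.

```python
from typing import List

# Hypothetical event labels for a single-process trace (not from the paper).
ND = "nd"            # non-deterministic event (e.g. user input, timing)
COMMIT = "commit"    # event that saves the process state
VISIBLE = "visible"  # event observable outside the system
OTHER = "other"      # any deterministic, internal event


def save_work_violations(trace: List[str]) -> List[int]:
    """Return the indices of visible events that are reached while an
    earlier non-deterministic event has not yet been committed.

    Simplification: with a single process, a commit event covers every
    non-deterministic event before it, so only visible events can violate
    the condition in this sketch.
    """
    uncommitted_nd = False
    violations = []
    for i, kind in enumerate(trace):
        if kind == ND:
            uncommitted_nd = True
        elif kind == COMMIT:
            uncommitted_nd = False
        elif kind == VISIBLE and uncommitted_nd:
            violations.append(i)
    return violations


if __name__ == "__main__":
    print(save_work_violations([ND, COMMIT, VISIBLE]))
    print(save_work_violations([ND, VISIBLE, COMMIT, VISIBLE]))
```

Under these assumptions, a trace that commits after every non-deterministic event never violates the condition. That same eager committing is what can place a commit on a dangerous path, which is the conflict with the Lost-work condition that the review describes.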