Exploring Failure Transparency & the Limits of Generic Recovery
Lowell, Chandra, Chen

Paper's Contribution: A look at the feasibility of providing failure transparency at the operating system level.

Paper's Critical Ideas: The paper investigates whether failure transparency can be provided to application users by implementing failure detection and recovery at the operating system level. The authors first ask what would be required to guarantee failure transparency and conclude that two invariants are needed. First, an application must periodically save enough of its state to allow recovery, so that "the user does not discern failures"; they call this the "Save-Work" invariant. Second, an application must not save so much of its state that recovery always leads back to the same failure; they call this the "Lose-Work" invariant. The paper also profiles the performance of a handful of interactive applications running on top of two run-time packages the authors developed for "stop failure" recovery, Discount Checking and DC-disk, and presents the results.

Paper's Flaw(s): Of all the papers we've read so far, this is by far the poorest. The paper aims at a kind of utopian operating system that creates an illusion of failure-free operation for application users. In reality this is impossible, yet the authors never say so up front. Instead they make statements like: "To provide this illusion, the operating system must handle all hardware, software and application failures to keep them from affecting what the user sees." There is no way the operating system could ever handle ALL hardware failures; suppose, for example, that the CPU or motherboard were melting inside its case.
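To make the two invariants from the critical-ideas summary concrete, here is a minimal sketch of the Save-Work idea: commit state to a checkpoint before any user-visible output depends on it, so a crash can be recovered without the user noticing. This is a hypothetical illustration, not the paper's Discount Checking implementation; all names (`commit`, `recover`, `handle_input`) are invented for the example.

```python
import pickle

checkpoint = None  # in-memory stand-in for stable storage (illustration only)

def commit(state):
    """Save-Work: persist enough state that recovery is invisible to the user."""
    global checkpoint
    checkpoint = pickle.dumps(state)

def recover():
    return pickle.loads(checkpoint)

state = {"doc": ""}

def handle_input(ch):
    # A non-deterministic event (user keystroke) arrives and updates state.
    state["doc"] += ch
    # Save-Work: commit BEFORE producing output the user can observe,
    # so a crash after the output can still be recovered transparently.
    commit(state)
    print("echo:", state["doc"])  # user-visible output

handle_input("h")
handle_input("i")

# Simulated stop failure: process state is lost; recovery reloads the
# last committed checkpoint, and the user sees no lost work.
state = recover()
assert state["doc"] == "hi"
# Lose-Work is the flip side: if the commit had captured the event that
# *caused* the crash, recovery would replay it and fail again, so commits
# must sometimes deliberately lose recent work.
```

The ordering (commit, then output) is the essential point: output released before the commit is exactly the work a failure would make visible to the user.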
Beyond that, the mechanisms this paper presents for recovering application state from "stop failures" by restoring state from "stable storage" could themselves fail, say if the storage hardware failed partway through committing a transaction. For example, a disk head could crash onto the platter and corrupt bits mid-write. Storage redundancy, such as RAID, can reduce the risk, but total storage failure remains possible, and the authors never confront these issues. Such hardware failures exist even in the face of two-phase commit. The operating system will never be able to hide all failures, or even all "stop failures".

The paper also coins ridiculous oxymoronic terms like "fixed non-deterministic events" and makes foolish comments such as "xpilot is able to sustain a usable 9 frames per second". I can't recall a real-time application that could succeed in a free market running at only 9 fps, and I can't imagine this type of failure transparency being useful for the vast majority of real-time applications.

Additionally, the authors fail to consider that, since not 100% of failures can be recovered from, applications will still need some fault detection of their own. And it is quite prudent for an application that is not certain the system's state is correct to notify its users, so they can reason about the best method of recovery. I can't believe a published article contains so many shortcomings; given that reviewers were acknowledged, I can only surmise that they never got around to actually reading this paper. I would go on, but I think the point has been made.

Relevance to Modern Systems: Any system must somehow deal with unexpected failures, whether the failures are due to hardware or software faults.
The more that lower-level mechanisms, such as the operating system, can detect and recover from failures, the easier it becomes for application developers to create new and more useful systems.