Section 2.5 -- the business of non-determinism helping seems mildly insane, because they are saying that we might be better off losing the very state we want to save, precisely because we cannot get it back. I suppose the real point is that we can do this if visible events are rare, so that we can run as long as possible saving nothing and get a different execution in the event of a failure.

Exploring Failure Transparency and the Limits of Generic Recovery
Lowell, Chandra, and Chen

This paper focuses on the challenge of transparently recovering from software faults in both centralized and distributed systems. The basic technique is checkpointing (saving all application state) and restarting (recovering by jumping back in time to the last checkpoint and replaying the application's actions from that point). Although the application is thus somewhat involved in the process, the techniques described are fully automated, so one could imagine either inserting them at compilation or having the OS stop program execution (potentially after each operation) to perform the required actions.

For the sake of non-circularity, I'll hit on the weaknesses first and the strengths later. As I see it, there are two major weaknesses, one in the work itself and one in the presentation.

From a presentation point of view, they completely ignore the question of whether transparency is the right goal in the first place. Transparency clearly makes sense for dealing with program bugs (if the programmer were aware of the problem, there wouldn't be a problem) and for truly transient errors, but for many sorts of environmental problems there are ways a program could react that are better than backing out and trying again. The authors seem to view overhead as the only reason to avoid transparency when it might be achievable.

As for the work itself, my only real concern is that the assumption that it is acceptable to duplicate visible events is overly restrictive. If the application is dealing directly with the user, I would expect this to be a serious UI issue, especially if the duplicates can be widely spaced in time. I understand that the output can be tagged so that duplicates can be recognized, and that this typically solves the problem when the reader is another computer application, but I imagine a human user would demand a filter to suppress the duplicates, and that filter would itself become a source of failure (especially since it has to remember pretty much everything that has been passed to the user).

The important thing I learned from this paper is a far better appreciation for the balance between exact recovery and the flexibility to avoid recurring failures. I find compelling the notion that it is sometimes better to lose a great deal of work, with no hope of perfect recovery, than to have a more precise system become hopelessly stuck.

The distinction between fixed and transient non-determinism is an interesting way of looking at the problem. It seems that you could take this one step further, however, by having the system attempt to break fixed non-determinism, particularly after a failed recovery. This would mean pushing back (either to the user, to the OS, or to other processes) the information that a fixed non-deterministic event is breaking the recovery process, so that the party that originated the event could consider altering the computation to avoid the dangerous path; the sketch below shows roughly what I have in mind.
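To keep the pieces straight in my own head, here is a minimal Python sketch of the checkpoint-and-replay loop as I understand it, extended with the break-fixed-non-determinism idea from the previous paragraph. It is entirely my own toy -- RecoveryManager, next_event, fresh_source, and the crash-on-7 rule are all made up for illustration, not anything from the paper -- but it shows how a logged non-deterministic value can bake the crash into every faithful replay, and how discarding the log after a failed recovery pushes the decision back to the originating source so the next execution can take a different path.

```python
import copy
import random


class RecoveryManager:
    """Toy checkpoint/replay bookkeeping -- illustrative only, not the paper's system."""

    def __init__(self):
        self.checkpoint = None   # last saved application state
        self.event_log = []      # non-deterministic values seen since that checkpoint
        self.crash_count = 0     # crashes so far: the original run plus failed recoveries

    def take_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)
        self.event_log = []      # the log only needs to cover events after the checkpoint

    def next_event(self, cursor, source):
        """Return the next non-deterministic value plus the advanced log cursor.

        While logged values remain, recovery replays them so re-execution follows
        the original run; past the end of the log (or after it has been discarded)
        we go back to the originating source and log the fresh value.
        """
        if cursor < len(self.event_log):
            return self.event_log[cursor], cursor + 1
        value = source()
        self.event_log.append(value)
        return value, cursor + 1


def fresh_source():
    # Stand-in for any non-deterministic input: user input, message arrival, a timer, ...
    return random.randint(0, 9)


def run(mgr, state):
    """One execution attempt; raises when the computation wanders onto the bad path."""
    cursor = 0
    while state["step"] < 5:
        value, cursor = mgr.next_event(cursor, fresh_source)
        if value == 7:           # pretend this particular value drives us into a bug
            raise RuntimeError("application crashed")
        state["step"] += 1
        state["total"] += value
    return state


mgr = RecoveryManager()
mgr.take_checkpoint({"step": 0, "total": 0})

while True:
    try:
        print("finished with state:", run(mgr, copy.deepcopy(mgr.checkpoint)))
        break
    except RuntimeError:
        mgr.crash_count += 1
        if mgr.crash_count >= 2:
            # Faithful replay just reproduced the crash, so break the "fixed"
            # non-determinism: discard the log and let the originating sources
            # produce different values on the next attempt.
            mgr.event_log.clear()
```

The knob worth noticing is when to give up on the log: keeping it preserves as much work as possible, while discarding it trades work for the chance of a different, non-crashing execution -- exactly the balance between exact recovery and flexibility discussed above.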
This idea of course ties back to the question of whether transparency is the right goal, since it throws transparency out the window in the hope of exposing additional flexibility that can be used to route around otherwise insurmountable obstacles.

The other contribution is to better pin down why propagation failures are so much harder to deal with than fail-stop failures. What is interesting is that it has little to do with being able to trace back from the crash to the cause of the problem (which is why it is hard for *programmers* to deal with these faults) and much more to do with the increased chance of the failure becoming ingrained in the "correct" backup system state before the problem manifests itself. I wonder if a more complete history would help (if at first you don't succeed, try backing off to an older checkpoint; see the sketch at the end of this review). Note that this again would violate some of the guarantees that the save-work invariant gives us.

Relevance today: all sorts (not surprising for a recent paper). Designing zero-failure systems is hard, so understanding how to build a failure-recovery system is the buggy-program equivalent of attacking hardware faults through replication in distributed systems: make the system do what the individual components (in this case the application itself) cannot.
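Here is the back-off idea from above as a small sketch, again entirely mine and with hypothetical names (recover_with_backoff, replay): keep a short history of checkpoint generations and, each time recovery from the newest one fails, fall back one generation. The corrupted newest checkpoint in the example is meant to stand in for a propagation failure that became ingrained in the "correct" backup state before the crash showed up.

```python
import copy


def recover_with_backoff(checkpoints, replay):
    """Try the newest checkpoint first; after each failed recovery, back off a generation.

    `checkpoints` is ordered oldest -> newest; `replay(state)` re-runs the application
    from a checkpointed state and raises if it crashes again. Backing off further loses
    more work, but improves the odds of starting from a state the fault has not yet
    contaminated.
    """
    for state in reversed(checkpoints):
        try:
            return replay(copy.deepcopy(state))
        except RuntimeError:
            continue              # this generation is (or leads straight back to) a bad state
    raise RuntimeError("no retained checkpoint led to a successful re-execution")


# Hypothetical usage: three generations of saved state, where the newest checkpoint was
# taken after the fault had already corrupted the counter but before the crash appeared.
history = [
    {"counter": 10},              # oldest, still healthy
    {"counter": 20},              # healthy
    {"counter": -1},              # newest, silently corrupted
]

def replay(state):
    if state["counter"] < 0:      # the corrupted state keeps reproducing the crash
        raise RuntimeError("crash reproduced")
    state["counter"] += 5         # pretend we redo the lost work
    return state

print(recover_with_backoff(history, replay))   # -> {'counter': 25}
```

The toy also shows why this is only a partial answer: if the corruption predates even the oldest retained checkpoint, no amount of backing off helps, and all of the intervening work is lost anyway.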