% undo & selective replay

# administrivia

- paper questions/discussion
- lab 1

# building systems that will fail

- transient failures: hardware, concurrency, ...
- deterministic failures: memory/logic bugs, ...
- attacks & human mistakes

# tolerating transient failures

- redundancy: independent failures
- error-correcting codes: communication, Facebook (VLDB'13)
- reboot: last week's guest lecture
- restore & replay: redo log
- replication: hypervisor (SOSP'95), Mars code (first class), Project Zap (PLDI'07)
- microreboot (OSDI'04), shadow driver (OSDI'04)

# tolerating deterministic failures

- cannot simply restore & replay - need to change something
- change environment: Rx (SOSP'05)
- change/drop input: Shield (SIGCOMM'04), Vigilante (SOSP'05), Bouncer (SOSP'07), SOAP (ICSE'12), SIFT (POPL'14)
- change code: failure-oblivious computing (OSDI'04), rescue points (ASPLOS'09)

# attacks & human mistakes

- restore to good state & find out what went wrong & recover
- how to recover?
    - compensation: cancel orders (2010 Flash Crash), free credit monitoring (UMD data breach)
    - selective replay: what the system should have been like

# selective replay (rewrite history)

- suppose one can go back to add/remove operations
- git rebase
- retroactive data structures (SODA'04)
- intrusion recovery: Operator Undo (today's paper), Taser (SOSP'05), undo computing (Retro, OSDI'10)

# selective replay (different code)

- patch validation: delta execution (ASPLOS'09), tachyon (USENIXSEC'12), mutable replay (ASPLOS'13)
- intrusion recovery: undo computing (Warp, SOSP'11)
- info leak detection: undo computing (Rail, OSDI'14)

# q

- what components do we trust? can we do better?
- what applications would be good - server, mobile, etc.?
- is undo & selective replay practical?
- how to help identify damages & recover in general?
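
A minimal sketch of selective replay by rewriting history: restore a checkpoint, drop the unwanted operation from the log, and re-execute the rest to get the state the system "should have been" in. This is only an illustration of the idea, not the mechanism used by Operator Undo, Taser, or Retro. The names `Op`, `deposit`, `transfer`, and `replay`, and the toy account state, are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

State = Dict[str, int]  # toy state: account name -> balance (assumed for illustration)


@dataclass(frozen=True)
class Op:
    """One logged operation (hypothetical record format)."""
    op_id: int
    description: str
    apply: Callable[[State], None]


def deposit(account: str, amount: int) -> Callable[[State], None]:
    """Build an operation that adds `amount` to `account`."""
    def apply(state: State) -> None:
        state[account] = state.get(account, 0) + amount
    return apply


def transfer(src: str, dst: str, amount: int) -> Callable[[State], None]:
    """Build an operation that moves `amount` from `src` to `dst`."""
    def apply(state: State) -> None:
        state[src] = state.get(src, 0) - amount
        state[dst] = state.get(dst, 0) + amount
    return apply


def replay(checkpoint: State, log: List[Op],
           skip: FrozenSet[int] = frozenset()) -> State:
    """Restore a copy of the checkpoint and re-execute the log in order,
    leaving out any operation whose id is in `skip`."""
    state = dict(checkpoint)  # never mutate the saved checkpoint
    for op in log:
        if op.op_id in skip:
            continue
        op.apply(state)
    return state


checkpoint = {"alice": 100, "bob": 50}
log = [
    Op(1, "alice deposits 20", deposit("alice", 20)),
    Op(2, "attacker moves 80 from alice to mallory", transfer("alice", "mallory", 80)),
    Op(3, "bob pays alice 10", transfer("bob", "alice", 10)),
]

actual = replay(checkpoint, log)                         # history as it happened
repaired = replay(checkpoint, log, skip=frozenset({2}))  # history without the attack

print(actual)
print(repaired)
```

The sketch assumes every operation is deterministic and that re-executing the whole log is affordable. It also ignores effects that have already left the system (e.g. an email sent on the basis of the bad state), which is where compensation comes in.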