% undo & selective replay

# administrivia

paper questions/discussion

lab 1

# building systems that will fail

transient failures: hardware, concurrency, ...

deterministic failures: memory/logic bugs, ...

attacks & human mistakes

# tolerating transient failures

redundancy: independent failures

error-correcting codes: communication, Facebook (VLDB'13)
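
a minimal sketch of the simplest erasure code, single XOR parity, to make the redundancy idea concrete. this is not the code construction from the VLDB'13 paper; `xor_blocks` and `recover_missing` are hypothetical names, and all blocks are assumed to have equal length.

```python
from functools import reduce

def xor_blocks(blocks):
    """bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def recover_missing(surviving_blocks, parity):
    """rebuild the single lost data block from the survivors plus the parity block."""
    return xor_blocks(surviving_blocks + [parity])

data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)            # stored on a separate, independently failing node
rebuilt = recover_missing([data[0], data[2]], parity)  # data[1] was lost
```

single parity tolerates only one lost block; tolerating more needs stronger codes.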

reboot: last week's guest lecture

restore & replay: redo log
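
a rough sketch of restore & replay with a redo log, assuming a toy in-memory key-value store. `KVStore` and its methods are hypothetical, and a real system would keep the checkpoint and log on stable storage.

```python
import copy

class KVStore:
    """toy store with a checkpoint and a redo log (illustrative only)."""

    def __init__(self):
        self.state = {}
        self.checkpoint = {}
        self.redo_log = []  # operations applied since the last checkpoint

    def put(self, key, value):
        self.redo_log.append(("put", key, value))  # log first, then apply
        self.state[key] = value

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.redo_log.clear()  # older operations are covered by the checkpoint

    def recover(self):
        """restore the checkpoint, then replay every logged operation in order."""
        state = copy.deepcopy(self.checkpoint)
        for op, key, value in self.redo_log:
            if op == "put":
                state[key] = value
        self.state = state

store = KVStore()
store.put("a", 1)
store.take_checkpoint()
store.put("b", 2)
store.state = {}   # simulate losing volatile state in a crash
store.recover()
```

replay reproduces the pre-crash state exactly, which is why it helps with transient failures but not with a bug that the same log would trigger again.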

replication: hypervisor (SOSP'95), Mars code (first class), Project Zap (PLDI'07)

microreboot (OSDI'04), shadow driver (OSDI'04)

# tolerating deterministic failures

cannot simply restore & replay (the same bug fires again) - need to change something

change environment: Rx (SOSP'05)

change/drop input:
Shield (SIGCOMM'04), Vigilante (SOSP'05), Bouncer (SOSP'07),
SOAP (ICSE'12), SIFT (POPL'14)

change code:
failure-oblivious computing (OSDI'04),
rescue points (ASPLOS'09)
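
a simplified sketch of the failure-oblivious idea: discard out-of-bounds writes and manufacture a value for out-of-bounds reads instead of crashing. the OSDI'04 work does this in a C compiler; `ObliviousBuffer` is a hypothetical stand-in, and returning one fixed default is an assumption made here for brevity.

```python
class ObliviousBuffer:
    """fixed-size buffer that keeps running through out-of-bounds accesses."""

    def __init__(self, size, default=0):
        self._data = [default] * size
        self._default = default

    def read(self, index):
        if 0 <= index < len(self._data):
            return self._data[index]
        return self._default  # manufactured value for an invalid read

    def write(self, index, value):
        if 0 <= index < len(self._data):
            self._data[index] = value
        # invalid write: silently discarded, neighbouring data stays intact

buf = ObliviousBuffer(4)
buf.write(2, 7)
buf.write(10, 99)   # would be a buffer overflow; dropped instead
```

the program continues, but its output may no longer be what the programmer intended.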

# attacks & human mistakes

restore to a good state, find out what went wrong, & recover

how to recover?

compensation:
cancel orders (2010 Flash Crash),
free credit monitoring (UMD data breach)

selective replay: compute what the system state should have been

# selective replay (rewrite history)

suppose one can go back to add/remove operations

git rebase

retroactive data structures (SODA'04)
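
a naive sketch of the retroactive interface: insert or delete an operation at a past time, then ask what the present looks like. `RetroactiveDict` is hypothetical and answers queries by replaying the whole history; the SODA'04 paper is about doing far better than that. timestamps are assumed to be unique.

```python
import bisect

class RetroactiveDict:
    """key-value store whose history of puts can be edited after the fact."""

    def __init__(self):
        self._ops = []  # (time, key, value), kept sorted by time

    def insert_op(self, time, key, value):
        """retroactively add put(key, value) at the given time."""
        bisect.insort(self._ops, (time, key, value))

    def delete_op(self, time):
        """retroactively remove the operation performed at the given time."""
        self._ops = [op for op in self._ops if op[0] != time]

    def query_now(self):
        """present state: replay the edited history from the beginning."""
        state = {}
        for _, key, value in self._ops:
            state[key] = value
        return state

d = RetroactiveDict()
d.insert_op(10, "x", 1)
d.insert_op(30, "x", 3)
d.insert_op(20, "x", 2)    # inserted into the past, before time 30
before = d.query_now()
d.delete_op(30)            # rewrite history: the put at time 30 never happened
after = d.query_now()
```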

intrusion recovery:
Operator Undo (today's paper),
Taser (SOSP'05),
undo computing (Retro, OSDI'10)
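
a loose sketch of the dependency tracking that intrusion recovery relies on: given a time-ordered log of operations with the objects they read and wrote, find everything that transitively depends on the attack. this is not the algorithm of any of the systems above; `affected_ops` and the log format are assumptions, and the tracking is conservative (an object stays tainted even if later overwritten).

```python
def affected_ops(log, attack_id):
    """return ids of operations that depend, directly or not, on the attack."""
    tainted_objects = set()
    affected = []
    for op_id, reads, writes in log:  # log must be in execution order
        if op_id == attack_id or tainted_objects & set(reads):
            affected.append(op_id)
            tainted_objects |= set(writes)  # its outputs are now suspect
    return affected

# (op id, objects read, objects written); file names are made up
log = [
    (1, [], ["/etc/passwd"]),                           # the attack
    (2, ["/etc/passwd"], ["/home/eve/key"]),            # reads attacker's data
    (3, ["/home/alice/notes"], ["/home/alice/notes"]),  # unrelated
    (4, ["/home/eve/key"], ["/tmp/out"]),               # depends on op 2
]
to_undo = affected_ops(log, attack_id=1)
```

only the affected operations need to be undone or re-executed; unrelated work such as op 3 is preserved.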

# selective replay (different code)

patch validation:
delta execution (ASPLOS'09),
Tachyon (USENIXSEC'12),
mutable replay (ASPLOS'13)
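
a minimal sketch of patch validation by replay: run recorded inputs through the old and the patched code and report where the outputs diverge. the systems above work at the level of processes or system calls; `validate_patch`, `old_quota`, and `new_quota` are hypothetical.

```python
def validate_patch(old_handler, new_handler, recorded_inputs):
    """replay recorded inputs through both versions; return inputs whose outputs differ."""
    divergent = []
    for request in recorded_inputs:
        if old_handler(request) != new_handler(request):
            divergent.append(request)
    return divergent

def old_quota(request_size):
    return request_size             # bug: no upper bound

def new_quota(request_size):
    return min(request_size, 1024)  # patch: cap the allocation

diffs = validate_patch(old_quota, new_quota, [10, 2048, 512])
```

inputs with identical outputs suggest the patch did not change behavior for them; divergent ones need a closer look.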

intrusion recovery:
undo computing (Warp, SOSP'11)

info leak detection:
undo computing (Rail, OSDI'14)

# questions

what components do we trust? can we do better?

what applications would be a good fit - server, mobile, etc.?

is undo & selective replay practical?

how can we help identify damage & recover in general?